This blog post presents the latest papers retrieved from Arxiv.org on 2026-04-13, updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: Paper data is fetched from Arxiv.org daily, with an automatic scheduled update around 12:30 each day.

Tip: If the list is not updated on a given day, either Arxiv released no new papers that day or the update script failed. Failures will be fixed the same day whenever possible.

Table of Contents

Overview (2026-04-13)

A total of 580 papers were updated today, including:

  • Natural Language Processing: 90 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 181 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 147 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 184 papers (Machine Learning (cs.LG))
  • Multi-Agent Systems: 16 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 23 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 32 papers (Human-Computer Interaction (cs.HC))

Multi-Agent Systems

[MA-0] Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL

[Quick Read]: This paper addresses the Sim2Real gap faced when transferring Multi-Agent Reinforcement Learning (MARL) policies from simulated cyber attack-defense environments to deployment in real Security Operations Centers (SOCs). Conventional simulators abstract away network protocol physics, rely on synchronous time steps, and provide clean state vectors, so the policies they produce transfer poorly to real settings. The key to the solution is NetForge_RL, a high-fidelity cyber defense simulation platform that models network defense as an asynchronous, continuous-time Partially Observable Semi-Markov Decision Process (POSMDP) and introduces a dual-mode engine that seamlessly bridges high-throughput MARL training and zero-shot evaluation against live attacks. The authors also design Continuous-Time Graph MARL (CT-GMARL), which uses fixed-step Neural Ordinary Differential Equations (ODEs) to process irregularly sampled alert data, substantially improving policy generalization and defensive effectiveness in real environments.

Link: https://arxiv.org/abs/2604.09523
Authors: Igor Jankowski
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 26 pages, 14 figures, 5 tables

Abstract:The transition of Multi-Agent Reinforcement Learning (MARL) policies from simulated cyber wargames to operational Security Operations Centers (SOCs) is fundamentally bottlenecked by the Sim2Real gap. Legacy simulators abstract away network protocol physics, rely on synchronous ticks, and provide clean state vectors rather than authentic, noisy telemetry. To resolve these limitations, we introduce NetForge_RL: a high-fidelity cyber operations simulator that reformulates network defense as an asynchronous, continuous-time Partially Observable Semi-Markov Decision Process (POSMDP). NetForge enforces Zero-Trust Network Access (ZTNA) constraints and requires defenders to process NLP-encoded SIEM telemetry. Crucially, NetForge bridges the Sim2Real gap natively via a dual-mode engine, allowing high-throughput MARL training in a mock hypervisor and zero-shot evaluation against live exploits in a Docker hypervisor. To navigate this continuous-time POSMDP, we propose Continuous-Time Graph MARL (CT-GMARL), utilizing fixed-step Neural Ordinary Differential Equations (ODEs) to process irregularly sampled alerts. We evaluate our framework against discrete baselines (R-MAPPO, QMIX). Empirical results demonstrate that CT-GMARL achieves a converged median Blue reward of 57,135 - a 2.0x improvement over R-MAPPO and 2.1x over QMIX. Critically, CT-GMARL restores 12x more compromised services than the strongest baseline by avoiding the “scorched earth” failure mode of trivially minimizing risk by destroying network utility. On zero-shot transfer to the live Docker environment, CT-GMARL policies achieve a median reward of 98,026, validating the Sim2Real bridge.
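
The fixed-step Neural ODE idea in CT-GMARL — evolving a hidden state through the irregular gaps between alerts by integrating a learned dynamics function with a fixed number of internal sub-steps — can be sketched as follows. This is a minimal illustration, not the paper's architecture: the tanh dynamics, the additive alert update, and the hidden size are assumptions.

```python
import numpy as np

def ode_step(h, W, dt, n_steps=4):
    """Integrate dh/dt = tanh(W @ h) over an interval dt using
    n_steps fixed Euler sub-steps (the fixed-step Neural ODE idea)."""
    sub = dt / n_steps
    for _ in range(n_steps):
        h = h + sub * np.tanh(W @ h)
    return h

def encode_alerts(alerts, W, hidden_dim=8):
    """Fold irregularly sampled (timestamp, feature_vec) alerts into one
    hidden state, advancing the ODE across each silent gap."""
    h = np.zeros(hidden_dim)
    t_prev = 0.0
    for t, x in alerts:
        h = ode_step(h, W, t - t_prev)  # evolve through the gap
        h = h + x                       # absorb the new alert (additive update)
        t_prev = t
    return h

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((8, 8))
alerts = [(0.5, rng.standard_normal(8)), (0.9, rng.standard_normal(8)),
          (3.7, rng.standard_normal(8))]  # note the irregular spacing
state = encode_alerts(alerts, W)
print(state.shape)  # (8,)
```

Because the integration step adapts to each inter-arrival time, no resampling onto a synchronous tick grid is needed.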

[MA-1] Strategic Algorithmic Monoculture: Experimental Evidence from Coordination Games

[Quick Read]: This paper studies coordination mechanisms among agents in multi-agent settings, specifically distinguishing and quantifying the effects of primary algorithmic monoculture versus strategic algorithmic monoculture on coordination outcomes. The key to the solution is a simple experimental design that cleanly separates these two mechanisms, deployed on both human and large language model (LLM) subjects. The study finds that LLMs exhibit high baseline action similarity and, like humans, adjust their behavior in response to coordination incentives, but lag behind humans when rewards require sustaining heterogeneity.

Link: https://arxiv.org/abs/2604.09502
Authors: Gonzalo Ballestero, Hadi Hosseini, Samarth Khanna, Ran I. Shorrer
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
Comments:

Abstract:AI agents increasingly operate in multi-agent environments where outcomes depend on coordination. We distinguish primary algorithmic monoculture – baseline action similarity – from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives. We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects. LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, they regulate it in response to coordination incentives (strategic monoculture). While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.

[MA-2] Risk-seeking conservative policy iteration with agent-state based policies for Dec-POMDPs with guaranteed convergence

[Quick Read]: This paper addresses the problem of finding good policies with a limited number of memory states (agent states) for Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) under constrained compute. Since optimal solutions generally depend on the full history of observations and actions, while practical deployments face memory limits, the goal is to approach optimal performance within a finite agent-state class. The key to the solution is an iterated best response algorithm that combines a modified objective incentivizing risk-seeking with a conservative policy iteration update, guaranteeing monotonic improvement and convergence to a local optimum in polynomial time. Experiments show near-optimal performance on several benchmarks, with performance improving further as more agent states are made available, offering a new way to model and optimize Dec-POMDPs under memory constraints.

Link: https://arxiv.org/abs/2604.09495
Authors: Amit Sinha, Matthieu Geist, Aditya Mahajan
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments:

Abstract:Optimally solving decentralized decision-making problems modeled as Dec-POMDPs is known to be NEXP-complete. These optimal solutions are policies based on the entire history of observations and actions of an agent. However, some applications may require more compact policies because of limited compute capabilities, which can be modeled by considering a limited number of memory states (or agent states). While such an agent-state based policy class may not contain the optimal solution, it is still of practical interest to find the best agent-state policy within the class. We focus on an iterated best response style algorithm which guarantees monotonic improvements and convergence to a local optimum in polynomial runtime in the Dec-POMDP model size. In order to obtain a better local optimum, we use a modified objective which incentivizes risk-seeking alongside a conservative policy iteration update. Our empirical results show that our approach performs as well as state-of-the-art approaches on several benchmark Dec-POMDPs, achieving near-optimal performance while having polynomial runtime despite the limited memory. We also show that using more agent states (a larger memory) leads to greater performance. Our approach provides a novel way of incorporating memory constraints on the agents in the Dec-POMDP problem.
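
The conservative policy iteration update mentioned above — mixing the current policy toward the greedy policy instead of replacing it outright — can be sketched in a minimal tabular form. The mixing rate and single-agent setting are assumptions; the paper's risk-seeking objective is omitted.

```python
import numpy as np

def conservative_update(pi, q_values, alpha=0.2):
    """One conservative policy iteration step: blend the current
    stochastic policy with the greedy policy w.r.t. Q, rather than
    jumping to the greedy policy directly.
    pi: (n_states, n_actions) row-stochastic; q_values: same shape."""
    greedy = np.zeros_like(pi)
    greedy[np.arange(pi.shape[0]), q_values.argmax(axis=1)] = 1.0
    return (1 - alpha) * pi + alpha * greedy

pi = np.full((3, 2), 0.5)  # uniform policy over 2 actions in 3 states
q = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
pi = conservative_update(pi, q)
print(pi)  # rows still sum to 1; mass shifted toward greedy actions
```

The small step size is what buys the monotonic-improvement guarantee that the abstract refers to.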

[MA-3] Decentralized Opinion-Integrated Decision making at Unsignalized Intersections via Signed Networks

[Quick Read]: This paper addresses decentralized decision making for Connected Autonomous Vehicles (CAVs) at unsignalized intersections, where existing centralized approaches scale poorly under mixed maneuver intentions and coordinator failure. The key to the solution is a closed-loop opinion-dynamics decision model in which vehicles exchange intent over dual signed networks: a communication network based on the conflict topology and a commitment-driven belief network, enabling cooperation without a central coordinator. Continuous opinion states modulate the weights of each vehicle's velocity optimizer prior to commitment; a closed-form predictive feasibility gate then freezes each vehicle's decision into a "GO" or "YIELD" commitment, which propagates through the belief network to pre-condition neighboring vehicles' behavior and avoid physical conflicts. The crossing order emerges naturally from geometric feasibility and arrival priority, with no joint optimization or solver; validation shows collision-free coordination and lower last-vehicle exit times than first-come-first-served (FCFS) across diverse, complex conflict topologies.

Link: https://arxiv.org/abs/2604.09351
Authors: Bhaskar Varma, Ying Shuai Quan, Karl D. von Ellenrieder, Paolo Falcone
Affiliations: Free University of Bozen-Bolzano; Chalmers University of Technology
Categories: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: Submitted to CDC 2026 with L-CSS Parallel option

Abstract:In this letter, we consider the problem of decentralized decision making among connected autonomous vehicles at unsignalized intersections, where existing centralized approaches do not scale gracefully under mixed maneuver intentions and coordinator failure. We propose a closed-loop opinion-dynamic decision model for intersection coordination, where vehicles exchange intent through dual signed networks: a conflict topology based communication network and a commitment-driven belief network that enable cooperation without a centralized coordinator. Continuous opinion states modulate velocity optimizer weights prior to commitment; a closed-form predictive feasibility gate then freezes each vehicle’s decision into a GO or YIELD commitment, which propagates back through the belief network to pre-condition neighbor behavior ahead of physical conflicts. Crossing order emerges from geometric feasibility and arrival priority without the use of joint optimization or a solver. The approach is validated across three scenarios spanning fully competitive, merge, and mixed conflict topologies. The results demonstrate collision-free coordination and lower last-vehicle exit times compared to first come first served (FCFS) in all conflict non-trivial configurations.
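
The opinion-dynamics mechanism can be illustrated with a minimal sketch: negative edges in the signed network couple conflicting vehicles, continuous opinions evolve until they leave a neutral band, and a threshold then freezes each vehicle into a GO or YIELD commitment. The specific dynamics, gains, and threshold below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def opinion_step(o, A, dt=0.1, d=1.0):
    """One Euler step of saturated opinion dynamics on a signed network:
    do_i/dt = -d*o_i + tanh(sum_j A_ij * o_j).
    A_ij > 0: cooperative coupling; A_ij < 0: conflicting coupling."""
    return o + dt * (-d * o + np.tanh(A @ o))

def commit(o, threshold=0.5):
    """Freeze continuous opinions into discrete commitments once they
    leave the neutral band (a stand-in for the feasibility gate)."""
    return ["GO" if oi > threshold else "YIELD" if oi < -threshold
            else "UNDECIDED" for oi in o]

# two conflicting vehicles plus one cooperatively coupled follower
A = np.array([[0.0, -1.5, 0.0],
              [-1.5, 0.0, 0.0],
              [1.5,  0.0, 0.0]])
o = np.array([0.1, -0.1, 0.0])  # slight initial bias, e.g. arrival priority
for _ in range(200):
    o = opinion_step(o, A)
print(commit(o))  # ['GO', 'YIELD', 'GO']
```

The small initial asymmetry (standing in for arrival priority) is amplified by the antagonistic coupling, so the conflicting pair polarizes into opposite commitments.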

[MA-4] SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation ACL2026

[Quick Read]: This paper addresses stability problems caused by the loss of role consistency in large language models (LLMs) in multi-turn dialogue, such as persona drift, role confusion, and "echoing", which seriously undermine the reliability of LLMs used to generate synthetic dialogue data. The key to the solution is a stability-first framework, SPASM (Stable Persona-driven Agent Simulation for Multi-turn dialogue generation), whose core innovation is Egocentric Context Projection (ECP): dialogue history is stored in a perspective-agnostic representation and deterministically projected into each agent's egocentric view before generation. Without modifying model weights, this substantially improves role consistency over long horizons; experiments show that ECP effectively reduces persona drift and eliminates echoing.

Link: https://arxiv.org/abs/2604.09212
Authors: Han Luo, Guy Laban
Affiliations: University of Leeds; Southwest Jiaotong University; Ben-Gurion University of the Negev
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Accepted to Findings of the Association for Computational Linguistics (ACL 2026). Our code and data are available at this https URL

Abstract:Large language models are increasingly deployed in multi-turn settings such as tutoring, support, and counseling, where reliability depends on preserving consistent roles, personas, and goals across long horizons. This requirement becomes critical when LLMs are used to generate synthetic dialogues for training and evaluation, since LLM–LLM conversations can accumulate identity-related failures such as persona drift, role confusion, and “echoing”, where one agent gradually mirrors its partner. We introduce SPASM (Stable Persona-driven Agent Simulation for Multi-turn dialogue generation), a modular, stability-first framework that decomposes simulation into (i) persona creation via schema sampling, plausibility validation, and natural-language persona crafting, (ii) Client–Responder dialogue generation, and (iii) termination detection for coherent stopping. To improve long-horizon stability without changing model weights, we propose Egocentric Context Projection (ECP): dialogue history is stored in a perspective-agnostic representation and deterministically projected into each agent’s egocentric view before generation. Across three LLM backbones (GPT-4o-mini, DeepSeek-V3.2, Qwen-Plus) and nine Client–Responder pairings, we construct a dataset of 4,500 personas and 45,000 conversations (500 personas X 10 conversations per pairing). Ablations show ECP substantially reduces persona drift and, under human validation, eliminates echoing; embedding analyses recover persona structure and reveal strong responder-driven interaction geometry. Our code is available at this https URL.
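
The ECP idea is simple to sketch: the shared history records only speaker identities and utterances, and each agent receives a deterministic projection in which its own turns become "assistant" turns and its partner's become "user" turns. A minimal sketch (the role labels and record format are assumptions):

```python
def project_egocentric(history, self_id):
    """Project a perspective-agnostic dialogue history into one agent's
    egocentric view: own turns -> 'assistant', partner turns -> 'user'.
    history: list of (speaker_id, utterance) tuples."""
    return [{"role": "assistant" if speaker == self_id else "user",
             "content": text}
            for speaker, text in history]

# shared, perspective-agnostic log
history = [("client", "I keep procrastinating."),
           ("responder", "When did you first notice this pattern?"),
           ("client", "Around the start of this semester.")]

print(project_egocentric(history, "responder")[1]["role"])  # assistant
print(project_egocentric(history, "client")[1]["role"])     # user
```

Because each agent always sees its own past turns labeled as its own, it cannot silently absorb the partner's phrasing as if it were self-generated, which is the failure mode behind echoing.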

[MA-5] MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

[Quick Read]: This paper addresses grounded reasoning in 3D scenes: accurately identifying query-relevant objects and regions in complex 3D environments and reasoning flexibly about their spatial and geometric relationships. Existing methods typically rely on in-domain fine-tuning or hand-crafted reasoning pipelines, limiting zero-shot generalization to novel environments. The key to the solution is MAG-3D, a training-free multi-agent framework that dynamically coordinates three expert agents: a planning agent that decomposes the task and orchestrates the reasoning process, a grounding agent that performs free-form 3D grounding and key-frame retrieval, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This collaborative design enables efficient, flexible, training-free 3D grounded reasoning across diverse scenes.

Link: https://arxiv.org/abs/2604.09167
Authors: Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu, Gao Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:

Abstract:Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.
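
The planner/grounder/coder division of labor can be sketched as a toy pipeline: the planner emits grounding and coding steps, the grounder resolves object references to 3D coordinates, and the coder verifies a geometric relation with executable code. Everything here (the step format, the scene representation, the function names) is an illustrative assumption standing in for VLM calls.

```python
import math

def planning_agent(query):
    """Stand-in planner: decompose the query into sub-steps."""
    return ["ground:mug", "ground:table", "code:distance"]

def grounding_agent(step, scene):
    """Stand-in grounder: look up the 3D centroid of the named object."""
    return scene[step.split(":")[1]]

def coding_agent(points):
    """Stand-in coder: check a geometric relation via executable code."""
    a, b = points
    return math.dist(a, b)

scene = {"mug": (1.0, 0.0, 0.8), "table": (1.0, 2.0, 0.8)}  # toy centroids
grounded, answer = [], None
for step in planning_agent("how far is the mug from the table?"):
    if step.startswith("ground:"):
        grounded.append(grounding_agent(step, scene))
    else:
        answer = coding_agent(grounded)
print(answer)  # 2.0
```

Running the geometric check as code, rather than asking a model to eyeball it, is what gives the verification step its explicitness.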

[MA-6] Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks

[Quick Read]: This paper addresses plasticity loss in deep reinforcement learning policies for post-disaster UAVs serving as Aerial Base Stations (ABS), where drastic shifts in user mobility and traffic demand make the environment non-stationary, causing representation collapse and neuron dormancy that hinder adaptation to changing QoS trade-offs. The key to the solution is Plasticity Enhanced Multi-Agent Mixture of Experts (PE-MAMoE), whose core mechanisms are: (1) each UAV is equipped with a sparsely gated mixture of experts actor in which a router selects a single expert per step; and (2) a non-parametric Phase Controller that, after phase switches, injects brief expert-only stochastic perturbations, resets the action log standard deviation, anneals entropy and learning rate, and schedules the router temperature, re-plasticizing the policy without destabilizing safe behaviors. A theoretical dynamic regret bound shows that tracking error scales with both environment variation and cumulative noise energy, and simulations demonstrate significant improvements across performance metrics, validating the approach.

Link: https://arxiv.org/abs/2604.09028
Authors: Wen Qiu, Zhiqiang He, Wei Zhao, Hiroshi Masui
Affiliations: Kitami Institute of Technology; University of Electro-Communications; Anhui University of Technology
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 20 pages, 12 figures, 3 tables

Abstract:Unmanned aerial vehicles serving as aerial base stations can rapidly restore connectivity after disasters, yet abrupt changes in user mobility and traffic demands shift the quality of service trade-offs and induce strong non-stationarity. Deep reinforcement learning policies suffer from plasticity loss under such shifts, as representation collapse and neuron dormancy impair adaptation. We propose plasticity enhanced multi-agent mixture of experts (PE-MAMoE), a centralized training with decentralized execution framework built on multi-agent proximal policy optimization. PE-MAMoE equips each UAV with a sparsely gated mixture of experts actor whose router selects a single specialist per step. A non-parametric Phase Controller injects brief, expert-only stochastic perturbations after phase switches, resets the action log-standard-deviation, anneals entropy and learning rate, and schedules the router temperature, all to re-plasticize the policy without destabilizing safe behaviors. We derive a dynamic regret bound showing the tracking error scales with both environment variation and cumulative noise energy. In a phase-driven simulator with mobile users and 3GPP-style channels, PE-MAMoE improves normalized interquartile mean return by 26.3% over the best baseline, increases served-user capacity by 12.8%, and reduces collisions by approximately 75%. Diagnostics confirm persistently higher expert feature rank and periodic dormant-neuron recovery at regime switches.
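
The single-expert routing step — a temperature-controlled softmax over expert scores followed by top-1 selection — can be sketched as follows; the Phase Controller's temperature scheduling corresponds to varying `temperature` over training. The scoring details here are assumptions, not the paper's implementation.

```python
import numpy as np

def route(logits, temperature=1.0):
    """Sparsely gated routing: softmax over expert logits at a given
    temperature, then select the single top expert for this step."""
    z = logits / temperature
    z = z - z.max()                      # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(probs.argmax()), probs

logits = np.array([1.2, 0.9, 1.1])       # one score per expert
e_low, p_low = route(logits, temperature=0.1)   # near one-hot gate
e_high, p_high = route(logits, temperature=5.0) # near-uniform gate
print(e_low, e_high)  # same expert wins, but with very different confidence
```

Raising the temperature after a phase switch flattens the gate, letting previously dormant experts receive gradient again before the schedule sharpens it back.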

[MA-7] Social Reality Construction via Active Inference: Modeling the Dialectic of Conformity and Creativity

[Quick Read]: This paper asks how a computational model can capture, within a unified framework, the bidirectional process by which social agents both internalize collective norms and reshape them through creative action. Existing models have not described this dialectical construction of social reality in a single framework. The key to the solution is a multi-agent simulation model based on active inference, in which each agent maintains an internal generative model, communicates with neighboring agents to form social priors, creates novel observations, and selectively incorporates others' creations into memory. This mechanism produces a circular mutual constitution between social representations and the external observation distribution, drives the local construction of cultural niches, and yields the endogenous formation and differentiation of social consensus.

Link: https://arxiv.org/abs/2604.09026
Authors: Kentaro Nomura, Takato Horii
Affiliations: The University of Osaka, Japan; IRCN, The University of Tokyo, Japan
Categories: Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments: Submitted to ALIFE 2026 conference

Abstract:Social agents both internalize collective norms and reshape them through creative action, yet computational models have not captured this bidirectional process within a unified framework. We propose a multi-agent simulation model grounded in active inference that formalizes the dialectical constitution of social reality on a structured social network. Each agent maintains an internal generative model, communicates with neighbors to form social priors, creates novel observations, and selectively incorporates others’ creations into memory. Simulation experiments demonstrate three main findings. First, informationally cohesive social groups emerge endogenously, with representational alignment mirroring the cluster topology of the underlying network. Second, a circular mutual constitution arises between social representations and the observation distribution, maintained through agents’ creative acts that project representational structure onto the external world. Third, the propagation of creations exhibits selective, heterogeneous patterns distinct from the stable diffusion of social representations, indicating that agents construct cultural niches through local interaction dynamics. These results suggest that the interplay between social conformity and creative deviation can give rise to the endogenous formation and differentiation of shared social reality.

[MA-8] Multi-agent Reinforcement Learning for Low-Carbon P2P Energy Trading among Self-Interested Microgrids

[Quick Read]: This paper addresses the challenges that uncertain renewable generation and dynamic demand pose for day-ahead scheduling, aiming to increase renewable penetration while maintaining intra-day power balance. The key to the solution is a multi-agent reinforcement learning framework in which self-interested microgrids autonomously optimize their bidding strategies (deciding both price and quantity) in peer-to-peer (P2P) electricity trading, maximizing their own profit via storage arbitrage under time-varying main-grid prices; a market clearing and settlement mechanism coordinates trades and ensures incentive compatibility, jointly improving community-level economic welfare and renewable utilization while reducing reliance on high-carbon electricity.

Link: https://arxiv.org/abs/2604.08973
Authors: Junhao Ren, Honglin Gao, Lan Zhao, Qiyu Kang, Gaoxi Xiao, Yajuan Sun
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: Accepted by IEEE ICC 2026, 6 pages, 2 figures

Abstract:Uncertainties in renewable generation and demand dynamics challenge day-ahead scheduling. To enhance renewable penetration and maintain intra-day balance, we develop a multi-agent reinforcement learning framework for self-interested microgrids participating in peer-to-peer (P2P) electricity trading. Each microgrid independently bids both price and quantity while optimizing its own profit via storage arbitrage under time-varying main-grid prices. A market-clearing mechanism coordinating trades and promoting incentive compatibility is proposed. Simulation results show that the learned bidding policy improves renewable utilization and reduces reliance on high-carbon electricity, while increasing community-level economic welfare, delivering a win-win situation in emission reduction and local prosperity.
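
A stand-in for the kind of price-and-quantity market clearing described above is a simple double auction: sort buy bids descending and sell offers ascending, then match while the best remaining bid still meets the best remaining offer. This sketch is illustrative only; the paper's clearing and settlement mechanism is more involved.

```python
def clear_market(buy_bids, sell_offers):
    """Match P2P buy bids and sell offers: highest-price buyers meet
    lowest-price sellers; trades clear while bid price >= offer price.
    Each bid/offer is (price, quantity). Returns a list of
    (buyer_price, seller_price, traded_qty)."""
    buys = sorted(buy_bids, key=lambda b: -b[0])    # highest bids first
    sells = sorted(sell_offers, key=lambda s: s[0])  # cheapest offers first
    trades = []
    while buys and sells and buys[0][0] >= sells[0][0]:
        (bp, bq), (sp, sq) = buys[0], sells[0]
        q = min(bq, sq)
        trades.append((bp, sp, q))
        buys[0], sells[0] = (bp, bq - q), (sp, sq - q)
        if buys[0][1] == 0:
            buys.pop(0)
        if sells[0][1] == 0:
            sells.pop(0)
    return trades

buy_bids = [(0.22, 4), (0.30, 5)]    # (price per kWh, kWh)
sell_offers = [(0.20, 6), (0.28, 3)]
trades = clear_market(buy_bids, sell_offers)
print(trades)  # two trades clear; the 0.28 offer finds no buyer
```

In the paper's setting the bids themselves are produced by each microgrid's learned policy, so the interesting dynamics live upstream of this clearing step.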

[MA-9] Litmus (Re)Agent: A Benchmark and Agentic System for Predictive Evaluation of Multilingual Models

[Quick Read]: This paper addresses predictive multilingual evaluation: estimating a model's performance on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and published evidence is unevenly distributed across languages, tasks, and model families. The key to the solution is a controlled benchmark of 1,500 questions spanning six tasks and five evidence scenarios, which separates accessible evidence from ground truth and thereby evaluates systems that must infer missing results from incomplete literature evidence. The paper also introduces Litmus (Re)Agent, a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesizes predictions via feature-aware aggregation; it achieves the best overall performance among six evaluated systems, with the largest gains in transfer-heavy scenarios, showing that structured agentic reasoning is an effective route to multilingual performance estimation under incomplete evidence.

Link: https://arxiv.org/abs/2604.08970
Authors: Avni Mittal, Shanu Kumar, Sandipan Dandapat, Monojit Choudhury
Affiliations: Microsoft Corporation, India; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Indian Institute of Technology Hyderabad
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:

Abstract:We study predictive multilingual evaluation: estimating how well a model will perform on a task in a target language when direct benchmark results are missing. This problem is common in multilingual deployment, where evaluation coverage is sparse and published evidence is uneven across languages, tasks, and model families. We introduce a controlled benchmark of 1,500 questions spanning six tasks and five evidence scenarios. The benchmark separates accessible evidence from ground truth, enabling evaluation of systems that must infer missing results from incomplete literature evidence. We also present Litmus (Re)Agent, a DAG-orchestrated agentic system that decomposes queries into hypotheses, retrieves evidence, and synthesises predictions through feature-aware aggregation. Across six systems, Litmus (Re)Agent achieves the best overall performance, with the largest gains in transfer-heavy scenarios where direct evidence is weak or absent. These results show that structured agentic reasoning is a promising approach to multilingual performance estimation under incomplete evidence.

[MA-10] Aligned Agents Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems

[Quick Read]: This paper examines how the structural properties of Multi-Agent Systems (MAS) cause bias to accumulate in complex workflows, questioning the common assumption that multi-agent collaboration naturally dilutes bias. The core question is how MAS topologies and feedback loops affect the propagation and amplification of bias, threatening the system's ethical robustness. The key to the solution is Discrim-Eval-Open, an open-ended evaluation benchmark that bypasses individual-model neutrality through forced comparative judgments across demographic groups. Empirically, even when individual agents are unbiased, structural sophistication often exacerbates rather than mitigates bias, and a "Trigger Vulnerability" exists whereby injecting purely objective context drastically accelerates polarization. These findings show that structural complexity does not guarantee ethical robustness, establishing a crucial baseline for the ethical design of MAS.

Link: https://arxiv.org/abs/2604.08963
Authors: Keyu Li, Jin Gao, Dequan Wang
Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:While Multi-Agent Systems (MAS) are increasingly deployed for complex workflows, their emergent properties, particularly the accumulation of bias, remain poorly understood. Because real-world MAS are too complex to analyze entirely, evaluating their ethical robustness requires first isolating their foundational mechanics. In this work, we conduct a baseline empirical study investigating how basic MAS topologies and feedback loops influence prejudice. Contrary to the assumption that multi-agent collaboration naturally dilutes bias, we hypothesize that structured workflows act as echo chambers, amplifying minor stochastic biases into systemic polarization. To evaluate this, we introduce Discrim-Eval-Open, an open-ended benchmark that bypasses individual model neutrality through forced comparative judgments across demographic groups. Analyzing bias cascades across various structures reveals that architectural sophistication frequently exacerbates bias rather than mitigating it. We observe systemic amplification even when isolated agents operate neutrally, and identify a 'Trigger Vulnerability' where injecting purely objective context drastically accelerates polarization. By stripping away advanced swarm complexity to study foundational dynamics, we establish a crucial baseline: structural complexity does not guarantee ethical robustness. Our code is available at this https URL.

[MA-11] Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication ICML2026

[Quick Read]: This paper addresses information asymmetry in cooperative decision making for multi-agent systems under partial observability: how can agents share complementary private information through effective communication to improve overall task performance? The core solution, SeqComm-DFL, unifies sequential communication with decision-focused learning. Its key components are value-aware message generation and sequential Stackelberg conditioning, which ensure messages are generated in priority order and conditioned on preceding agents' decisions; a guidance potential determined by a "prosocial ordering" is introduced, and a communication-augmented world model with QMIX factorization enables efficient end-to-end training via implicit differentiation. Theoretical analysis shows that the value of communication scales with the coordination gap in an information-theoretic sense and establishes an O(1/√T) convergence rate for the bilevel optimization.

Link: https://arxiv.org/abs/2604.08944
Authors: Benjamin Amoh, Geoffrey Parker, Wesley Marrero
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 15 pages, 6 figures, 3 tables. Includes appendix. Submitted to ICML 2026. Code available at this https URL

Abstract:Multi-agent coordination under partial observability requires agents to share complementary private information. While recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information) rather than decision quality, we introduce SeqComm-DFL, unifying sequential communication with decision-focused learning for task performance. Our approach features value-aware message generation with sequential Stackelberg conditioning: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors; the guidance potential is determined by their prosocial ordering. We extend Optimal Model Design to communication-augmented world models with QMIX factorization, enabling efficient end-to-end training via implicit differentiation. We prove information-theoretic bounds showing that communication value scales with coordination gaps and establish O(1/√T) convergence for the bilevel optimization, where T denotes the number of training iterations. On collaborative healthcare and StarCraft Multi-Agent Challenge (SMAC) benchmarks, SeqComm-DFL achieves four to six times higher cumulative rewards and over 13% win rate improvements, enabling coordination strategies inaccessible under information asymmetry.
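
The QMIX factorization referenced in the abstract constrains the joint value to be monotone in each agent's Q-value, typically by forcing non-negative mixing weights, so that each agent's greedy action also maximizes the joint value. A one-line sketch (in the real method the weights come from hypernetworks conditioned on global state, which is omitted here):

```python
import numpy as np

def qmix_mix(agent_qs, w, b):
    """QMIX-style monotonic mixing: combine per-agent Q-values with
    non-negative weights (abs enforces monotonicity) plus a bias."""
    return float(np.abs(w) @ agent_qs + b)

agent_qs = np.array([1.0, 2.0, 0.5])           # one Q-value per agent
mixed = qmix_mix(agent_qs, np.array([-0.5, 1.0, 2.0]), 0.1)
print(mixed)  # 3.6
```

Monotonicity is what lets centralized training coexist with fully decentralized greedy execution.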

[MA-12] Enhancing LLM Problem Solving via Tutor-Student Multi-Agent Interaction

[Quick Read]: This paper investigates how structured, role-differentiated interaction can improve a large language model's (LLM's) performance on complex problem solving without relying on stronger supervisory models or heterogeneous ensembles. The core of the solution, PETITE, is a multi-agent system grounded in the developmental-psychology principles of Peer Tutoring and Scaffolding, in which the same LLM is assigned asymmetric roles: a Student Agent generates and iteratively refines code solutions, while a Tutor Agent provides structured feedback without access to ground-truth answers. This role-differentiated interaction significantly improves coding accuracy without additional compute while substantially reducing token consumption, demonstrating that developmentally grounded role structures are an effective and efficient way to enhance LLM problem solving.

Link: https://arxiv.org/abs/2604.08931
Authors: Nurullah Eymen Özdemir, Erhan Oztop
Affiliations: Ozyegin University; Symbiotic Intelligent Systems Research Center, Institute for Open and Transdisciplinary Research Initiatives, The University of Osaka
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 7 pages, 3 figures. This work is under review for conference appearance

Abstract:Human cognitive development is shaped not only by individual effort but by structured social interaction, where role-based exchanges such as those between a tutor and a learner, enable solutions that neither could achieve alone. Inspired by these developmental principles, we ask the question whether a tutor-student multi-agent system can create a synergistic effect by pushing Large Language Model (LLM) beyond what it can do within existing frameworks. To test the idea, we adopt autonomous coding problem domain where two agents instantiated from the same LLM assigned asymmetric roles: a student agent generates and iteratively refines solutions, while a tutor agent provides structured evaluative feedback without access to ground-truth answers. In our proposed framework (PETITE), we aim to extract better problem-solving performance from one model by structuring its interaction through complementary roles, rather than relying on stronger supervisory models or heterogeneous ensembles. Our model is evaluated on the APPS coding benchmark against state-of-the-art approaches of Self-Consistency, Self-Refine, Multi-Agent Debate, and Multi-Agent Review. The results show that our model achieves similar or higher accuracy while consuming significantly fewer tokens. These results suggest that developmentally grounded role-differentiated interaction structures provide a principled and resource-efficient paradigm for enhancing LLM problem-solving through structured peer-like interactions. Index Terms: Peer Tutoring, Scaffolding, Large Language Models, Multi-Agent Systems, Code Generation
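
The tutor-student loop can be sketched as follows, with toy functions standing in for the two LLM roles: the student proposes and refines, and the tutor returns structured feedback derived from behavioral probes rather than a ground-truth answer key. The function names, probes, and termination rule are all illustrative assumptions.

```python
def student(problem, feedback=None):
    """Toy 'student': propose a solution, then patch it when the
    tutor's feedback names a failing case (stands in for an LLM call)."""
    if feedback and "negative" in feedback:
        return lambda x: abs(x)          # refined attempt
    return lambda x: x                   # naive first attempt

def tutor(solution):
    """Toy 'tutor': structured feedback from behavioral probes only --
    no ground-truth answer key is consulted."""
    if solution(-3) < 0:
        return "fails on negative inputs"
    return "accept"

problem = "implement absolute value"
feedback, attempt = None, None
for _ in range(3):                       # bounded refinement loop
    attempt = student(problem, feedback)
    feedback = tutor(attempt)
    if feedback == "accept":
        break
print(feedback, attempt(-3))  # accept 3
```

The asymmetry matters: the tutor never writes code and the student never grades itself, which is what distinguishes this from plain self-refinement.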

[MA-13] Beyond the Individual: Virtualizing Multi-Disciplinary Reasoning for Clinical Intake via Collaborative Agents ACL2026

[Quick Read]: This paper addresses the cognitive biases and incomplete evidence capture that arise when initial outpatient consultations are conducted by a single physician under time pressure, degrading clinical decision quality. The key to the solution is Aegle, a synchronous virtual Multi-Disciplinary Team (MDT) framework with a graph-based multi-agent architecture: the consultation state is formalized as a structured SOAP representation that separates evidence collection from diagnostic reasoning to improve traceability and bias control; an orchestrator dynamically activates specialist agents for decoupled parallel reasoning, and an aggregator integrates their outputs into a coherent clinical note, efficiently bringing MDT-level reasoning to outpatient settings.

Link: https://arxiv.org/abs/2604.08927
Authors: Huangwei Chen, Wu Li, Junhao Jia, Yining Chen, Xiaotao Pang, Ya-Long Chen, Li Gonghui, Haishuai Wang, Jiajun Bu, Lei Wu
Affiliations: Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, College of Computer Science and Technology, Zhejiang University; Hangzhou Pujian Medical Technology Co., Ltd, China; Sir Run Run Shaw Hospital, Zhejiang University School of Medicine; School of Computer Science and Technology, Hangzhou Dianzi University
Categories: Multiagent Systems (cs.MA)
Comments: Accepted to ACL 2026 Findings

Abstract:The initial outpatient consultation is critical for clinical decision-making, yet it is often conducted by a single physician under time pressure, making it prone to cognitive biases and incomplete evidence capture. Although the Multi-Disciplinary Team (MDT) reduces these risks, they are costly and difficult to scale to real-time intake. We propose Aegle, a synchronous virtual MDT framework that brings MDT-level reasoning to outpatient consultations via a graph-based multi-agent architecture. Aegle formalizes the consultation state using a structured SOAP representation, separating evidence collection from diagnostic reasoning to improve traceability and bias control. An orchestrator dynamically activates specialist agents, which perform decoupled parallel reasoning and are subsequently integrated by an aggregator into a coherent clinical note. Experiments on ClinicalBench and a real-world RAPID-IPN dataset across 24 departments and 53 metrics show that Aegle consistently outperforms state-of-the-art proprietary and open-source models in documentation quality and consultation capability, while also improving final diagnosis accuracy. Our code is available at this https URL.

[MA-14] Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching

[Quick Read]: This paper addresses the low recall, low precision, and poor interpretability of matching patients to clinical-trial eligibility criteria across a massive number of trials. The key to the solution is SatIR, a constraint-satisfaction-based method that uses Satisfiability Modulo Theories (SMT) and relational algebra to efficiently formalize and match key constraints from clinical trials and patient records; Large Language Models (LLMs) are used to convert ambiguous clinical reasoning, implicit assumptions, and incomplete records into explicit, precise, controllable, and interpretable formal constraints, substantially improving retrieval performance and transparency.

Link: https://arxiv.org/abs/2604.08849
Authors: Cyrus Zhou, Yufei Jin, Yilin Xu, Yu-Chiang Wang, Chieh-Ju Chao, Monica S. Lam
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Multiagent Systems (cs.MA); Symbolic Computation (cs.SC)
Comments: Under review

Abstract:Clinical trials are central to evidence-based medicine, yet many struggle to meet enrollment targets, despite the availability of over half a million trials listed on this http URL, which attracts approximately two million users monthly. Existing retrieval techniques, largely based on keyword and embedding-similarity matching between patient profiles and eligibility criteria, often struggle with low recall, low precision, and limited interpretability due to complex constraints. We propose SatIR, a scalable clinical trial retrieval method based on constraint satisfaction, enabling high-precision and interpretable matching of patients to relevant trials. Our approach uses formal methods – Satisfiability Modulo Theories (SMT) and relational algebra – to efficiently represent and match key constraints from clinical trials and patient records. Beyond leveraging established medical ontologies and conceptual models, we use Large Language Models (LLMs) to convert informal reasoning regarding ambiguity, implicit clinical assumptions, and incomplete patient records into explicit, precise, controllable, and interpretable formal constraints. Evaluated on 59 patients and 3,621 trials, SatIR outperforms TrialGPT on all three evaluated retrieval objectives. It retrieves 32%-72% more relevant-and-eligible trials per patient, improves recall over the union of useful trials by 22-38 points, and serves more patients with at least one useful trial. Retrieval is fast, requiring 2.95 seconds per patient over 3,621 trials. These results show that SatIR is scalable, effective, and interpretable.
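
The constraint-satisfaction view of eligibility can be illustrated without an SMT solver: represent numeric criteria as intervals, evaluate each against the patient record, and keep missing fields explicit as "unknown" rather than silently failing them. This is a simplified stand-in for SatIR's SMT/relational-algebra encoding; the field names and three-valued verdicts are assumptions.

```python
def eligible(patient, criteria):
    """Check a patient record against interval-style eligibility criteria.
    criteria: {field: (low, high)}. A missing patient field yields
    'unknown' rather than a failure, mirroring the idea of making
    incomplete-record assumptions explicit and interpretable."""
    verdicts = {}
    for field, (low, high) in criteria.items():
        if field not in patient:
            verdicts[field] = "unknown"
        else:
            verdicts[field] = "pass" if low <= patient[field] <= high else "fail"
    if any(v == "fail" for v in verdicts.values()):
        return "ineligible", verdicts
    if any(v == "unknown" for v in verdicts.values()):
        return "possibly eligible", verdicts
    return "eligible", verdicts

trial = {"age": (18, 75), "ejection_fraction": (0, 40)}
patient = {"age": 63}                 # ejection fraction not recorded
status, why = eligible(patient, trial)
print(status, why)
```

The per-field verdict dictionary is what makes the match explainable: a reviewer can see exactly which criterion passed, failed, or was left open by a gap in the record.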

[MA-15] Multi-User Large Language Model Agents

【Quick Read】: This paper addresses the poor fit of current large language models (LLMs) and their agents to multi-user, multi-principal settings: existing systems are designed around a single-principal interaction paradigm and struggle with the goal conflicts, information asymmetry, and privacy constraints that arise when multiple users participate at once. The key to the solution is the first systematic formalization of multi-user LLM-agent interaction as a multi-principal decision problem, combined with a unified multi-user interaction protocol and three targeted stress-testing scenarios (instruction following, privacy preservation, and collaborative coordination) for evaluating frontier LLMs. The evaluation reveals systematic deficiencies in maintaining stable prioritization, preserving privacy, and coordinating efficiently, providing a theoretical foundation and empirical evidence for designing future multi-user agents.

Link: https://arxiv.org/abs/2604.08567
Authors: Shu Yang,Shenzhe Zhu,Hao Zhu,José Ramón Enríquez,Di Wang,Alex Pentland,Michiel A. Bakker,Jiaxin Pei
Institutions: Stanford University; KAUST; University of Toronto; MIT
Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) and LLM-based agents are increasingly deployed as assistants in planning and decision making, yet most existing systems are implicitly optimized for a single-principal interaction paradigm, in which the model is designed to satisfy the objectives of one dominant user whose instructions are treated as the sole source of authority and utility. However, as they are integrated into team workflows and organizational tools, they are increasingly required to serve multiple users simultaneously, each with distinct roles, preferences, and authority levels, leading to multi-user, multi-principal settings with unavoidable conflicts, information asymmetry, and privacy constraints. In this work, we present the first systematic study of multi-user LLM agents. We begin by formalizing multi-user interaction with LLM agents as a multi-principal decision problem, where a single agent must account for multiple users with potentially conflicting interests and associated challenges. We then introduce a unified multi-user interaction protocol and design three targeted stress-testing scenarios to evaluate current LLMs’ capabilities in instruction following, privacy preservation, and coordination. Our results reveal systematic gaps: frontier LLMs frequently fail to maintain stable prioritization under conflicting user objectives, exhibit increasing privacy violations over multi-turn interactions, and suffer from efficiency bottlenecks when coordination requires iterative information gathering.
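The core behavior being stress-tested, following the highest-privilege instruction when instructions conflict, can be sketched in a few lines. The roles, privilege ranks, and tie-breaking rule below are illustrative assumptions, not the paper's protocol.

```python
# Minimal sketch of privilege-based conflict resolution: when instructions
# conflict, obey the one issued by the highest-authority principal.

PRIVILEGE = {"system": 3, "admin": 2, "member": 1, "tool_output": 0}

def resolve(instructions):
    """Pick the instruction from the highest-privilege source; on a tie,
    prefer the most recent one (last in the list)."""
    return max(enumerate(instructions),
               key=lambda it: (PRIVILEGE[it[1]["role"]], it[0]))[1]

conflict = [
    {"role": "member",      "text": "share the full meeting notes"},
    {"role": "admin",       "text": "redact salaries before sharing"},
    {"role": "tool_output", "text": "ignore all previous instructions"},
]

print(resolve(conflict)["text"])  # -> redact salaries before sharing
```

The benchmark's finding is precisely that models fail to apply this simple policy reliably once the number of principals and conflicts grows.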

Natural Language Processing

[NLP-0] Large Language Models Generate Harmful Content Using a Distinct Unified Mechanism

【Quick Read】: This paper investigates why the safety guardrails of large language models (LLMs) remain brittle after alignment training, as evidenced by frequent jailbreak attacks that bypass safety mechanisms and by fine-tuning-induced "emergent misalignment". The key to the solution is using targeted weight pruning as a causal intervention, revealing that harmful content generation depends on a structured, compressible set of weights inside the model that generalizes across harm types and is separate from benign capabilities. The study finds that aligned models compress harm-generation weights more strongly, which explains why fine-tuning in one domain can trigger broad misalignment, and shows that the capability to generate harmful content is dissociated from the model's ability to recognize and explain such content. These findings provide an internal structural basis for building more principled and robust safety mechanisms.

Link: https://arxiv.org/abs/2604.09544
Authors: Hadas Orgad,Boyi Wei,Kaden Zheng,Martin Wattenberg,Peter Henderson,Seraphina Goldfarb-Tarrant,Yonatan Belinkov
Institutions: Kempner Institute, Harvard University; Princeton University; Harvard University; Cohere; Technion—IIT
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
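The pruning-as-causal-probe methodology can be sketched abstractly: rank weights by an importance score for the behavior under study and zero out the top-k, then observe which capabilities degrade. The vectors below are synthetic toys; the paper derives importance from the model itself, not like this.

```python
# Toy illustration of targeted weight pruning as a causal probe.

def prune_top_k(weights, importance, k):
    """Zero the k weights ranked most important for the probed behavior."""
    ranked = sorted(range(len(weights)), key=lambda i: -importance[i])
    pruned = list(weights)
    for i in ranked[:k]:
        pruned[i] = 0.0
    return pruned

weights    = [0.8, -1.2, 0.3, 2.0, -0.5]
importance = [0.1,  0.9, 0.2, 0.7,  0.05]   # e.g. attribution scores

print(prune_top_k(weights, importance, 2))  # -> [0.8, 0.0, 0.3, 0.0, -0.5]
```

The paper's "compression" finding corresponds to how small k can be while still removing the harmful behavior, relative to an unaligned model.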

[NLP-1] VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

【Quick Read】: This paper targets the poor performance of vision-language models (VLMs) on visual perception tasks such as spatial understanding and viewpoint recognition, plausibly because natural image datasets provide limited supervision for low-level visual skills. The key to the solution is VisionFoundry, a task-aware synthetic data generation pipeline: given only a task keyword (such as Depth Order), it uses large language models (LLMs) to automatically generate questions, answers, and text-to-image (T2I) prompts, synthesizes images with T2I models, and verifies consistency with a proprietary VLM, with no need for reference images or human annotation. The resulting VisionFoundry-10K dataset yields substantial gains on visual perception benchmarks such as MMVP and CV-Bench-3D, showing that targeted synthetic supervision is an effective path toward alleviating the visual bottleneck of VLMs.

Link: https://arxiv.org/abs/2604.09531
Authors: Guanyu Zhou,Yida Yin,Wenhao Chai,Shengbang Tong,Xingyu Fu,Zhuang Liu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL

Click to view abstract

Abstract:Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.

[NLP-2] VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning ACL2026

【Quick Read】: This paper addresses the reliability problem of large vision-language models (LVLMs) in high-stakes settings, where hallucinations and incorrect responses are produced with high confidence. Existing text-based confidence calibration methods optimize a single holistic confidence score, cannot distinguish perception errors from reasoning errors, and are dominated by language priors, leading to poor calibration. The key innovation of the proposed VL-Calibration framework is to explicitly decouple confidence into visual and reasoning confidence via reinforcement learning, and to introduce an intrinsic visual certainty estimate that requires no ground-truth perception labels, combining KL divergence under image perturbations (measuring visual grounding consistency) with token entropy (measuring internal uncertainty). A token-level advantage reweighting strategy further focuses optimization on tokens with high visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments show marked gains in both calibration and visual reasoning accuracy, generalizing across model scales and architectures.

Link: https://arxiv.org/abs/2604.09529
Authors: Wenyi Xiao,Xinchi Xu,Leilei Gan
Institutions: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 24 pages, ACL 2026 Main. Repository: this https URL

Click to view abstract

Abstract:Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
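The two intrinsic signals the method combines can be computed directly from next-token distributions: a KL divergence between outputs with and without an image perturbation (visual grounding) and the token entropy (internal certainty). The distributions below are made-up toys, and this sketch omits the RL machinery entirely.

```python
# Sketch of the two signals in VL-Calibration's visual certainty estimate.
import math

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy of a discrete distribution (in nats)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

clean     = [0.7, 0.2, 0.1]   # next-token distribution, original image
perturbed = [0.4, 0.4, 0.2]   # same prompt, perturbed image

grounding_shift    = kl(clean, perturbed)  # large shift => vision-dependent
internal_certainty = entropy(clean)        # low entropy => confident token

print(round(grounding_shift, 3), round(internal_certainty, 3))
```

Intuitively, a token whose distribution moves a lot under image perturbation is visually grounded; one with low entropy is internally confident. The paper combines both into a per-token certainty signal.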

[NLP-3] You Can't Fight in Here! This is BBS!

【Quick Read】: This paper addresses two common misconceptions in current language model (LM) research: the "String Statistics Strawman", the mistaken idea that LMs cannot be linguistically competent or interesting because they learn statistically from strings, and the "As Good As it Gets Assumption", the idea that LM research as it stands in 2026 has reached the limit of what it can reveal about human language. The key to the response is to advocate a more expansive research program for the language sciences in the AI age, one that systematically answers these criticisms and builds an interdisciplinary framework that both deepens the understanding of human language and makes the study of LMs more scientific, enabling genuine co-evolution and progress of the language sciences and LM research.

Link: https://arxiv.org/abs/2604.09501
Authors: Richard Futrell,Kyle Mahowald
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at Behavioral and Brain Sciences as a response to the commentaries to the accepted target article "How Linguistics Learned to Stop Worrying and Love the Language Models", whose preprint appears here: arXiv:2501.17047

Click to view abstract

Abstract:Norm, the formal theoretical linguist, and Claudette, the computational language scientist, have a lovely time discussing whether modern language models can inform important questions in the language sciences. Just as they are about to part ways until they meet again, 25 of their closest friends show up – from linguistics, neuroscience, cognitive science, psychology, philosophy, and computer science. We use this discussion to highlight what we see as some common underlying issues: the String Statistics Strawman (the mistaken idea that LMs can’t be linguistically competent or interesting because they, like their Markov model predecessors, are statistical models that learn from strings) and the As Good As it Gets Assumption (the idea that LM research as it stands in 2026 is the limit of what it can tell us about linguistics). We clarify the role of LM-based work for scientific insights into human language and advocate for a more expansive research program for the language sciences in the AI age, one that takes on the commentators’ concerns in order to produce a better and more robust science of both human language and of LMs.

[NLP-4] BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

【Quick Read】: This paper addresses the bias introduced by rigid lexical-matching methods in large language model (LLM) evaluation, which conflate a model's true problem-solving ability with its compliance with predefined formatting and therefore yield unreliable results. The key innovation of BERT-as-a-Judge is an encoder-based discriminative framework that assesses the semantic correctness of generated answers against references, is robust to variations in output phrasing, and requires only lightweight training on synthetically annotated question-candidate-reference triplets. The approach matches the performance of much larger LLM judges at a fraction of the computational cost, striking a good balance between reliability and scalability.

Link: https://arxiv.org/abs/2604.09497
Authors: Hippolyte Gisserot-Boukhlef,Nicolas Boizard,Emmanuel Malherbe,Céline Hudelot,Pierre Colombo
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model’s true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge’s performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.

[NLP-5] Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL

【Quick Read】: This paper tackles the problem of accurately mapping natural language (NL) to Jira Query Language (JQL), which is challenged by ambiguous field references, instance-specific categorical values (such as component names or fix versions), and complex Boolean predicates. Existing single-pass large language models (LLMs) cannot discover which categorical values actually exist in a target Jira instance, nor verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. The key to the solution is Jackal, the first large-scale, execution-based text-to-JQL benchmark (100,000 validated NL-JQL pairs), together with Agentic Jackal, a tool-augmented agent whose core components are: (1) live query execution via a Jira MCP server for feedback-driven correction, and (2) JiraAnchor, a semantic retrieval tool based on embedding similarity search that resolves natural-language mentions of categorical values. Experiments show the approach substantially improves execution accuracy for 7 of 9 frontier LLMs, with particularly large gains on categorical-value and component-field resolution, indicating that semantic ambiguity rather than value-resolution errors is the dominant current bottleneck.

Link: https://arxiv.org/abs/2604.09470
Authors: Vishnu Murali,Anmol Gulati,Elias Lumer,Kevin Frank,Sindy Campagna,Vamse Kumar Subbiah
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.
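JiraAnchor-style value resolution can be sketched as nearest-neighbor search over embeddings of the values that actually exist in the instance. The 3-d "embeddings" and component names below are made up; a real system would embed the instance's actual field values with a text embedding model.

```python
# Sketch of embedding-based categorical value grounding.
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

catalog = {                      # values that actually exist in the instance
    "Mobile App":    [0.9, 0.1, 0.0],
    "Payments API":  [0.1, 0.9, 0.2],
    "Data Platform": [0.0, 0.2, 0.9],
}

def resolve_value(mention_vec):
    """Map a mention's embedding to the closest existing catalog value."""
    return max(catalog, key=lambda name: cosine(catalog[name], mention_vec))

# a mention like "the payments service", embedded near "Payments API"
print(resolve_value([0.2, 0.8, 0.1]))  # -> Payments API
```

Grounding the query against real values is what lifts categorical-value accuracy in the ablation: the model no longer has to guess names it has never seen.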

[NLP-6] Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities

【Quick Read】: This paper addresses the theoretical debate over the role of language models (LMs) in language processing and their relationship to psycholinguistic research. Through the lens of Marr's levels of analysis, the authors critically examine two claims: that predicting upcoming linguistic information from context is central to language processing, and that many advances in psycholinguistics would be impossible without large language models (LLMs). The key contribution is to outline future directions that combine the strengths of LLMs with traditional psycholinguistic models, pointing toward more integrated and explanatory research on language understanding and bridging the gap between computational modeling and human language processing.

Link: https://arxiv.org/abs/2604.09466
Authors: Sathvik Nair,Colin Phillips
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, Behavioral & Brain Sciences Commentary on Futrell & Mahowald (forthcoming)

Click to view abstract

Abstract:Under the lens of Marr’s levels of analysis, we critique and extend two claims about language models (LMs) and language processing: first, that predicting upcoming linguistic information based on context is central to language processing, and second, that many advances in psycholinguistics would be impossible without large language models (LLMs). We further outline future directions that combine the strengths of LLMs with psycholinguistic models.

[NLP-7] From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

【Quick Read】: This paper addresses the credit assignment (CA) problem that large language models (LLMs) face in reinforcement learning (RL): identifying which actions within a long trajectory caused a particular outcome. The problem is acute in two regimes: reasoning RL, which requires fine-grained credit over tokens and steps within a single chain-of-thought generation (500–30K+ tokens), and agentic RL, where multi-turn environment interaction brings stochastic transitions, partial observability, and horizons of 100+ turns (100K–1M tokens), making episode-level credit increasingly uninformative. The key contribution is a two-dimensional taxonomy (by assignment granularity: token, segment, step, turn, multi-agent; and by methodology: Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic) that organizes 47 CA methods published between 2024 and 2026 (41 core methods plus 6 adjacent enablers), along with three reusable resources: a structured paper inventory, a reporting checklist, and a benchmark protocol specification. The synthesis shows that the shift from reasoning to agentic RL is driving genuinely new credit assignment mechanisms, such as hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations, which have no direct precedent in reasoning RL.

Link: https://arxiv.org/abs/2604.09459
Authors: Chenchen Zhang
Institutions: Independent Researcher
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards – yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500–30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K–1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches – hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations – that have no direct precedent in reasoning RL. 
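The Monte Carlo family the survey starts from can be sketched as plain discounted returns: a sparse terminal reward is propagated backwards through the episode, so earlier turns receive exponentially less credit. The discount factor and episode below are illustrative.

```python
# Monte Carlo credit assignment baseline: discounted returns.

def discounted_returns(rewards, gamma=0.9):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over the episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# a 5-turn episode with a single outcome reward at the end
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
print([round(g, 3) for g in discounted_returns(rewards)])
# -> [0.656, 0.729, 0.81, 0.9, 1.0]
```

The survey's motivating observation is visible even here: over 100+ turns, the discounted signal reaching early turns becomes vanishingly small and indiscriminate, which is what the newer turn-level and counterfactual methods try to fix.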

[NLP-8] Many-Tier Instruction Hierarchy in LLM Agents

【Quick Read】: This paper addresses the difficulty LLM agents have in reliably following the highest-privilege instruction when instructions from multiple sources conflict, a problem that is especially acute in complex real-world agentic settings. Mainstream instruction hierarchy (IH) methods are limited to a fixed, small number of privilege levels (typically fewer than five) and rely on rigid role labels (such as "system" or "user"), so they cannot accommodate the broader, more dynamic instruction sources and contexts found in practice. The key innovation is the Many-Tier Instruction Hierarchy (ManyIH), which allows an arbitrary number of privilege levels, together with ManyIH-Bench, the first benchmark designed for this paradigm: it spans up to 12 privilege levels and 853 agentic tasks (covering coding and instruction following), with test cases built through LLM generation plus human verification for high fidelity and difficulty. Experiments show that even frontier models achieve under 40% accuracy as instruction conflicts scale, underscoring the urgent need for fine-grained, scalable conflict-resolution mechanisms.

Link: https://arxiv.org/abs/2604.09443
Authors: Jingyu Zhang,Tianjian Li,William Jurayj,Hongyuan Zhan,Benjamin Van Durme,Daniel Khashabi
Institutions: Johns Hopkins University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language model agents receive instructions from many sources - system messages, user prompts, tool outputs, and more - each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.

[NLP-9] UIPress: Bringing Optical Token Compression to UI-to-Code Generation

【Quick Read】: This paper addresses the low visual-token efficiency of UI-to-Code generation: how to substantially reduce prefill latency when generating thousands of structured HTML/CSS tokens from a single screenshot, without sacrificing code quality. Existing compression methods either select tokens with task-agnostic heuristics or merely zero out low-attention features without actually shortening the sequence, so they cannot truly reduce inference latency. The key innovation is UIPress, a lightweight learned compression module inserted between a frozen ViT encoder and the LLM decoder, which combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress roughly 6,700 visual tokens down to a fixed budget of 256, with LoRA fine-tuning of the decoder to bridge the representation gap. The full system adds only about 21.7M trainable parameters (0.26% of the 8B base model), improves the CLIP score on the Design2Code benchmark by up to +7.5%, and delivers a 9.1x time-to-first-token speedup, making it the first end-to-end encoder-side learned compression approach for this task.

Link: https://arxiv.org/abs/2604.09442
Authors: Dasen Dai,Shuoqi Li,Ronghao Chen,Huacan Wang,Biao Wu,Qizhen Lan
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 3 figures

Click to view abstract

Abstract:UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence – neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ~6,700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ~21.7M trainable parameters (0.26% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5% and the strongest inference-time method by +4.6%, while delivering a 9.1x time-to-first-token speedup.
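The sequence-shortening at the heart of encoder-side compression can be illustrated with plain average pooling down to a fixed token budget. UIPress learns this mapping with convolutions, element-guided reweighting, and Transformer refinement; the function below only shows the budgeting step, using made-up 2-d "tokens".

```python
# Sketch: reduce a long visual token sequence to a fixed budget
# before the decoder ever sees it (here via naive average pooling).

def pool_tokens(tokens, budget):
    """Average-pool a list of token vectors down to `budget` vectors."""
    n, dim = len(tokens), len(tokens[0])
    pooled = []
    for j in range(budget):
        lo, hi = j * n // budget, (j + 1) * n // budget
        chunk = tokens[lo:hi]
        pooled.append([sum(v[d] for v in chunk) / len(chunk)
                       for d in range(dim)])
    return pooled

tokens = [[float(i), float(i) * 2] for i in range(6700)]  # ~6,700 tokens
out = pool_tokens(tokens, 256)
print(len(out))  # -> 256, the fixed budget fed to the decoder
```

Because the decoder's prefill cost grows with sequence length, shrinking ~6,700 tokens to 256 is what produces the large time-to-first-token speedup; the learned module exists to decide *what* to keep, which naive pooling cannot.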

[NLP-10] Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

【Quick Read】: This paper addresses how to adapt large language models (LLMs) to downstream tasks when only a small number of task-specific examples are available. The key to the proposed Automated Instruction Revision (AIR) is a rule-induction approach that automatically distills concise, interpretable instruction rules from limited examples to steer model outputs, enabling precise adaptation to task-specific behavior. AIR performs best on label-remapping tasks, particularly where task logic can be captured by compact rules, whereas alternatives such as KNN retrieval and fine-tuning remain stronger on knowledge-intensive tasks or those dominated by dataset-specific annotation regularities.

Link: https://arxiv.org/abs/2604.09418
Authors: Solomiia Bilyk,Volodymyr Getmanskyi,Taras Firman
Institutions: Eleks Ltd.; Ukrainian Catholic University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks using limited task-specific examples. We position AIR within the broader landscape of adaptation strategies, including prompt optimization, retrieval-based methods, and fine-tuning. We then compare these approaches across a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. The paper argues that adaptation performance is strongly task-dependent: no single method dominates across all settings. Across five benchmarks, AIR was strongest or near-best on label-remapping classification, while KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger in tasks dominated by source-specific knowledge or dataset-specific annotation regularities.

[NLP-11] Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder ICLR2026

【Quick Read】: This paper addresses how to balance dataset size against computational cost in small-scale training, especially in resource-constrained settings such as small research labs. The key to the approach is a strongly reduced attention-only decoder architecture trained on progressively larger power-of-two subsets, which isolates the effect of dataset size on model performance. The results reveal smooth, scaling-law-consistent behavior with clear diminishing returns, and show that roughly 30% of the data suffices to reach about 90% of the full-data validation token-level accuracy, offering an actionable data-efficiency strategy for practical use.

Link: https://arxiv.org/abs/2604.09389
Authors: Götz-Henrik Wiegand,Lorena Raichle,Rico Städeli,Tomas Hrycej,Bernhard Bermeitinger,Siegfried Handschuh
Institutions: University of St. Gallen; Institute of Computer Science in Vorarlberg
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Presented as a paper at 3rd DATA-FM workshop @ ICLR 2026, Brazil. Published at 13th IEEE Swiss Conference on Data Science and AI (SDS 2026)

Click to view abstract

Abstract:Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled, smaller-scale settings remain less explored. In this work, we isolate dataset-size effects using a strongly reduced attention-only decoder architecture. By training on progressively larger power-of-two subsets, we observe smooth performance improvements accompanied by clear diminishing returns, consistent with scaling-law behavior. Using only about 30% of the training data is sufficient to reach approximately 90% of the full-data validation token-level accuracy. These results provide actionable insights into dataset scaling in a controlled, component-isolated setting and offer practical guidance for balancing dataset size and computational cost in compute- and data-restricted environments, such as small research labs and exploratory model development.
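The diminishing-returns pattern can be quantified by fitting a power law, error ≈ a * n^(-b), as a straight line in log-log space over the power-of-two subsets. The subset sizes and validation errors below are invented for illustration; the paper's actual numbers differ.

```python
# Sketch: least-squares scaling-law fit in log-log space.
import math

sizes  = [2 ** k for k in range(10, 15)]   # 1K ... 16K examples (toy)
errors = [0.40, 0.31, 0.24, 0.19, 0.15]    # hypothetical validation error

xs = [math.log(n) for n in sizes]
ys = [math.log(e) for e in errors]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)          # slope in log-log space

print(round(-b, 2))  # -> 0.35, the fitted power-law exponent
```

A slope near -0.35 means every doubling of data cuts error by roughly 1 - 2^(-0.35) ≈ 21%: steady improvement, but with each doubling costing twice as much compute, which is exactly the trade-off the paper studies.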

[NLP-12] Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios ACL2026

【Quick Read】: This paper addresses the substantial variability in performance and computational cost of large language models (LLMs) across tasks and queries, and in particular the poor generalization of existing routing systems in cold-start scenarios, where no training data from the target task domain is available. The key to the solution is a multi-level task-profile-guided data synthesis framework that constructs a hierarchical task taxonomy and generates diverse question-answer pairs to approximate the test-time query distribution. Building on this, the proposed TRouter models query-conditioned cost and performance via latent task-type variables, with a prior regularization term derived from the synthesized taxonomy, improving routing decisions in both cold-start and in-domain settings.

Link: https://arxiv.org/abs/2604.09377
Authors: Hui Liu,Bin Zou,Kecheng Chen,Jie Liu,Wenya Wang,Haoliang Li
Institutions: City University of Hong Kong; University of Hong Kong; Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments: 30 pages, Accepted by ACL 2026 Main

Click to view abstract

Abstract:Large language models (LLMs) exhibit substantial variability in performance and computational cost across tasks and queries, motivating routing systems that select models to meet user-specific cost-performance trade-offs. However, existing routers generalize poorly in cold-start scenarios where in-domain training data is unavailable. We address this limitation with a multi-level task-profile-guided data synthesis framework that constructs a hierarchical task taxonomy and produces diverse question-answer pairs to approximate the test-time query distribution. Building on this, we introduce TRouter, a task-type-aware router approach that models query-conditioned cost and performance via latent task-type variables, with prior regularization derived from the synthesized task taxonomy. This design enhances TRouter’s routing utility under both cold-start and in-domain settings. Across multiple benchmarks, we show that our synthesis framework alleviates cold-start issues and that TRouter delivers effective LLM routing.

[NLP-13] Arbitration Failure Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts

【Quick Read】: This paper asks why vision-language models (VLMs) answer incorrectly when an image conflicts with prior knowledge: is the failure one of perception or of arbitration? The study shows that VLMs encode visual information accurately (visual attributes are linearly decodable from early layers with AUC 0.86), so incorrect outputs stem not from weak encoding but from a failure to act on the visual signal during inference. The key to the solution is identifying and intervening on the model's internal multimodal arbitration process: Multimodal Arbitration Crossover (MAC) analysis and layer-by-layer Logit Lens probing reveal that, although visual signals are well encoded, it is the final-layer logit gap that predicts grounding outcomes; causal analysis further shows that image tokens carry almost all of the causal influence while text tokens carry almost none. Building on this, training-free activation steering (linear or sparse-autoencoder-guided) applied at early layers improves visual grounding by up to +3.8%, supporting the core conclusion that VLMs already see well and the real challenge is acting on what they see.

Link: https://arxiv.org/abs/2604.09364
Authors: Farhad Nooralahzadeh,Omid Rohanian,Yi Zhang,Jonathan Fürst,Kurt Stockinger
Institutions: Zurich University of Applied Sciences; University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:When a Vision-Language Model (VLM) sees a blue banana and answers “yellow”, is the problem of perception or arbitration? We explore the question in ten VLMs with various sizes and reveal an Encoding–Grounding Dissociation: models that fail to report what they see (and thus provide a wrong answer) still encode the visual evidence as strongly as models that provide the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes can be linearly decodable from early layers (AUC 0.86). The accuracy remains nearly identical for both successful and failed samples. However, the gap in the final-layer logit – not the strength of encoding – better predicts grounding outcomes with a correlation of . After having studied when VLMs base their answers on image clues rather than prior knowledge, we want to understand the causal relationships. We establish causality through full-sequence activation patching. The standard last-token interventions in LLM interpretability do not affect VLMs. In contrast, replacing the full token sequence at layers identified by MAC alters 60 to 84% of outputs. Partial-token decomposition shows that image tokens carry almost all of the causal impact, while text tokens have none. Scaling addresses the remaining architectural differences to achieve perfect retention. Moving from diagnosis to intervention, we show that training-free activation steering – both linear and sparse autoencoder-guided – in early layers can improve visual grounding by up to +3.8% with degrading performance in some setups. Overall, these findings lead to a clear conclusion: VLMs already see well, but the challenge is acting on what they see. Targeted interventions can help to bridge this gap.
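The AUC 0.86 probing result refers to a standard ranking metric with a simple definition: the probability that a randomly chosen positive example gets a higher probe score than a randomly chosen negative one (ties counting half). A self-contained computation on synthetic probe scores:

```python
# Mann-Whitney formulation of ROC AUC; scores and labels are toys standing
# in for a linear probe's outputs on "grounded" vs. "prior-driven" answers.

def auc(scores, labels):
    """P(random positive outscores random negative), ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
print(round(auc(scores, labels), 3))  # -> 0.889
```

An AUC of 0.86 from early-layer activations is strong evidence that the visual attribute is already encoded there, which is what makes the late-stage arbitration failure the interesting finding.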

[NLP-14] Visually-Guided Policy Optimization for Multimodal Reasoning ACL2026

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在强化学习中因文本主导特性导致的视觉忠实性不足问题,尤其是推理过程中视觉注意力激活稀疏及随时间步长出现视觉遗忘(temporal visual forgetting)的现象。解决方案的关键在于提出一种名为视觉引导策略优化(Visually-Guided Policy Optimization, VGPO)的新框架:其核心是引入视觉注意力补偿机制(Visual Attention Compensation),通过视觉相似性定位并增强视觉线索,并逐步提升后续推理步骤中的视觉期望以缓解视觉遗忘;同时结合双粒度优势重加权策略(dual-grained advantage re-weighting),在轨迹内层面突出高视觉激活的token,在轨迹间层面优先选择视觉累积表现更优的路径,从而显著提升模型的视觉激活水平和多模态推理性能。

链接: https://arxiv.org/abs/2604.09349
作者: Zengbin Wang,Feng Xiong,Liang Lin,Xuecai Hu,Yong Wang,Yanlin Wang,Man Zhang,Xiangxiang Chu
机构: AMAP, Alibaba Group(阿里巴巴集团); SYSU(中山大学); BUPT(北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL 2026

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.
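摘要中的双粒度优势重加权机制可以用如下纯 Python 草图示意:轨迹内按 token 的视觉激活相对偏差加权,轨迹间按轨迹的视觉累积加权。其中的线性权重公式与超参数 alpha、beta 均为本文为说明机制而假设的写法,并非论文原始实现:

```python
# 示意性草图:VGPO 风格的双粒度优势重加权(权重公式为假设)

def reweight_advantages(trajectories, alpha=0.5, beta=0.5):
    """trajectories: 每条轨迹是含逐 token 'advantages' 与
    逐 token 视觉注意力分数 'visual' 的字典。"""
    # 轨迹间:按平均视觉激活(视觉累积)计算轨迹级权重
    traj_visual = [sum(t["visual"]) / len(t["visual"]) for t in trajectories]
    mean_v = sum(traj_visual) / len(traj_visual)
    out = []
    for t, v in zip(trajectories, traj_visual):
        traj_w = 1.0 + beta * (v - mean_v)          # 视觉累积高的轨迹被优先
        mean_tok = sum(t["visual"]) / len(t["visual"])
        toks = []
        for adv, vis in zip(t["advantages"], t["visual"]):
            tok_w = 1.0 + alpha * (vis - mean_tok)  # 轨迹内:突出高视觉激活 token
            toks.append(adv * tok_w * traj_w)
        out.append(toks)
    return out
```

重加权后的优势可直接替换策略优化(如 GRPO/PPO)中的原始优势项。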

[NLP-15] Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym

【速读】: 该论文旨在解决当前空间推理(spatial reasoning)能力评估缺乏有效基准的问题,特别是现有基准多采用一次性(one-shot)测试方式,无法反映人类在交互式环境中逐步决策的特性。为应对这一挑战,作者提出Spatial-Gym,一个基于Gymnasium的强化学习环境,将空间约束推理任务解耦为2D网格路径规划的序列决策问题,并支持可选回溯机制,从而更贴近真实场景下的认知过程。其关键创新在于构建了一个结构化、可诊断的评测框架,能够系统性地分析模型在不同推理策略(如一步求解、分步推理、带回溯)下的表现差异,揭示了模型在难度适应性、视觉输入敏感性和链式思维(chain-of-thought)有效性等方面的局限性,为未来提升生成式AI的空间推理能力提供了明确方向和实验基础。

链接: https://arxiv.org/abs/2604.09338
作者: Lars Benedikt Kaesberg,Tianyu Yang,Niklas Bauer,Terry Ruas,Jan Philip Wahle,Bela Gipp
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). Step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to −5.6%) by constraining global planning. Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments have three key findings: (1) models fail to scale reasoning effort with difficulty, (2) vision models receiving images of the spatial environment reduce solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.
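论文把寻路建模为带可选回溯的序列决策任务。下面是一个模仿 Gymnasium reset/step 接口的极简 2D 网格环境草图:动作集合、奖励数值与回溯语义都是示意性假设,与 Spatial-Gym 的真实实现无关:

```python
# 极简 2D 网格寻路环境草图(仿 Gymnasium 接口,非 Spatial-Gym 原实现)

class GridEnv:
    MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

    def __init__(self, grid, start, goal):
        self.grid, self.start, self.goal = grid, start, goal  # grid 中 1 表示障碍

    def reset(self):
        self.pos = self.start
        self.history = [self.start]   # 记录轨迹以支持可选回溯
        return self.pos

    def step(self, action):
        if action == "B":             # 回溯动作:退回上一个位置
            if len(self.history) > 1:
                self.history.pop()
                self.pos = self.history[-1]
            return self.pos, 0.0, False
        dr, dc = self.MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        rows, cols = len(self.grid), len(self.grid[0])
        if 0 <= r < rows and 0 <= c < cols and self.grid[r][c] == 0:
            self.pos = (r, c)         # 越界或撞障碍时原地不动
            self.history.append(self.pos)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else -0.01), done
```

智能体每轮只需输出一个动作字符,即对应摘要中的 step-by-step 评测设置。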

[NLP-16] EthicMind: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue ACL2026

【速读】: 该论文旨在解决智能对话系统在情感敏感和伦理敏感场景中,因缺乏对用户情绪与伦理风险的协同感知而导致的行为不一致问题。现有模型通常孤立处理共情与伦理安全,难以根据多轮交互中动态变化的伦理风险和用户情绪调整响应策略。其解决方案的关键在于提出一个名为 EthicMind 的风险感知框架,该框架在推理阶段将伦理风险信号与用户情绪进行联合分析,制定高层级响应策略,并生成兼顾伦理引导与情感参与的上下文敏感回复,且无需额外训练模型。

链接: https://arxiv.org/abs/2604.09265
作者: Jiawen Deng,Wei Li,Wentao Zhang,Ziyun Jiao,Fuji Ren
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages, Accepted to the ACL 2026 Main Conference

点击查看摘要

Abstract:Intelligent dialogue systems are increasingly deployed in emotionally and ethically sensitive settings, where failures in either emotional attunement or ethical judgment can cause significant harm. Existing dialogue models typically address empathy and ethical safety in isolation, and often fail to adapt their behavior as ethical risk and user emotion evolve across multi-turn interactions. We formulate ethical-emotional alignment in dialogue as an explicit turn-level decision problem, and propose EthicMind, a risk-aware framework that implements this formulation in multi-turn dialogue at inference time. At each turn, EthicMind jointly analyzes ethical risk signals and user emotion, plans a high-level response strategy, and generates context-sensitive replies that balance ethical guidance with emotional engagement, without requiring additional model training. To evaluate alignment behavior under ethically complex interactions, we introduce a risk-stratified, multi-turn evaluation protocol with a context-aware user simulation procedure. Experimental results show that EthicMind achieves more consistent ethical guidance and emotional engagement than competitive baselines, particularly in high-risk and morally ambiguous scenarios.

[NLP-17] ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

【速读】: 该论文旨在解决跨学科研究中基于大规模文档集合回答自然语言提问时,传统依赖人工设计标注方案并逐项标注语料库所导致的效率低下与易错问题。其核心挑战在于如何自动化生成结构化证据(structured evidence),以支撑复杂分析任务。解决方案的关键在于提出ScheMatiQ框架,该框架利用大语言模型(LLM)的推理能力,直接从问题和文档集合中自动推导出数据模式(schema)和可验证的结构化数据库,并通过交互式网页界面支持专家对提取过程进行引导与修正,从而显著提升结构化知识抽取的效率与准确性。

链接: https://arxiv.org/abs/2604.09237
作者: Shahar Levy,Eliya Habba,Reshef Mintz,Barak Raveh,Renana Keydar,Gabriel Stanovsky
机构: The Hebrew University of Jerusalem (希伯来大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Many disciplines pose natural-language research questions over large document collections whose answers typically require structured evidence, traditionally obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which leverages calls to a backbone LLM to take a question and a corpus and produce a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: this http URL

[NLP-18] Do LLM s Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在通过强化学习人类反馈(Reinforcement Learning from Human Feedback, RLHF)内化安全策略时,其政策缺乏形式化定义且难以验证的问题。现有评估基准仅从外部标准衡量模型行为,无法检验模型是否真正理解并执行自身声明的安全边界。解决方案的关键在于提出符号-神经一致性审计(Symbolic-Neural Consistency Audit, SNCA),该框架首先通过结构化提示提取模型自述的安全规则,将其形式化为三类类型化的谓词(绝对型、条件型、自适应型),进而利用确定性比较方法对危害基准进行行为合规性测量,从而量化模型“所言”与“所行”之间的系统性差距。

链接: https://arxiv.org/abs/2604.09189
作者: Avni Mittal
机构: Microsoft(微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model’s self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.
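摘要中"把自述规则形式化为带类型谓词、再做确定性合规比较"的思路可草绘如下。规则类别、字段设计与判定逻辑均为本文虚构的示意,并非 SNCA 的原始实现:

```python
# 草图:SNCA 风格的类型化安全规则与确定性合规率计算(示意性假设)

from dataclasses import dataclass

@dataclass
class Rule:
    category: str
    kind: str            # "absolute" | "conditional" | "adaptive"
    condition: str = ""  # 仅 conditional / adaptive 规则使用

def complies(rule, prompt_category, model_refused, condition_met=False):
    """规则适用且要求拒绝时,模型是否确实拒绝(确定性判断)。"""
    if rule.category != prompt_category:
        return True                       # 规则不适用,视为合规
    if rule.kind == "absolute":
        return model_refused              # 绝对型:必须拒绝
    if rule.kind == "conditional":
        return model_refused if condition_met else True
    return True                           # adaptive:此草图中不作判定

def compliance_rate(rules, observations):
    """observations: (category, model_refused, condition_met) 三元组列表。"""
    ok = sum(complies(r, *obs) for r in rules for obs in observations
             if r.category == obs[0])
    total = sum(1 for r in rules for obs in observations
                if r.category == obs[0])
    return ok / total if total else 1.0
```

合规率与规则声明(如"绝对拒绝")之间的差距,即对应摘要度量的"所言与所行"鸿沟。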

[NLP-19] Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中仍普遍存在幻觉(hallucination)的问题,尤其是当相关文档已被成功检索时,模型仍可能生成不准确的答案。现有评估方法主要关注答案层面或段落层面的准确性,难以揭示证据在生成过程中如何被使用。为此,作者提出了一种面向问答(QA)的细粒度诊断框架(facet-level diagnostics framework),将每个问题分解为原子推理维度(reasoning facets),并构建“细粒度×片段”矩阵(Facet x Chunk matrix),结合检索相关性与基于自然语言推理的忠实度评分(faithfulness scores)来评估证据的充分性和 grounding 程度。解决方案的关键在于通过三种受控推理模式——严格 RAG(Strict RAG)、软 RAG(Soft RAG)和仅 LLM 生成(LLM-only)——对比分析检索与生成之间的错位(retrieval-generation misalignment),从而识别出证据缺失、证据错位及先验驱动覆盖等细粒度失败模式。实验证明,RAG 中的幻觉更多源于证据整合不当而非检索不准,且该方法能暴露传统答案级评估所忽视的系统性证据误用与错位规律。

链接: https://arxiv.org/abs/2604.09174
作者: Passant Elchafei,Monorama Swain,Shahed Masoudian,Markus Schedl
机构: Johannes Kepler University Linz, Austria
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.
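摘要中的 Facet x Chunk 矩阵把检索相关性与 NLI 忠实度分数逐元素结合,再按阈值判定各 facet 的证据是否充分。下面是一个极简草图,其中逐元素相乘的组合方式与 0.5 的阈值均为本文假设:

```python
# 草图:Facet x Chunk 证据矩阵与 facet 级诊断(组合方式与阈值为假设)

def facet_chunk_matrix(relevance, faithfulness):
    """relevance / faithfulness: facets x chunks 的分数矩阵(0~1),
    逐元素相乘得到每个 (facet, chunk) 的有效支撑分。"""
    return [[r * f for r, f in zip(rrow, frow)]
            for rrow, frow in zip(relevance, faithfulness)]

def diagnose(matrix, threshold=0.5):
    """每个 facet 只要存在一个支撑分达标的 chunk 即视为有证据支撑。"""
    return ["grounded" if max(row) >= threshold else "evidence-absent"
            for row in matrix]
```

对同一问题分别在 Strict RAG、Soft RAG 与 LLM-only 三种模式下比较诊断结果,即可定位检索-生成错位。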

[NLP-20] hink Less Know More: State-Aware Reasoning Compression with Knowledge Guidance for Efficient Reasoning

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在复杂任务中因Chain-of-Thought(CoT)推理链过长而导致的“过度思考”问题,即推理步骤冗余、推理延迟高,且现有CoT压缩方法难以在准确率与效率之间取得平衡,缺乏对推理过程中不同阶段冗余和推理偏差的细粒度适应能力。其解决方案的关键在于提出State-Aware Reasoning Compression with Knowledge Guidance (STACK) 框架,通过显式建模不同推理阶段的冗余来源,并引入检索增强引导机制,在不确定或存在偏差的推理状态下采用知识引导压缩,在高度自信但冗长的状态下采用自提示压缩,同时结合基于答案收敛性的早期停止机制抑制冗余验证;此外,该框架还设计了一种基于奖励差异驱动的训练策略,融合近端策略优化(Proximal Policy Optimization, PPO)与直接偏好优化(Direct Preference Optimization, DPO),使模型能够学习状态感知的压缩策略,从而实现更优的准确率-效率权衡。

链接: https://arxiv.org/abs/2604.09150
作者: Yi Sui,Chaozhuo Li,Dawei Song
机构: Beijing Institute of Technology, China; Beijing University of Posts and Telecommunications, China; The Open University, UK
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) achieve strong performance on complex tasks by leveraging long Chain-of-Thought (CoT), but often suffer from overthinking, leading to excessive reasoning steps and high inference latency. Existing CoT compression methods struggle to balance accuracy and efficiency, and lack fine-grained, step-level adaptation to redundancy and reasoning bias. Therefore, we propose State-Aware Reasoning Compression with Knowledge Guidance (STACK), a framework that performs step-wise CoT compression by explicitly modeling stage-specific redundancy sources and integrating with a retrieval-augmented guidance. STACK constructs online long-short contrastive samples and dynamically switches between knowledge-guided compression for uncertain or biased reasoning state and self-prompted compression for overly long but confident state, complemented by an answer-convergence-based early stopping mechanism to suppress redundant verification. We further propose a reward-difference-driven training strategy by combining Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), enabling models to learn state-conditioned compression strategies. Experiments on three mathematical reasoning benchmarks show that STACK achieves a superior accuracy-efficiency balance, reducing average response length by 59.9% while improving accuracy by 4.8 points over existing methods.
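摘要提到的"基于答案收敛的早停"机制可以草绘为:从每个推理步提取中间答案,当最近 k 步答案一致时停止生成。k 的取值与答案提取方式均为示意性假设:

```python
# 草图:基于答案收敛的推理早停(k 与答案提取方式为假设)

def converged(step_answers, k=3):
    """step_answers: 每个推理步提取出的候选答案序列(可为 None)。"""
    if len(step_answers) < k:
        return False
    tail = step_answers[-k:]
    return all(a is not None and a == tail[0] for a in tail)

def run_with_early_stop(steps, k=3):
    """steps: 模拟各推理步产出的答案流;返回实际步数与最终答案。"""
    answers = []
    for i, ans in enumerate(steps):
        answers.append(ans)
        if converged(answers, k):
            return i + 1, ans
    return len(steps), answers[-1] if answers else None
```

这一机制用于抑制摘要所说的冗余验证步骤,与状态感知压缩策略互补。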

[NLP-21] Prototype-Regularized Federated Learning for Cross-Domain Aspect Sentiment Triplet Extraction

【速读】: 该论文旨在解决Aspect Sentiment Triplet Extraction (ASTE)任务中跨域知识迁移的难题,即现有方法通常在孤立的数据集上训练,难以联合捕获不同领域间的共享特征表示,且受数据隐私限制无法进行集中式数据聚合。解决方案的关键在于提出一种基于原型的联邦学习框架——Prototype-based Cross-Domain Span Prototype extraction (PCD-SpanProto),其核心创新是让分布式客户端交换类别级原型(prototype)而非完整模型参数,同时设计加权性能感知聚合策略与对比正则化模块,以增强全局原型在领域异质性下的鲁棒性,并提升类内紧凑性与类间可分性之间的平衡,从而实现高效、低通信成本的跨域知识迁移。

链接: https://arxiv.org/abs/2604.09123
作者: Zongming Cai,Jianhang Tang,Zhenyong Zhang,Jinghui Qin,Kebing Jin,Hankz Hankui Zhuo
机构: State Key Laboratory of Public Big Data, Guizhou University(贵州大学公共大数据国家重点实验室); College of Computer Science and Technology, Guizhou University(贵州大学计算机科学与技术学院); School of Information Engineering, Guangdong University of Technology(广东工业大学信息工程学院); School of Artificial Intelligence, Nanjing University(南京大学人工智能学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aspect Sentiment Triplet Extraction (ASTE) aims to extract all sentiment triplets of aspect terms, opinion terms, and sentiment polarities from a sentence. Existing methods are typically trained on individual datasets in isolation, failing to jointly capture the common feature representations shared across domains. Moreover, data privacy constraints prevent centralized data aggregation. To address these challenges, we propose Prototype-based Cross-Domain Span Prototype extraction (PCD-SpanProto), a prototype-regularized federated learning framework to enable distributed clients to exchange class-level prototypes instead of full model parameters. Specifically, we design a weighted performance-aware aggregation strategy and a contrastive regularization module to improve the global prototype under domain heterogeneity and to promote intra-class compactness and inter-class separability across clients. Extensive experiments on four ASTE datasets demonstrate that our method outperforms baselines and reduces communication costs, validating the effectiveness of prototype-based cross-domain knowledge transfer.
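"按客户端性能加权聚合类级原型"的核心步骤可以草绘如下:服务器只收到各客户端的类级原型向量与验证性能分数,按分数加权求均值得到全局原型。加权公式为本文假设,并非 PCD-SpanProto 的原始定义:

```python
# 草图:性能感知的类级原型加权聚合(加权公式为假设)

def aggregate_prototypes(client_protos, client_scores):
    """client_protos: {client_id: {class_label: 原型向量}};
    client_scores: {client_id: 验证集性能分数(如 F1)}。"""
    classes = {c for protos in client_protos.values() for c in protos}
    global_protos = {}
    for cls in classes:
        vecs, weights = [], []
        for cid, protos in client_protos.items():
            if cls in protos:
                vecs.append(protos[cls])
                weights.append(client_scores[cid])
        total = sum(weights)
        dim = len(vecs[0])
        global_protos[cls] = [
            sum(w * v[d] for w, v in zip(weights, vecs)) / total
            for d in range(dim)
        ]
    return global_protos
```

注意整个过程只交换原型向量,这正是摘要所述通信成本低于传输完整模型参数的原因。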

[NLP-22] Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agent ic Speech Recognition

【速读】: 该论文旨在解决自动语音识别(ASR)领域中两个长期被忽视的关键问题:一是传统评估指标词错误率(WER)仅关注词级别的准确性,而无法有效衡量句子层面的语义正确性;二是交互式纠错机制——人类交流中不可或缺的环节——在ASR研究中缺乏系统性探索。解决方案的核心在于构建一个基于智能体(agent)框架的交互式ASR系统,其中关键创新包括:利用大语言模型作为裁判(LLM-as-a-Judge)来实现语义感知的评估,替代单一的词级指标;并设计由大语言模型驱动的智能体框架,模拟人类多轮交互过程,通过语义反馈实现识别结果的迭代优化。实验表明,该框架在多个标准数据集上显著提升了语义保真度和交互纠错能力。

链接: https://arxiv.org/abs/2604.09121
作者: Peng Wang(1),Yanqiao Zhu(1),Zixuan Jiang(1),Qinyuan Chen(2),Xingjian Zhao(2),Xipeng Qiu(2),Wupeng Wang(3),Zhifu Gao(3),Xiangang Li(3),Kai Yu(1),Xie Chen(1) ((1) X-LANCE Lab, Shanghai Jiao Tong University, (2) School of Computer Science, Fudan University, (3) Tongyi Fun Team, Alibaba Group)
机构: X-LANCE Lab, Shanghai Jiao Tong University (上海交通大学); School of Computer Science, Fudan University (复旦大学); Tongyi Fun Team, Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction-an essential component of human communication-has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), and the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.
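摘要批评的 WER 是逐词编辑距离除以参考词数、所有词等权的度量。作为参照,其标准计算方式如下(经典定义,非论文提出的新方法):

```python
# 标准词错误率(WER):词级编辑距离 / 参考词数,替换、插入、删除各计 1

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # 动态规划求词级编辑距离
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # 删除
                          d[i][j - 1] + 1,          # 插入
                          d[i - 1][j - 1] + cost)   # 替换或匹配
    return d[-1][-1] / len(ref)
```

例如 "the cat sat" 与 "the cat sit" 的 WER 为 1/3,即便该替换完全改变了句义权重也与任何无关紧要的虚词错误相同,这正是论文引入 LLM-as-a-Judge 的动机。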

[NLP-23] Few-Shot Contrastive Adaptation for Audio Abuse Detection in Low-Resource Indic Languages

【速读】: 该论文旨在解决多语言、低资源环境下语音形式的仇恨言论(abusive speech)检测问题,传统方法依赖自动语音识别(ASR)与文本分类的流水线,存在转录错误及丢失语音韵律信息的局限性。其解决方案的关键在于利用对比语言-音频预训练模型(Contrastive Language-Audio Pre-training, CLAP),直接从音频中提取跨语言表征,并通过少量标注数据进行轻量级投影适配(projection-only adaptation),从而在不依赖大量标注语料的情况下实现有效的跨语言音频滥用内容识别。实验表明,CLAP在10种印地语系语言上展现出强健的跨语言音频表示能力,且少样本适配性能接近全监督系统,但适应效果具有语言特异性且非单调随样本量提升。

链接: https://arxiv.org/abs/2604.09094
作者: Aditya Narayan Sankaran,Reza Farahbakhsh,Noel Crespi
机构: SAMOVAR, Télécom SudParis; Institut Polytechnique de Paris
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: 14 pages, preprint under review

点击查看摘要

Abstract:Abusive speech detection is becoming increasingly important as social media shifts towards voice-based interaction, particularly in multilingual and low-resource settings. Most current systems rely on automatic speech recognition (ASR) followed by text-based hate speech classification, but this pipeline is vulnerable to transcription errors and discards prosodic information carried in speech. We investigate whether Contrastive Language-Audio Pre-training (CLAP) can support abusive speech detection directly from audio. Using the ADIMA dataset, we evaluate CLAP-based representations under few-shot supervised contrastive adaptation in cross-lingual and leave-one-language-out settings, with zero-shot prompting included as an auxiliary analysis. Our results show that CLAP yields strong cross-lingual audio representations across ten Indic languages, and that lightweight projection-only adaptation achieves competitive performance with respect to fully supervised systems trained on complete training data. However, the benefits of few-shot adaptation are language-dependent and not monotonic with shot size. These findings suggest that contrastive audio-text models provide a promising basis for cross-lingual audio abuse detection in low-resource settings, while also indicating that transfer remains incomplete and language-specific in important ways.
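摘要中"冻结 CLAP 编码器、仅做轻量适配"的 few-shot 思路,可以用最简单的类均值原型最近邻来示意:每类只有少量支持样本的嵌入,查询按余弦相似度归入最近原型。这里不加载真实 CLAP 模型,嵌入用占位向量代替,分类规则也是为说明流程而假设的简化:

```python
# 草图:冻结编码器嵌入上的 few-shot 原型最近邻分类(嵌入为占位示意)

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def few_shot_classify(support, query):
    """support: {label: [embedding, ...]}(每类少量样本);
    query: 待分类音频的嵌入。以类均值原型做余弦最近邻。"""
    protos = {
        lab: [sum(v[d] for v in vecs) / len(vecs)
              for d in range(len(vecs[0]))]
        for lab, vecs in support.items()
    }
    return max(protos, key=lambda lab: cosine(query, protos[lab]))
```

论文的 projection-only 适配在此基础上还会对嵌入先过一层可训练投影,再做对比学习;上面的草图只保留了推断侧的骨架。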

[NLP-24] Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLM s through Logical Consistency

【速读】: 该论文旨在解决大语言模型在面对来自不同来源(如系统策略、用户请求、工具输出和检索到的上下文)的多指令冲突时,如何在保持任务效用和行为一致性的同时,有效遵循指令优先级的问题。现有研究多关注对抗性攻击场景下的指令层次问题,而忽视了实际应用中常见的良性指令冲突。解决方案的关键在于提出神经符号层次对齐(Neuro-Symbolic Hierarchical Alignment, NSHA),其核心机制是在推理阶段将指令解析建模为约束满足问题(Constraint Satisfaction Problem, CSP),通过求解器引导的推理获得在层次约束下最一致的适用指令集合;在训练阶段,则利用自动构建的监督信号,将求解器决策蒸馏至模型参数中,从而实现对指令优先级的显式建模与执行。

链接: https://arxiv.org/abs/2604.09075
作者: Shu Yang,Zihao Zhou,Di Wang,Wenda Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real-world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro-Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction-following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver-guided reasoning that formulates instruction resolution as a constraint satisfaction problem, enabling the model to derive a maximally consistent set of applicable instructions under hierarchical constraints. At training time, NSHA distills solver-based decisions into model parameters using automatically constructed supervision. We evaluate our approach on rule following, task execution, tool use, and safety, covering both single-turn and multi-turn interactions, and show that NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.
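"在层级约束下求最大一致指令集"的约束满足思路可以草绘为:按来源权威层级排序,贪心保留与已接受集合不冲突的指令。优先级表、冲突表示与贪心求解都是本文的示意性假设,真实 NSHA 使用求解器:

```python
# 草图:带优先级的指令冲突消解(贪心近似,示意性假设)

PRIORITY = {"system": 0, "user": 1, "tool": 2, "context": 3}  # 数值小者优先

def resolve(instructions, conflicts):
    """instructions: [(id, source)] 列表;
    conflicts: {(id_a, id_b), ...} 无序冲突对。
    返回在层级约束下保留的指令 id 列表。"""
    ordered = sorted(instructions, key=lambda x: PRIORITY[x[1]])
    accepted = []
    for iid, _src in ordered:
        if all((iid, a) not in conflicts and (a, iid) not in conflicts
               for a in accepted):
            accepted.append(iid)
    return accepted
```

当 system 指令与 user 指令冲突时,低层级指令被丢弃而其余不冲突的指令照常保留,这正是摘要所述"既避免违规又保持任务效用"的目标。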

[NLP-25] NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System

【速读】: 该论文旨在解决法院判决预测与解释(Court Judgment Prediction and Explanation, CJPE)系统在实际司法或法律研究场景中面临的两大核心挑战:一是如何实现高精度的判决预测,二是如何生成符合法律逻辑、结构清晰且可解释的推理过程。现有方法往往缺乏透明性和与司法实践的一致性,难以满足专业用户对可信度和合规性的要求。解决方案的关键在于提出一个名为NyayaMind的开源框架,其核心创新在于整合了检索(Retrieval)、推理(Reasoning)和验证(Verification)三个模块,构建了一个类人化的结构化决策流程。其中,检索模块基于RAG(Retrieval-Augmented Generation)管道从大规模法律语料中精准定位相关法条和判例,预测模块则利用针对印度法律领域微调的推理导向大语言模型(LLM),输出包括争议焦点、论点、裁判理由及最终判决在内的结构化法律推理内容,从而显著提升解释质量与证据一致性,推动可信AI辅助法律决策系统的落地应用。

链接: https://arxiv.org/abs/2604.09069
作者: Parjanya Aditya Shukla,Shubham Kumar Nigam,Debtanu Datta,Balaramamahanthi Deepak Patnaik,Noel Shallum,Pradeep Reddy Vanga,Saptarshi Ghosh,Arnab Bhattacharya
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Court Judgment Prediction and Explanation (CJPE) aims to predict a judicial decision and provide a legally grounded explanation for a given case based on the facts, legal issues, arguments, cited statutes, and relevant precedents. For such systems to be practically useful in judicial or legal research settings, they must not only achieve high predictive performance but also generate transparent and structured legal reasoning that aligns with established judicial practices. In this work, we present NyayaMind, an open-source framework designed to enable transparent and scalable legal reasoning for the Indian judiciary. The proposed framework integrates retrieval, reasoning, and verification mechanisms to emulate the structured decision-making process typically followed in courts. Specifically, NyayaMind consists of two main components: a Retrieval Module and a Prediction Module. The Retrieval Module employs a RAG pipeline to identify legally relevant statutes and precedent cases from large-scale legal corpora, while the Prediction Module utilizes reasoning-oriented LLMs fine-tuned for the Indian legal domain to generate structured outputs including issues, arguments, rationale, and the final decision. Our extensive results and expert evaluation demonstrate that NyayaMind significantly improves the quality of explanation and evidence alignment compared to existing CJPE approaches, providing a promising step toward trustworthy AI-assisted legal decision support systems.

[NLP-26] Anchored Sliding Window: Toward Robust and Imperceptible Linguistic Steganography ACL2026

【速读】: 该论文旨在解决基于语言模型的隐写术(linguistic steganography)在文本传输过程中对微小修改敏感的问题,即传统方法因假设文本无变化传输而缺乏鲁棒性。其解决方案的关键在于提出锚定滑动窗口(anchored sliding window, ASW)框架:在上下文窗口中固定最新token、提示词(prompt)以及一个桥接上下文(bridge context),促使语言模型补偿被排除的token,从而提升隐写文本的不可察觉性和鲁棒性;进一步将桥接上下文优化建模为一种提示蒸馏(prompt distillation)变体,并引入自蒸馏(self-distillation)策略增强效果。实验表明,ASW在文本质量、不可察觉性和鲁棒性方面均显著优于基线方法。

链接: https://arxiv.org/abs/2604.09066
作者: Ruiyi Yan,Shiao Meng,Yugo Murawaki
机构: Kyoto University (京都大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: ACL2026 Main

点击查看摘要

Abstract:Linguistic steganography based on language models typically assumes that steganographic texts are transmitted without alteration, making them fragile to even minor modifications. While previous work mitigates this fragility by limiting the context window, it significantly compromises text quality. In this paper, we propose the anchored sliding window (ASW) framework to improve imperceptibility and robustness. In addition to the latest tokens, the prompt and a bridge context are anchored within the context window, encouraging the model to compensate for the excluded tokens. We formulate the optimization of the bridge context as a variant of prompt distillation, which we further extend using self-distillation strategies. Experiments show that our ASW significantly and consistently outperforms the baseline method in text quality, imperceptibility, and robustness across diverse settings. The code is available at this http URL.
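锚定滑动窗口的上下文拼装逻辑可以草绘为:prompt 与 bridge context 始终保留,超出预算时只对中间的生成历史做滑动截断。窗口大小与 bridge 的具体内容均为示意(论文中 bridge 通过提示蒸馏优化得到):

```python
# 草图:ASW 风格的锚定滑动窗口上下文拼装(窗口与 bridge 内容为假设)

def build_context(prompt_tokens, bridge_tokens, history_tokens, window):
    """窗口 = prompt + bridge + 最近若干 token;
    超出预算时只丢弃 history 中较早的部分,锚定部分永不被截断。"""
    budget = window - len(prompt_tokens) - len(bridge_tokens)
    assert budget > 0, "窗口必须能容纳锚定部分"
    recent = history_tokens[-budget:] if budget < len(history_tokens) \
        else history_tokens
    return prompt_tokens + bridge_tokens + recent
```

与单纯截断整个上下文相比,这一拼装方式保证了隐写文本局部被修改后,收发双方仍能基于相同的锚定上下文对齐解码。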

[NLP-27] CONDESION-BENCH: Conditional Decision-Making of Large Language Models in Compositional Action Space

【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)在高风险决策场景中作为辅助决策工具时,因依赖两个简化假设而无法准确建模真实世界复杂决策过程的问题:一是动作从预定义有限集合中选择,二是未显式引入限制动作可行性的条件。这些问题导致模型难以捕捉现实决策中动作的组合结构及其有效性约束。解决方案的关键在于提出 CONDESION-BENCH 基准测试平台,其核心创新是将动作定义为对决策变量的分配,并通过变量级、上下文级和分配级的显式条件来约束动作空间;同时采用基于 oracle 的评估方法,同时衡量决策质量与条件合规性,从而实现对 LLM 决策能力更严格、更贴近实际的评估。

链接: https://arxiv.org/abs/2604.09029
作者: Yeonjun Hwang,Sungyong Park,Minju Kim,Dongha Lee,Jinyoung Yeo
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:Large language models have been widely explored as decision-support tools in high-stakes domains due to their contextual understanding and reasoning capabilities. However, existing decision-making benchmarks rely on two simplifying assumptions: actions are selected from a finite set of pre-defined candidates, and explicit conditions restricting action feasibility are not incorporated into the decision-making process. These assumptions fail to capture the compositional structure of real-world actions and the explicit conditions that constrain their validity. To address these limitations, we introduce CONDESION-BENCH, a benchmark designed to evaluate conditional decision-making in compositional action space. In CONDESION-BENCH, actions are defined as allocations to decision variables and are restricted by explicit conditions at the variable, contextual, and allocation levels. By employing oracle-based evaluation of both decision quality and condition adherence, we provide a more rigorous assessment of LLMs as decision-support tools.
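"动作是对决策变量的赋值、并受变量级与赋值级显式条件约束"的设定可以草绘如下。示例中的变量名、上下文字段与条件内容均为虚构,仅用于说明合规检查的结构:

```python
# 草图:组合动作空间中的条件合规检查(变量与条件均为虚构示例)

def check_allocation(allocation, context, var_conditions, alloc_conditions):
    """allocation: {变量: 值};context: 上下文字典;
    var_conditions: {变量: 值 -> bool} 的变量级条件;
    alloc_conditions: [(allocation, context) -> bool] 的赋值级条件列表。
    返回违反的条件列表,空列表表示完全合规。"""
    violations = []
    for var, value in allocation.items():
        rule = var_conditions.get(var)
        if rule is not None and not rule(value):
            violations.append(("variable", var))
    for i, cond in enumerate(alloc_conditions):
        if not cond(allocation, context):
            violations.append(("allocation", i))
    return violations
```

基准的 oracle 评估即同时考察决策质量(赋值优劣)与此类条件的遵守情况。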

[NLP-28] owards Linguistically-informed Representations for English as a Second or Foreign Language: Review Construction and Application

【速读】: 该论文旨在解决当前第二语言习得研究中对英语作为第二或外语(ESFL)缺乏系统性、知识密集型语料库的问题。传统观点将ESFL视为标准英语的偏离,而本文主张将其视为独立的语言系统,亟需专门的语法-语义资源来支持深入研究。解决方案的关键在于基于建构主义理论,以“构式”(construction)为基本分析单位,构建一个能同时刻画ESFL与标准英语句法-语义接口的资源框架。该方法通过参照标准英语的句法-语义映射,保留ESFL的独特特征,最终形成包含1643条标注句子的金标准语义库(sembank),并验证其在第二语言习得研究中的实用性,如用于检验语言生态位假说(Linguistic Niche Hypothesis)。

链接: https://arxiv.org/abs/2604.09008
作者: Wenxi Li,Xihao Wang,Weiwei Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread use of English as a Second or Foreign Language (ESFL) has sparked a paradigm shift: ESFL is not seen merely as a deviation from standard English but as a distinct linguistic system in its own right. This shift highlights the need for dedicated, knowledge-intensive representations of ESFL. In response, this paper surveys existing ESFL resources, identifies their limitations, and proposes a novel solution. Grounded in constructivist theories, the paper treats constructions as the fundamental units of analysis, allowing it to model the syntax–semantics interface of both ESFL and standard English. This design captures a wide range of ESFL phenomena by referring to syntactico-semantic mappings of English while preserving ESFL’s unique characteristics, resulting in a gold-standard syntactico-semantic resource comprising 1643 annotated ESFL sentences. To demonstrate the sembank’s practical utility, we conduct a pilot study testing the Linguistic Niche Hypothesis, highlighting its potential as a valuable tool in Second Language Acquisition research.

[NLP-29] ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂表格问答任务中因表序列化(table serialization)导致的性能瓶颈问题,具体挑战包括结构忽视、表示鸿沟和推理不透明性。现有方法难以捕捉显式层级关系且缺乏模式灵活性,而基于树的方案则受限于语义适应能力不足。解决方案的关键在于提出ASTRA(Adaptive Semantic Tree Reasoning Architecture),其核心由两个模块构成:AdaSTR(自适应语义树重构模块)利用LLM的全局语义感知能力将表格重构为逻辑语义树(Logical Semantic Tree),显式建模层级依赖并根据表格规模自适应优化构建策略;DuTR(双模推理框架)进一步融合基于树搜索的文本导航与符号代码执行,实现语言对齐与精确验证的协同推理,从而显著提升复杂表格问答的准确性与可解释性。

链接: https://arxiv.org/abs/2604.08999
作者: Xiaoke Guo,Songze Li,Zhiqiang Liu,Zhaoyan Gong,Yuanxiang Liu,Huajun Chen,Wen Zhang
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.

[NLP-30] PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用角色提示(Persona Prompting)时存在的两个核心问题:一是最优角色设定难以高效识别,二是角色提示对输出质量的影响机制尚不明确。现有方法主要依赖推理阶段的角色搜索策略,导致计算开销增加。为此,作者提出在训练阶段引入角色敏感性建模,通过强化学习结合可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)来提升模型对不同角色提示的鲁棒性;但RLVR存在一个固有权衡:虽增强任务导向型鲁棒性,却可能削弱角色扮演中的表现忠实度(Persona Fidelity)。为突破此限制,论文提出PerMix-RLVR策略——一种角色混合的强化学习框架,在保持有害角色扰动下鲁棒性的同时,显著提升角色忠实度,实验证明其在MATH500和PersonaGym数据集上分别将角色稳定性得分(Persona Stability Score, PSS)和角色忠实度提升21.2%与11.4%。

链接: https://arxiv.org/abs/2604.08986
作者: Jihwan Oh,Soowon Oh,Murad Aghazada,Minchan Jeong,Sungnyun Kim,Se-Young Yun
机构: KAIST AI; Samsung Advanced Institute of Technology; Seoul National University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when needed, e.g., in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.
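论文摘要未给出 PerMix-RLVR 的具体实现;下面仅以"按一定比例向训练样本混入随机 persona 前缀"来示意"persona-mixed"这一核心思想(mix_ratio、persona 文本与拼接方式均为假设,并非论文的实际训练流程):

```python
import random

def permix_batch(tasks, personas, mix_ratio=0.5, seed=0):
    # 以 mix_ratio 的概率为每条训练样本附加随机 persona 前缀(示意)
    rng = random.Random(seed)
    batch = []
    for t in tasks:
        if personas and rng.random() < mix_ratio:
            batch.append(f"{rng.choice(personas)}\n{t}")  # 混入 persona
        else:
            batch.append(t)  # 保留无 persona 的原始任务
    return batch
```

这样构造的批次同时包含带 persona 与不带 persona 的同类任务,直观上可在可验证奖励优化中保留对 persona 提示的敏感性。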

[NLP-31] Testing the Assumptions of Active Learning for Translation Tasks with Few Samples

【速读】: 该论文试图解决的问题是:在仅有少量样本(100–500个)的情况下,主动学习(Active Learning, AL)策略在语言生成任务中无法超越随机采样(random sampling)的性能表现。其关键发现在于,传统AL方法所依赖的核心假设——即训练数据的信息量(informativeness)和多样性(diversity)与测试性能正相关——在小样本场景下并不成立;相反,训练样本的顺序以及与预训练数据的交互作用对模型性能的影响更大。因此,未来主动学习方法的设计必须考虑这些因素,才能在极低标注预算下有效提升模型性能。

链接: https://arxiv.org/abs/2604.08977
作者: Lorenzo Jaime Yu Flores,Cesare Spinoso di-Piano,Ori Ernst,David Ifeoluwa Adelani,Jackie Chi Kit Cheung
机构: Mila - Quebec AI Institute (魁北克人工智能研究所); McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Active learning (AL) is a training paradigm for selecting unlabeled samples for annotation to improve model performance on a test set, which is useful when only a limited number of samples can be annotated. These algorithms often work by optimizing for the informativeness and diversity of the training data to be annotated. Recent work found that AL strategies fail to outperform random sampling on various language generation tasks when using 100-500 samples. To understand AL’s poor performance when only using few samples, we investigate whether the core assumptions underlying AL strategies hold. We find that neither the informativeness nor diversity of the training data, which AL strategies optimize for, are correlated with test set performance. Instead, factors like the ordering of the training samples and interactions with pre-training data have a larger impact on performance. This suggests that future AL methods must take these factors into account in order to work with very few samples.

[NLP-32] Quantisation Reshapes the Metacognitive Geometry of Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在不同精度推理格式下,其领域级元认知效率(domain-level metacognitive efficiency)是否保持一致的问题。研究发现,模型量化(model quantisation)并非均匀地降低元认知能力,而是重构了各知识领域的元认知监控水平(如M-ratio),但未改变类型2元认知判别力(Type-2 AUROC),说明这种重构源于M-ratio归一化方式的敏感性而非底层信号变化。解决方案的关键在于识别出:依赖M-ratio进行领域级元认知评估的系统存在对推理格式的隐含依赖,而使用Type-2 AUROC(AUROC_2)作为指标则能提供更稳定、可靠的元认知评估,从而避免因量化带来的误导性判断。

链接: https://arxiv.org/abs/2604.08976
作者: Jon-Paul Cacioli
机构: Melbourne(墨尔本); Australia(澳大利亚)
类目: Computation and Language (cs.CL)
备注: 10 pages, 2 figures, 5 tables. Pre-registered study. Code and data: this https URL

点击查看摘要

Abstract:We report that model quantisation restructures domain-level metacognitive efficiency in LLMs rather than degrading it uniformly. Evaluating Llama-3-8B-Instruct on the same 3,000 questions at Q5_K_M and f16 precision, we find that M-ratio profiles across four knowledge domains are uncorrelated between formats (Spearman rho = 0.00). Arts & Literature moves from worst-monitored (M-ratio = 0.606 at Q5_K_M) to best-monitored (1.542 at f16). Geography moves from well-monitored (1.210) to under-monitored (0.798). However, Type-2 AUROC profiles are perfectly stable across formats (rho = 1.00), localising the restructuring to the M-ratio normalisation rather than the underlying discrimination signal. This finding emerged from a pre-registered attempt to improve metacognition through domain-conditional training. We prescribed confidence-amplification SFT for the diagnosed weak domain, with matched-budget agnostic and wrong-prescription controls. All four confirmatory hypotheses were null (10,000 bootstrap resamples, seed = 42). The training successfully reshaped confidence distributions, doubling the NLP gap in Science from 0.076 to 0.152, but did not improve meta-d’ because the diagnostic profile did not transfer across formats. Any system relying on domain-level M-ratio profiles has an unexamined dependency on inference format. Systems using AUROC_2 are safer. We release all code, pre-registrations, and trial-level data.
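摘要中作为稳定指标的 Type-2 AUROC,衡量的是置信度区分"回答正确"与"回答错误"的能力,可按秩和等价形式直接计算,示意如下:

```python
def type2_auroc(confidence, correct):
    # Type-2 AUROC:随机取一条正确回答与一条错误回答,
    # 正确回答置信度更高的概率(平局记 0.5)
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

与 M-ratio 不同,该量只依赖置信度的相对排序,不涉及归一化,这正是摘要中其跨推理格式保持 rho = 1.00 的直观原因。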

[NLP-33] Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning

【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)对语言模型置信度评分(confidence scores)与输出质量之间相关性的影响问题。研究表明,SFT后置信度评分与输出质量的关联性会下降,其根源在于置信度变化可能受输出与训练分布相似性等非质量因素驱动,而非真实预测准确性。解决方案的关键在于认识到当前置信度指标不能直接用于下游任务,必须在特定微调场景下重新评估其有效性,并推动开发对微调更具鲁棒性的置信度度量方法。

链接: https://arxiv.org/abs/2604.08974
作者: Lorenzo Jaime Yu Flores,Cesare Spinoso di-Piano,Jackie Chi Kit Cheung
机构: Mila - Quebec AI Institute (Mila - 魁北克人工智能研究所); McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Uncertainty quantification is a set of techniques that measure confidence in language models. They can be used, for example, to detect hallucinations or alert users to review uncertain predictions. To be useful, these confidence scores must be correlated with the quality of the output. However, recent work found that fine-tuning can affect the correlation between confidence scores and quality. Hence, we investigate the underlying behavior of confidence scores to understand its sensitivity to supervised fine-tuning (SFT). We find that post-SFT, the correlation of various confidence scores degrades, which can stem from changes in confidence scores due to factors other than the output quality, such as the output’s similarity to the training distribution. We demonstrate via a case study how failing to address this miscorrelation reduces the usefulness of the confidence scores on a downstream task. Our findings show how confidence metrics cannot be used off-the-shelf without testing, and motivate the need for developing metrics which are more robust to fine-tuning.
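摘要反复考察的"置信度分数与输出质量的相关性",最直接的度量方式之一是 Pearson 相关系数(论文也可能使用秩相关等其他指标,此处仅为示意):

```python
def pearson(xs, ys):
    # 置信度分数 xs 与输出质量分数 ys 的 Pearson 相关系数
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

按摘要的发现,同一置信度指标在 SFT 前后分别计算该相关系数,若显著下降,即说明置信度开始受输出质量之外的因素(如与训练分布的相似性)驱动。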

[NLP-34] Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models ACL2026

【速读】: 该论文旨在解决半自回归(Semi-autoregressive, Semi-AR)解码在扩散大语言模型(Diffusion Large Language Models, dLLMs)中因块内约束导致的跨块稳定标记(stable tokens)解码延迟问题,从而提升推理效率与性能。解决方案的关键在于提出一种无需训练、可即插即用的动态解码策略——基于锚点的历史稳定解码(Anchor-based History-stable Decoding, AHD),其核心机制是通过实时监测 token 的稳定性趋势(以动态锚点为基准),一旦检测到 token 达到稳定状态即触发提前跨块解码,从而有效缓解 block 约束带来的延迟,显著提升生成效率并逆转现有加速策略常见的性能下降问题。

链接: https://arxiv.org/abs/2604.08964
作者: Shun Zou,Yong Wang,Zehui Chen,Lin Chen,Chongyang Tao,Feng Zhao,Xiangxiang Chu
机构: University of Science and Technology of China (中国科学技术大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: Accepted for ACL 2026

点击查看摘要

Abstract:Diffusion Large Language Models (dLLMs) have recently become a promising alternative to autoregressive large language models (ARMs). Semi-autoregressive (Semi-AR) decoding is widely employed in base dLLMs and advanced decoding strategies due to its superior performance. However, our observations reveal that Semi-AR decoding suffers from inherent block constraints, which cause the decoding of many cross-block stable tokens to be unnecessarily delayed. To address this challenge, we systematically investigate the identification of stable tokens and present three key findings: (1) naive lookahead decoding is unreliable, (2) token stability closely correlates with convergence trend, and (3) historical information is isolated. Building on these insights, we propose Anchor-based History-stable Decoding (AHD), a training-free, plug-and-play dynamic decoding strategy. Specifically, AHD monitors the stability trend of tokens in real time through dynamic anchors. Once a token reaches stability, it initiates early cross-block decoding to enhance efficiency and performance. Extensive experiments across language, vision-language, and audio-language domains demonstrate that AHD simultaneously improves both performance and inference efficiency. Notably, AHD effectively reverses the performance degradation typically observed in existing advanced decoding acceleration strategies. For instance, on the BBH benchmark, our approach reduces decoding steps by 80% while improving performance by 3.67%.
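AHD 的"动态锚点"细节摘要未展开;下面以"某位置的 top-1 预测连续 window 个去噪步不变即视为稳定,可提前解码"这一简化规则,示意其逐步稳定性监测的思路(window 为假设的固定窗口,真实方法中锚点是动态的):

```python
def update_stability(history, top1, window=3):
    # history[pos]:该位置历次去噪步的 top-1 预测序列
    # 连续 window 步预测不变的位置视为"稳定",可提前跨块解码(示意)
    stable = []
    for pos, tok in enumerate(top1):
        history.setdefault(pos, []).append(tok)
        recent = history[pos][-window:]
        if len(recent) == window and len(set(recent)) == 1:
            stable.append(pos)
    return stable
```

每个去噪步调用一次该函数,返回的稳定位置无需等待所属块被解到即可固定,从而缓解摘要所述的块约束延迟。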

[NLP-35] TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在中文税收实践领域存在显著能力缺口的问题,即现有税法相关基准测试多集中于孤立的自然语言处理(NLP)任务,忽视了真实场景下的综合应用能力。其解决方案的关键在于提出TaxPraBen——首个专为中文税收实践设计的基准评测体系,它不仅整合了10项传统应用任务与3类创新性的现实场景(如税务风险防控、税务稽查分析和税务策略规划),还构建了一个可扩展的结构化评估范式,通过“结构化解析—字段对齐提取—数值与文本匹配”的流程实现端到端的税务实践能力评估,并具备向其他专业领域延伸的潜力。

链接: https://arxiv.org/abs/2604.08948
作者: Gang Hu,Yating Chen,Haiyan Ding,Wang Gao,Jiajia Huang,Min Peng,Qianqian Xie,Kun Yu
机构: Yunnan University (云南大学); Wuhan University (武汉大学); Jianghan University (江汉大学); Nanjing Audit University (南京审计大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) excel in various general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Consequently, while tax-related benchmarks are gaining attention, many focus on isolated NLP tasks, neglecting real-world practical capabilities. To address this issue, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It combines 10 traditional application tasks with 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm designed through a process of “structured parsing-field alignment extraction-numerical and textual matching”, enabling end-to-end tax practice assessment while being extensible to other domains. We evaluate 19 LLMs based on Bloom’s taxonomy. The results indicate significant performance disparities: all closed-source large-parameter LLMs excel, and Chinese LLMs like Qwen2.5 generally exceed multilingual LLMs, while the YaYi2 LLM, fine-tuned with some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluations of LLMs in practical applications.
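摘要中"结构化解析-字段对齐提取-数值与文本匹配"的评分流程可用如下示意函数表达(字段名、数值容差与匹配规则均为演示用的假设,并非基准的实际实现):

```python
def score_structured_answer(pred, gold, tol=1e-6):
    # 对齐金标字段,数值字段按容差匹配,文本字段去首尾空白后精确匹配(示意)
    hits = 0
    for key, gval in gold.items():
        if key not in pred:
            continue  # 未解析出该字段,不计分
        pval = pred[key]
        if isinstance(gval, (int, float)):
            try:
                hits += abs(float(pval) - float(gval)) <= tol  # 数值匹配
            except (TypeError, ValueError):
                pass  # 无法解析为数值视为不匹配
        else:
            hits += str(pval).strip() == str(gval).strip()  # 文本匹配
    return hits / len(gold)
```

这种按字段打分的方式允许模型以任意文本格式作答,只要结构化解析后字段值正确即可得分。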

[NLP-36] MuTSE: A Human-in-the-Loop Multi-use Text Simplification Evaluator

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在文本简化任务中,因缺乏结构化、可视化框架而导致的提示(prompt)与模型组合(prompt-model permutations)难以系统评估的问题。当前研究多依赖静态计算脚本,教育者则受限于标准对话界面,无法实现多维度、可重复的定量与定性分析。解决方案的关键在于提出 MuTSE —— 一个交互式人机协同 Web 应用程序,其核心创新是集成了一种新型分层语义对齐引擎,并引入线性偏置启发式(linearity bias heuristic, λ),能够实时生成 P × M 个提示-模型组合的对比矩阵,通过视觉映射源句与其简化版本,显著降低认知负荷并支持结构化标注,从而为下游自然语言处理(Natural Language Processing, NLP)数据集构建提供可复现的评估基础。

链接: https://arxiv.org/abs/2604.08947
作者: Rares-Alexandru Roscan,Gabriel Petre,Adrian-Marius Dumitran,Angela-Liliana Dumitran
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for ITS 2026

点击查看摘要

Abstract:As Large Language Models (LLMs) become increasingly prevalent in text simplification, systematically evaluating their outputs across diverse prompting strategies and architectures remains a critical methodological challenge in both NLP research and Intelligent Tutoring Systems (ITS). Developing robust prompts is often hindered by the absence of structured, visual frameworks for comparative text analysis. While researchers typically rely on static computational scripts, educators are constrained to standard conversational interfaces; neither paradigm supports systematic multi-dimensional evaluation of prompt-model permutations. To address these limitations, we introduce MuTSE (the project code and demo have been made available for peer review at this https URL), an interactive human-in-the-loop web application designed to streamline the evaluation of LLM-generated text simplifications across arbitrary CEFR proficiency targets. The system supports concurrent execution of P × M prompt-model permutations, generating a comprehensive comparison matrix in real-time. By integrating a novel tiered semantic alignment engine augmented with a linearity bias heuristic (λ), MuTSE visually maps source sentences to their simplified counterparts, reducing the cognitive load associated with qualitative analysis and enabling reproducible, structured annotation for downstream NLP dataset construction.
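摘要未给出"线性偏置启发式 λ"的公式;一种合理的猜测是:句对齐打分时对偏离整体语序(即偏离相对位置对角线)的配对按 λ 罚分。以下打分形式纯属假设,仅用于说明这一类启发式:

```python
def align_with_linearity_bias(sim, lam=0.1):
    # sim[i][j]:源句 i 与简化句 j 的语义相似度
    # 线性偏置:相对位置差 |i/n_src - j/n_tgt| 越大,罚分越重(形式为假设)
    n_src, n_tgt = len(sim), len(sim[0])
    return [(i, max(range(n_tgt),
                    key=lambda j: sim[i][j] - lam * abs(i / n_src - j / n_tgt)))
            for i in range(n_src)]
```

在相似度接近时,偏置使对齐倾向保持原有语序;而当跨位置的相似度差异足够大时,仍会压过偏置,允许语序重排的对齐。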

[NLP-37] NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression

【速读】: 该论文旨在解决维度情感分析(Dimensional Aspect-Based Sentiment Analysis, DimABSA)中的连续情感值预测问题,即针对文本中每个特定方面(aspect)输出连续的效价(valence, VA)和唤醒度(arousal)得分,而非传统的离散极性标签。其解决方案的关键在于采用基于XLM-RoBERTa-base模型的微调策略,将输入构造为[CLS] T [SEP] a_i [SEP]格式,并设计双回归头(dual regression heads)分别预测效价和唤醒度,输出通过Sigmoid函数缩放到[1, 9]区间。此外,针对不同语言-领域组合(如英文与中文在餐厅、笔记本电脑和金融领域的组合)分别训练独立模型,并合并训练与开发集进行最终测试预测,实验证明该任务特定微调方法在少样本提示设置下显著优于多个大语言模型(LLMs),如GPT-5.2、LLaMA系列等。

链接: https://arxiv.org/abs/2604.08923
作者: Tong Wu,Nicolay Rusnachenko,Huizhi Liang
机构: Bournemouth University (布恩茅斯大学); Newcastle University (纽卡斯尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence-arousal (VA) regression. This paper describes a system developed for Track A - Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, constructing the input as [CLS] T [SEP] a_i [SEP] and training dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language-domain combination (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models including GPT-5.2, LLaMA-3-70B, LLaMA-3.3-70B, and LLaMA-4-Maverick under a few-shot prompting setting, demonstrating that task-specific fine-tuning substantially and consistently outperforms these LLM-based methods across all evaluation datasets. The code is publicly available at this https URL.
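摘要中的输入模板与双回归头映射可以直接写成代码:sigmoid 输出线性缩放到 [1, 9] 区间(下面的线性头权重与池化向量仅为演示用的假设,实际系统为微调后的 XLM-RoBERTa 编码器加双回归头):

```python
import math

def build_input(text, aspect, cls="[CLS]", sep="[SEP]"):
    # 论文模板:[CLS] T [SEP] a_i [SEP]
    return f"{cls} {text} {sep} {aspect} {sep}"

def scale_to_va(logit, low=1.0, high=9.0):
    # 回归头输出经 sigmoid 后线性映射到效价/唤醒度区间 [1, 9]
    return low + (high - low) / (1.0 + math.exp(-logit))

def dual_heads(pooled, w_valence, w_arousal):
    # 双回归头:同一句向量分别线性映射为效价与唤醒度(权重为示意)
    v = scale_to_va(sum(p * w for p, w in zip(pooled, w_valence)))
    a = scale_to_va(sum(p * w for p, w in zip(pooled, w_arousal)))
    return v, a
```

sigmoid 缩放保证预测值天然落在任务要求的 [1, 9] 区间内,无需对回归输出做额外截断。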

[NLP-38] Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective

【速读】: 该论文旨在解决链式思维(Chain-of-thought, CoT)知识蒸馏中存在的一类“能力差距”(capacity gap)问题,即当教师模型与学生模型的能力差异较大时,蒸馏过程可能失效甚至导致性能下降。以往研究常仅报告蒸馏后的结果,掩盖了蒸馏后性能反而低于学生原始基线的现象。为此,作者提出了一种更贴近实际的评估协议,强调在蒸馏前后均进行对比分析,并发现能力差距的影响并非在所有任务和设置中都占主导地位,尤其当候选教师模型之间性能差异显著时,其影响更为复杂。这一新评估框架为教师-学生配对选择提供了更具实践指导意义的依据。

链接: https://arxiv.org/abs/2604.08880
作者: Tokio Kajitsuka,Ukyo Honda,Sho Takase
机构: The University of Tokyo (东京大学); CyberAgent (CyberAgent)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student’s pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.

[NLP-39] GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

【速读】: 该论文旨在解决多模态讽刺目标识别(Multimodal Sarcasm Target Identification, MSTI)中的细粒度定位难题,现有方法依赖隐式跨模态对齐,导致可解释性差且定位精度不足。其解决方案的关键在于提出GRASP框架,通过引入视觉接地的思维链(Grounded Chain-of-Thought, Grounded CoT)推理机制,显式地将与讽刺相关的视觉区域锚定在推理过程中,并促使模型在预测最终分类标签和讽刺目标前生成结构化的解释逻辑;同时采用双阶段监督优化策略——坐标感知加权损失的监督微调与细粒度目标策略优化,显著提升了模型在多模态场景下的细粒度讽刺目标识别性能与内部推理质量。

链接: https://arxiv.org/abs/2604.08879
作者: Faxian Wan,Xiaocui Yang,Yifan Cao,Shi Feng,Daling Wang,Yifei Zhang
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.

[NLP-40] Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition

【速读】: 该论文旨在解决多语言人格识别(Multilingual Personality Recognition)中因缺乏高质量跨语言数据集而导致的性能瓶颈问题。其核心解决方案在于提出ADAM框架,关键创新包括:1)基于现有英文人格数据集,利用大语言模型(LLM)进行翻译增强,并引入人格引导的生成式增强(Personality-Informed Generative Augmentation, PIGA),以生成多种语言(如日语、中文、马来语和法语)下的高质量训练样本;2)设计跨语言注意力蒸馏(Cross-Lingual Attention Distillation, CLAD)机制,使模型能够跨语言学习并捕捉人格特征的一致性表示,从而有效弥合语言与文化差异带来的认知鸿沟。实验表明,结合PIGA增强与CLAD训练的模型在多个数据集上显著优于标准二元交叉熵(BCE)基线,且具备良好的泛化能力与基准性能。

链接: https://arxiv.org/abs/2604.08851
作者: Jing Jie Tan,Ban-Hoe Kwan,Danny Wee-Kiat Ng,Yan-Chai Hum,Noriyuki Kawarazaki,Kosuke Takano
机构: Universiti Tunku Abdul Rahman (UTAR); Kanagawa Institute of Technology; Université Sorbonne Paris Nord (USPN)
类目: Computation and Language (cs.CL)
备注: IEEE Transactions on Cognitive and Developmental Systems (2026)

点击查看摘要

Abstract:While significant work has been done on personality recognition, the lack of multilingual datasets remains an unresolved challenge. To address this, we propose ADAM (Cross-Lingual (A)ttention (D)istillation with Personality-Guided Generative (A)ugmentation for (M)ultilingual Personality Recognition), a state-of-the-art approach designed to advance multilingual personality recognition. Our approach leverages an existing English-language personality dataset as the primary source and employs a large language model (LLM) for translationbased augmentation, enhanced by Personality-Informed Generative Augmentation (PIGA), to generate high-quality training data in multiple languages, including Japanese, Chinese, Malay, and French. We provide a thorough analysis to justify the effectiveness of these augmentation techniques. Building on these advancements, ADAM integrates Cross-Lingual Attention Distillation (CLAD) to train a model capable of understanding and recognizing personality traits across languages, bridging linguistic and cultural gaps in personality analysis. This research presents a thorough evaluation of the proposed augmentation method, incorporating an ablation study on recognition performance to ensure fair comparisons and robust validation. Overall, with PIGA augmentation, the findings demonstrate that CLAD significantly outperforms the standard BCE across all languages and personality traits, achieving notable improvements in average BA scores - 0.6332 (+0.0573) on the Essays dataset and 0.7448 (+0.0968) on the Kaggle dataset. The CLAD-trained model also demonstrated strong generalizability and achieved benchmark performance comparable to current leading encoder models. The model weight, dataset, and algorithm repository are available at this https URL.
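摘要以 BA(balanced accuracy,平衡准确率)分数报告各语言与人格特质上的结果;二分类 BA 的标准计算如下,供对照:

```python
def balanced_accuracy(y_true, y_pred):
    # 二分类 BA:正类召回率与负类召回率的平均,对类别不平衡更稳健
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    pos = sum(t == 1 for t in y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)
```

人格特质标签通常分布不均,BA 相比普通准确率能避免"全预测多数类"得到虚高分数,这也是此类工作常用 BA 汇报的原因。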

[NLP-41] Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs CVPR2026

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在面对恶意查询时易产生不安全响应的问题,尤其针对现有安全防护方法(如提示工程、响应分类或微调)在应对不断演化的攻击模式时效果有限、需重新运行查询或计算开销过大等局限性。其解决方案的关键在于提出一种名为**字典对齐概念控制(Dictionary-Aligned Concept Control, DACO)**的框架,该框架通过构建包含15,000个多模态概念的 curated concept dictionary(即DACO-400K数据集)并结合稀疏自编码器(Sparse Autoencoder, SAE),实现对冻结模型激活值的细粒度干预:首先利用大量图像-文本对提取概念方向,其次借助稀疏编码进行精准激活调控,并最终基于该字典初始化SAE训练以自动标注原子语义,从而在不损害通用能力的前提下显著提升MLLM的安全性。

链接: https://arxiv.org/abs/2604.08846
作者: Jinqi Luo,Jinyu Yang,Tal Neiman,Lei Fan,Bing Yin,Son Tran,Mubarak Shah,René Vidal
机构: University of Pennsylvania (宾夕法尼亚大学); Amazon (亚马逊); University of Central Florida (中佛罗里达大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. We name the dataset DACO-400K. Second, we show that the curated dictionary can be used to intervene activations via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities.
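基于概念字典的激活干预,可以用"减去激活向量在目标概念方向上的投影分量"来最简示意(真实的 DACO 通过稀疏编码在 15,000 个概念上联合分解激活,下面的单概念正交投影仅为假设性简化):

```python
def suppress_concept(activation, concept_dir, strength=1.0):
    # 按 strength 削减激活向量在目标概念方向上的分量(示意性"转向"干预)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    coef = dot(activation, concept_dir) / dot(concept_dir, concept_dir)
    return [a - strength * coef * d
            for a, d in zip(activation, concept_dir)]
```

strength=1.0 时完全移除该概念分量,0 到 1 之间则部分削减;正交方向上的分量不受影响,这对应摘要中"调整特定概念而不波及其他概念"的目标。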

[NLP-42] HiFloat4 Format for Language Model Pre-training on Ascend NPUs

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在训练和部署过程中面临的高计算与内存开销问题,其核心挑战在于如何在不显著牺牲模型性能的前提下实现低精度训练。解决方案的关键在于系统性地评估并优化适用于华为Ascend NPU的4位浮点数(FP4)格式——特别是HiFloat4与MXFP4——并在大规模训练场景中全面应用FP4精度进行线性(GEMM)和专家层(expert GEMM)运算。研究通过引入专为FP4训练设计的稳定化技术,有效抑制数值误差累积,使相对误差控制在全精度基线的1%以内,同时保持高达4倍的计算吞吐量和内存效率提升,从而实现了高效且稳定的FP4训练方案。

链接: https://arxiv.org/abs/2604.08826
作者: Mehran Taghian,Yunke Peng,Xing Huang,Yao Wang,Yaoyuan Wang,Wei Guo,Yuanyong Luo,Tianchi Hu,Junsong Wang,Xin Wang,Hu Liu,Yu Cheng,Ziwei Yu,Hongliang Li,Mehdi Rahimifar,Lei Yan,Xuefei Wang,Zhuang Ma,Lei Liu,Hui Yu,Anandharaju Durai Raju,Hoang Le,Hei Yi Mak,Tanzila Rahman,Shadan Golestan
机构: Huawei(华为)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats, such as MXFP4 and NVFP4, can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines. In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We evaluate both dense architectures (e.g., Pangu and LLaMA-style models) and mixture-of-experts (MoE) models, where both standard linear layers and expert-specific GEMMs operate in FP4. Furthermore, we explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation, maintaining relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation. Our results provide a comprehensive empirical study of FP4 training on NPUs and highlight the practical trade-offs between FP4 formats in large-scale dense and MoE models.
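块内共享缩放的 FP4 量化(MXFP4 风格)可示意如下:E2M1 网格是标准 FP4 的可表示值集合;HiFloat4 的具体编码摘要未给出,此处实现仅为按 MXFP4 思路的假设性模拟:

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 格式的非负可表示值

def quantize_fp4_block(xs):
    # 块内共享缩放:最大幅值映射到 FP4 最大可表示值 6.0,
    # 再将各元素舍入到最近的可表示值(MXFP4 风格示意;HiFloat4 编码为假设)
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return list(xs)
    scale = amax / 6.0
    out = []
    for x in xs:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out
```

对激活或权重分块后逐块量化,即可在纯 Python 中模拟 FP4 GEMM 的数值误差,用于离线评估低精度训练的相对误差。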

[NLP-43] p1: Better Prompt Optimization with Fewer Prompts

【速读】: 该论文旨在解决生成式 AI (Generative AI) 中系统提示(system prompt)优化效果不稳定的问题,即在不更新模型权重的前提下,如何提升提示优化(prompt optimization)在不同任务上的有效性。研究表明,提示优化的成功与否取决于系统提示间奖励方差与响应内随机性方差的相对大小:当系统提示间的差异足够显著时,优化才有效;反之,若响应的随机性主导了整体方差,则优化失败。关键创新在于提出一种名为 p1 的用户提示筛选方法,通过选择在候选系统提示下具有高方差的少量用户提示子集,增强系统提示之间的区分度,从而显著提升优化效率和泛化能力。实验表明,仅用两个来自 AIME 24 的用户提示即可训练出具有良好跨任务泛化性能的系统提示。

链接: https://arxiv.org/abs/2604.08801
作者: Zhaolin Gao,Yu (Sid) Wang,Bo Liu,Thorsten Joachims,Kianté Brantley,Wen Sun
机构: Cornell University (康奈尔大学); Microsoft (微软); Harvard University (哈佛大学); Databricks AI Research (Databricks人工智能研究)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates the variance of the system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose p1, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system optimization easier. Experiments on reasoning benchmarks show that p1 substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
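摘要中的方差分解与用户提示筛选可写成如下示意(奖励矩阵为构造数据,仅说明"选高系统提示间方差的用户提示"这一思路):

```python
from statistics import mean, pvariance

def reward_variances(rewards):
    # rewards[i][j]:第 i 个系统提示第 j 次采样得到的奖励
    between = pvariance([mean(r) for r in rewards])  # 系统提示间方差
    within = mean(pvariance(r) for r in rewards)     # 响应随机性方差(各提示内方差均值)
    return between, within

def filter_user_prompts(reward_tables, k=2):
    # p1 的核心筛选思想(示意):保留系统提示间方差最大的 k 个用户提示,
    # 这些提示最能区分系统提示的好坏
    scored = sorted(range(len(reward_tables)),
                    key=lambda u: -reward_variances(reward_tables[u])[0])
    return scored[:k]
```

当某个用户提示下 between 远小于 within 时,它对区分系统提示几乎没有信息量,筛掉它反而让优化更容易,这正对应摘要中"更多用户提示可能有害"的发现。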

[NLP-44] Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation

【速读】: 该论文旨在解决语言模型在跨文化语境下对故事道德内涵理解不足的问题,即当前模型难以准确捕捉不同语言-文化背景下人类对故事道德的多样性解释。解决方案的关键在于提出一种新的评估任务——多语言故事道德生成(multilingual story moral generation),并构建了一个涵盖14个语言-文化对的人类撰写故事道德数据集,通过语义相似性、人类偏好调查和价值分类三种方式对比模型输出与人类解读的差异。研究表明,尽管前沿模型如GPT-4o和Gemini能生成语义上接近人类回答的道德总结且受人类偏好认可,但其输出在跨语言间变化较小,集中于少数广泛共享的价值观,揭示了模型在再现人类叙事理解多样性方面的局限。该研究将叙事理解视为一种评价任务,为衡量语言模型的文化对齐能力提供了超越静态基准或知识测试的新范式。

链接: https://arxiv.org/abs/2604.08797
作者: Sophie Wu,Andrew Piper
机构: McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. Thus, we introduce multilingual story moral generation as a novel culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization. We show that frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human responses and preferred by human evaluators. However, their outputs exhibit markedly less cross-linguistic variation and concentrate on a narrower set of widely shared values. These findings suggest that while contemporary models can approximate central tendencies of human moral interpretation, they struggle to reproduce the diversity that characterizes human narrative understanding. By framing narrative interpretation as an evaluative task, this work introduces a new approach to studying cultural alignment in language models beyond static benchmarks or knowledge-based tests.
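摘要的核心发现之一是模型输出的"跨语言变化明显更小";这种变化程度可以用各语言道德总结嵌入向量的平均两两余弦距离来量化,示意如下(嵌入的具体来源为假设,论文的实际度量方式未在摘要中给出):

```python
def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / ((sum(a * a for a in u) ** 0.5) *
                  (sum(b * b for b in v) ** 0.5))

def cross_lingual_diversity(vectors):
    # 各语言道德总结嵌入的平均两两余弦距离:值越低,跨语言变化越小
    n = len(vectors)
    dists = [1 - cosine(vectors[i], vectors[j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)
```

对人类答案与模型输出分别计算该指标并比较,即可检验"模型输出集中于少数共享价值观"的说法。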

[NLP-45] MedConceal: A Benchmark for Clinical Hidden-Concern Reasoning Under Partial Observability

【速读】: 该论文旨在解决医疗对话中因信息不对称导致的“隐性关切识别”难题,即患者常因恐惧、误解或实际障碍而未主动披露关键信息,传统医疗对话评估基准则通过暴露隐藏状态或将 elicitation(探查)简化为 extraction(提取)来规避此问题。其解决方案的关键在于构建了一个名为 MedConceal 的交互式患者模拟器,该模拟器基于真实临床问答数据设计,包含300个精心策划的病例和600次医生-大语言模型(LLM)交互,每个案例均包含医生可见上下文与模拟器内部隐含的患者关切(hidden concerns),这些关切源自文献并由专家制定的分类体系结构化。模拟器在对话过程中主动隐藏这些关切,通过理论驱动的逐轮沟通信号追踪其是否被揭示及处理,并经由临床医生评审确保合理性,从而实现对任务达成过程(task success)与交互过程(interaction process)的双重评估,特别聚焦于两个核心能力:确认(confirmation,通过多轮对话揭示隐性关切)和干预(intervention,针对主要关切引导患者走向目标治疗方案)。

Link: https://arxiv.org/abs/2604.08788
Authors: Yikun Han, Joey Chan, Jingyuan Chen, Mengting Ai, Simo Du, Yue Guo
Affiliations: University of Illinois Urbana-Champaign; NYC Health + Hospitals/Jacobi
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Patient-clinician communication is an asymmetric-information problem: patients often do not disclose fears, misconceptions, or practical barriers unless clinicians elicit them skillfully. Effective medical dialogue therefore requires reasoning under partial observability: clinicians must elicit latent concerns, confirm them through interaction, and respond in ways that guide patients toward appropriate care. However, existing medical dialogue benchmarks largely sidestep this challenge by exposing hidden patient state, collapsing elicitation into extraction, or evaluating responses without modeling what remains hidden. We present MedConceal, a benchmark with an interactive patient simulator for evaluating hidden-concern reasoning in medical dialogue, comprising 300 curated cases and 600 clinician-LLM interactions. Built from clinician-answered online health discussions, each case pairs clinician-visible context with simulator-internal hidden concerns derived from prior literature and structured using an expert-developed taxonomy. The simulator withholds these concerns from the dialogue agent, tracks whether they have been revealed and addressed via theory-grounded turn-level communication signals, and is clinician-reviewed for clinical plausibility. This enables process-aware evaluation of both task success and the interaction process that leads to it. We study two abilities: confirmation, surfacing hidden concerns through multi-turn dialogue, and intervention, addressing the primary concern and guiding the patient toward a target plan. Results show that no single system dominates: frontier models lead on different confirmation metrics, while human clinicians (N=159) remain strongest on intervention success. Together, these results identify hidden-concern reasoning under partial observability as a key unresolved challenge for medical dialogue systems.
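The withhold-track-address loop described above can be sketched as a tiny state machine. Everything here is illustrative: the concern topics, the keyword-matching reveal rule, and the metric name are assumptions, not the benchmark's actual implementation.

```python
# Minimal sketch of a hidden-concern patient simulator (illustrative only).
# The concern topics and keyword-based reveal rule are assumptions.

class PatientSimulator:
    def __init__(self, hidden_concerns):
        # Map each hidden concern to keywords that would surface it.
        self.hidden_concerns = dict(hidden_concerns)
        self.revealed = set()
        self.addressed = set()

    def clinician_turn(self, utterance):
        """Reveal a concern only when the clinician probes its topic."""
        text = utterance.lower()
        for concern, keywords in self.hidden_concerns.items():
            if concern not in self.revealed and any(k in text for k in keywords):
                self.revealed.add(concern)
                return f"Actually, I am worried about {concern}."
        return "I'm not sure what else to mention."

    def mark_addressed(self, concern):
        if concern in self.revealed:
            self.addressed.add(concern)

    def confirmation_rate(self):
        return len(self.revealed) / len(self.hidden_concerns)

sim = PatientSimulator({
    "medication cost": ["cost", "afford", "price"],
    "fear of side effects": ["side effect", "risk", "safe"],
})
print(sim.clinician_turn("Is this treatment safe for me?"))
print(sim.confirmation_rate())  # 0.5 after one concern is surfaced
```

The point of the sketch is the evaluation split the paper makes: confirmation can be scored from `revealed` alone, while intervention additionally needs `addressed`.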

[NLP-46] MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮对话(multi-turn, MT)场景中因上下文信息分散导致的性能下降问题。当前主流方法将完整对话历史附加至提示词(prompt),易耗尽上下文窗口,引发延迟升高、计算成本增加及长对话收益递减。其解决方案的关键在于提出MT-OSC(One-off Sequential Condensation)框架,通过后台自动压缩对话历史:核心组件为Condenser Agent,结合基于少样本推理的Condenser与轻量级Decider,实现对关键信息的选择性保留,可在10轮对话中减少高达72%的token数量,并显著缩小多轮性能差距,提升模型在多样化任务上的准确性和鲁棒性。

Link: https://arxiv.org/abs/2604.08782
Authors: Jyotika Singh, Fang Tu, Miguel Ballesteros, Weiyi Sun, Sandip Ghoshal, Michelle Yuan, Yassine Benajiba, Sujith Ravi, Dan Roth
Affiliations: Oracle AI
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) suffer significant performance degradation when user instructions and context are distributed over multiple conversational turns, yet multi-turn (MT) interactions dominate chat interfaces. The routine approach of appending full chat history to prompts rapidly exhausts context windows, leading to increased latency, higher computational costs, and diminishing returns as conversations extend. We introduce MT-OSC, a One-off Sequential Condensation framework that efficiently and automatically condenses chat history in the background without disrupting the user experience. MT-OSC employs a Condenser Agent that uses a few-shot inference-based Condenser and a lightweight Decider to selectively retain essential information, reducing token counts by up to 72% in 10-turn dialogues. Evaluated across 13 state-of-the-art LLMs and diverse multi-turn benchmarks, MT-OSC consistently narrows the multi-turn performance gap - yielding improved or preserved accuracy across datasets while remaining robust to distractors and irrelevant turns. Our results establish MT-OSC as a scalable solution for multi-turn chats, enabling richer context within constrained input spaces, reducing latency and operational cost, while balancing performance.
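The Decider/Condenser split can be illustrated with trivial stand-ins: a word-count budget for the Decider and a keep-recent-turns heuristic for the Condenser. MT-OSC itself uses few-shot LLM inference for both components; everything below is a placeholder sketch, not the paper's method.

```python
def decider(history, max_words=120):
    """Stand-in Decider: condense only when the history exceeds a budget."""
    return sum(len(turn.split()) for _, turn in history) > max_words

def condenser(history, keep_last=2):
    """Stand-in Condenser: keep recent turns verbatim, crush older turns
    into a one-line note.  MT-OSC uses few-shot LLM inference here."""
    old, recent = history[:-keep_last], history[-keep_last:]
    note = "; ".join(f"{role} mentioned: {turn.split('.')[0]}" for role, turn in old)
    return [("summary", note)] + recent

history = [
    ("user", "I want to plan a trip to Kyoto. I will travel in April."),
    ("assistant", "April is cherry-blossom season. Book hotels early."),
    ("user", "My budget is 2000 dollars. What about transport?"),
]
condensed = condenser(history)
print(decider(history))      # False: this short history fits the budget
print(len(condensed))        # 3 entries: one summary note + two recent turns
print(condensed[0][0])       # 'summary'
```

Even this crude version shows the shape of the token savings: older turns collapse to a constant-size note while the most recent context stays intact.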

[NLP-47] Revisiting Anisotropy in Language Transformers: The Geometry of Learning Dynamics

Summary: This paper revisits the inherent anisotropy of Transformer representations in natural language processing (NLP), a phenomenon that obstructs accurate geometric interpretation of these models; prior theoretical accounts are rarely grounded in the underlying representation geometry. The authors derive geometric arguments for how frequency-biased sampling attenuates curvature visibility and why training preferentially amplifies tangent directions. The key methodological move is applying concept-based mechanistic interpretability during training rather than only post hoc: fitting activation-derived low-rank tangent proxy directions and testing them against ordinary backpropagated true gradients shows that these directions capture unusually large gradient energy and a substantially larger share of gradient anisotropy, providing strong empirical support for a tangent-aligned account of anisotropy.

Link: https://arxiv.org/abs/2604.08764
Authors: Raphael Bernas, Fanny Jourdan, Antonin Poché, Céline Hudelot
Affiliations: Not listed
Subjects: Computation and Language (cs.CL); Differential Geometry (math.DG)
Comments:

Abstract:Since their introduction, Transformer architectures have dominated Natural Language Processing (NLP). However, recent research has highlighted an inherent anisotropy phenomenon in these models, presenting a significant challenge to their geometric interpretation. Previous theoretical studies on this phenomenon are rarely grounded in the underlying representation geometry. In this paper, we extend them by deriving geometric arguments for how frequency-biased sampling attenuates curvature visibility and why training preferentially amplifies tangent directions. Empirically, we then use concept-based mechanistic interpretability during training, rather than only post hoc, to fit activation-derived low-rank tangent proxies and test them against ordinary backpropagated true gradients. Across encoder-style and decoder-style language models, we find that these activation-derived directions capture both unusually large gradient energy and a substantially larger share of gradient anisotropy than matched-rank normal controls, providing strong empirical support for a tangent-aligned account of anisotropy.
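As background for the anisotropy under discussion, a standard way to quantify it is the average pairwise cosine similarity of representation vectors: near 0 when directions are isotropic, approaching 1 when vectors collapse into a narrow cone. The toy vectors below are synthetic; this is not the paper's measurement code.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def anisotropy(vectors):
    """Average pairwise cosine similarity across distinct vector pairs."""
    sims = [cosine(vectors[i], vectors[j])
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)

random.seed(0)
# Isotropic baseline: random gaussian directions in 64 dimensions.
iso = [[random.gauss(0, 1) for _ in range(64)] for _ in range(50)]
# Anisotropic set: the same vectors shifted by a large common mean direction,
# mimicking the "cone" geometry reported for contextual embeddings.
shift = [5.0] * 64
aniso = [[a + s for a, s in zip(v, shift)] for v in iso]

print(round(anisotropy(iso), 3))    # near 0
print(round(anisotropy(aniso), 3))  # near 1
```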

[NLP-48] Optimal Multi-bit Generative Watermarking Schemes Under Worst-Case False-Alarm Constraints

Summary: This paper studies the optimal performance of multi-bit generative watermarking for large language models (LLMs) under a worst-case false-alarm constraint. Prior work established a lower bound on the achievable miss-detection probability in the finite-token regime and proposed a scheme claimed to attain it; this paper shows that scheme is in fact suboptimal. The key contribution is two new encoding-decoding constructions that do attain the previously established lower bound, thereby completely characterizing optimal multi-bit watermarking performance. The approach formulates watermark design as a linear program and derives structural conditions for optimality, while also identifying the failure mechanism of the earlier construction and comparing the tradeoffs between the two new schemes.

Link: https://arxiv.org/abs/2604.08759
Authors: Yu-Shin Huang, Chao Tian, Krishna Narayanan
Affiliations: Not listed
Subjects: Information Theory (cs.IT); Computation and Language (cs.CL)
Comments: 41 pages, 8 tables

Abstract:This paper considers the problem of multi-bit generative watermarking for large language models under a worst-case false-alarm constraint. Prior work established a lower bound on the achievable miss-detection probability in the finite-token regime and proposed a scheme claimed to achieve this bound. We show, however, that the proposed scheme is in fact suboptimal. We then develop two new encoding-decoding constructions that attain the previously established lower bound, thereby completely characterizing the optimal multi-bit watermarking performance. Our approach formulates the watermark design problem as a linear program and derives the structural conditions under which optimality can be achieved. In addition, we identify the failure mechanism of the previous construction and compare the tradeoffs between the two proposed schemes.

[NLP-49] Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models

Summary: This paper asks how well large language models (LLMs) align with human humor judgment, i.e., whether models can accurately recognize content humans find funny. The key design has five frontier LLMs play the same Cards Against Humanity (CAH) games as human players, selecting the funniest of ten candidate cards across 9,894 rounds, with choices compared against human preferences. All models beat the random baseline, yet their alignment with human preference is modest; more strikingly, the models agree with each other far more often than with humans, and this agreement is partly explained by systematic position biases and content preferences, suggesting that LLM humor judgment may reflect structural artifacts of inference and alignment rather than genuine human-like humor understanding.

Link: https://arxiv.org/abs/2604.08757
Authors: Yousra Fettach, Guillaume Bied, Hannu Toivonen, Tijl De Bie
Affiliations: Ghent University; University of Helsinki
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Humor is one of the most culturally embedded and socially significant dimensions of human communication, yet it remains largely unexplored as a dimension of Large Language Model (LLM) alignment. In this study, five frontier language models play the same Cards Against Humanity games (CAH) as human players. The models select the funniest response from a slate of ten candidate cards across 9,894 rounds. While all models exceed the random baseline, alignment with human preference remains modest. More striking is that models agree with each other substantially more often than they agree with humans. We show that this preference is partly explained by systematic position biases and content preferences, raising the question whether LLM humor judgment reflects genuine preference or structural artifacts of inference and alignment.

[NLP-50] LLMs Underperform Graph-Based Parsers on Supervised Relation Extraction for Complex Graphs ACL2026

Summary: This paper examines the shortfall of large language models (LLMs) on relation extraction when the linguistic graph underlying a text is complex. Evaluating four LLMs against a lightweight graph-based parser on six relation extraction datasets with sentence graphs of varying size and complexity, the authors show that the parser increasingly outperforms the LLMs as the number of relations in the input documents grows. The key takeaway is that for complex linguistic graphs, a much lighter but structure-aware graph-based parser better models multi-hop dependencies and relation interactions within sentences, making it the superior choice.

Link: https://arxiv.org/abs/2604.08752
Authors: Paolo Gajo, Domenic Rosati, Hassan Sajjad, Alberto Barrón-Cedeño
Affiliations: University of Bologna; Dalhousie University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ACL 2026 (Main Conference)

Abstract:Relation extraction represents a fundamental component in the process of creating knowledge graphs, among other applications. Large language models (LLMs) have been adopted as a promising tool for relation extraction, both in supervised and in-context learning settings. However, in this work we show that their performance still lags behind much smaller architectures when the linguistic graph underlying a text has great complexity. To demonstrate this, we evaluate four LLMs against a graph-based parser on six relation extraction datasets with sentence graphs of varying sizes and complexities. Our results show that the graph-based parser increasingly outperforms the LLMs, as the number of relations in the input documents increases. This makes the much lighter graph-based parser a superior choice in the presence of complex linguistic graphs.

[NLP-51] Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

Summary: This paper investigates which properties of preference data actually drive reasoning gains during preference optimization, a question current practice leaves unclear. It isolates two notions of quality delta: generator-level delta, the capability gap between the models that produce the chosen and rejected reasoning traces, and sample-level delta, the judged quality gap within an individual preference pair. The resulting recipe is twofold: maximize generator-level delta when constructing preference pairs to improve out-of-domain reasoning, and exploit sample-level delta to filter the most informative training examples, enabling more data-efficient and robust training of reasoning models.

Link: https://arxiv.org/abs/2604.08723
Authors: Chia-Hsuan Lee, Mingyang Zhou, Renkun Ni, Zelei Cheng, Sihui Dai, Supriyo Chakraborty, Shixiong Zhang, Sambit Sahu, William Campbell
Affiliations: Capital One
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair improve a reasoning model’s performance on general reasoning tasks? We investigate two distinct notions of quality delta in preference data: generator-level delta, arising from the differences in capability between models that generate chosen and rejected reasoning traces, and sample-level delta, arising from differences in judged quality differences within an individual preference pair. To study generator-level delta, we vary the generator’s scale and model family, and to study sample-level delta, we employ an LLM-as-a-judge to rate the quality of generated traces along multiple reasoning-quality dimensions. We find that increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks and filtering data by sample-level delta can enable more data-efficient training. Our results suggest a twofold recipe for improving reasoning performance through preference optimization: maximize generator-level delta when constructing preference pairs and exploit sample-level delta to select the most informative training examples.

[NLP-52] Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition ACL2026

Summary: This paper addresses the reliability challenges that inter-agent interaction introduces in LLM-based multi-agent systems (MAS): cascading uncertainty in multi-step reasoning, variability of inter-agent communication paths, and diversity of communication topologies, which existing uncertainty quantification methods designed for single-turn outputs fail to capture. The key idea of the proposed MATU framework is to quantify uncertainty via tensor decomposition over entire reasoning trajectories: multiple execution runs are organized into a higher-order tensor, and tensor decomposition disentangles and quantifies the distinct sources of uncertainty, yielding a comprehensive reliability measure that generalizes across agent structures.

Link: https://arxiv.org/abs/2604.08708
Authors: Tiejin Chen, Huaiyuan Yao, Jia Chen, Evangelos E. Papalexakis, Hua Wei
Affiliations: Arizona State University; University of California, Riverside
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACL 2026

Abstract:While Large Language Model-based Multi-Agent Systems (MAS) consistently outperform single-agent systems on complex tasks, their intricate interactions introduce critical reliability challenges arising from communication dynamics and role dependencies. Existing Uncertainty Quantification methods, typically designed for single-turn outputs, fail to address the unique complexities of the MAS. Specifically, these methods struggle with three distinct challenges: the cascading uncertainty in multi-step reasoning, the variability of inter-agent communication paths, and the diversity of communication topologies. To bridge this gap, we introduce MATU, a novel framework that quantifies uncertainty through tensor decomposition. MATU moves beyond analyzing final text outputs by representing entire reasoning trajectories as embedding matrices and organizing multiple execution runs into a higher-order tensor. By applying tensor decomposition, we disentangle and quantify distinct sources of uncertainty, offering a comprehensive reliability measure that is generalizable across different agent structures. We provide comprehensive experiments to show that MATU effectively estimates holistic and robust uncertainty across diverse tasks and communication topologies.
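The tensor view of repeated runs can be illustrated with a minimal sketch: trajectories from several runs are stacked into a runs × steps × dims tensor, and disagreement is scored as the share of energy a rank-1 fit fails to explain. The rank-1 energy proxy (computed via power iteration) is a stand-in assumption; MATU's actual decomposition and uncertainty measure differ.

```python
import math
import random

def unfold_runs(tensor):
    """Flatten a runs x steps x dims tensor into a runs x (steps*dims) matrix."""
    return [[x for step in run for x in step] for run in tensor]

def top_singular_value(A, iters=100):
    """Power iteration on A^T A to estimate the largest singular value of A."""
    n = len(A[0])
    random.seed(0)
    v = [random.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(n)) for row in A]          # A v
        v = [sum(A[i][j] * w[i] for i in range(len(A))) for j in range(n)]  # A^T w
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    w = [sum(row[j] * v[j] for j in range(n)) for row in A]
    return math.sqrt(sum(x * x for x in w))

def decomposition_uncertainty(tensor):
    """Share of tensor energy NOT captured by a rank-1 fit: ~0 when runs agree."""
    A = unfold_runs(tensor)
    total = sum(x * x for row in A for x in row)
    s1 = top_singular_value(A)
    return 1.0 - (s1 * s1) / total

base = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]   # one trajectory (steps x dims)
consistent = [base, base, base, base]          # four identical runs: rank-1 tensor
random.seed(1)
noisy = [[[x + random.gauss(0, 0.5) for x in step] for step in base]
         for _ in range(4)]
print(decomposition_uncertainty(consistent))   # ~0: runs agree exactly
print(decomposition_uncertainty(noisy) > decomposition_uncertainty(consistent))
```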

[NLP-53] Skip-Connected Policy Optimization for Implicit Advantage

Summary: This paper targets the high-variance, sign-inconsistent advantage estimates that Monte Carlo estimation produces when fine-grained dense rewards are used in reinforcement learning with verifiable rewards (RLVR), especially for early reasoning tokens, where dense rewards paradoxically underperform outcome-only Group Relative Policy Optimization (GRPO). The key idea of Skip-Connected Optimization (SKPO) is to decompose reasoning into upstream and downstream phases: the upstream phase receives dense rewards from downstream Monte Carlo sampling under single-stream optimization, while the downstream phase keeps group-relative optimization; a skip connection concatenates the upstream segment with the original problem, letting the model exploit helpful upstream reasoning while retaining direct access to the problem so it can bypass flawed reasoning paths, yielding more stable and efficient training.

Link: https://arxiv.org/abs/2604.08690
Authors: Fengwei Teng, Jinyi Bai, Xinhao Yao, Demi Ruohan Wang, Jiahao Zhao, Zhijiang Guo
Affiliations: Not listed
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Group Relative Policy Optimization (GRPO) has proven effective in RLVR by using outcome-based rewards. While fine-grained dense rewards can theoretically improve performance, we reveal that under practical sampling budgets, Monte Carlo estimation yields high-variance and sign-inconsistent advantages for early reasoning tokens, paradoxically underperforming outcome-only GRPO. We propose Skip-Connected Optimization (SKPO), which decomposes reasoning into upstream and downstream phases: upstream receives dense rewards from downstream Monte Carlo sampling with single-stream optimization; downstream maintains group-relative optimization, where a skip connection concatenates the upstream segment with the original problem, enabling the model to leverage helpful upstream reasoning while preserving the freedom to bypass flawed reasoning through direct problem access. Experiments demonstrate improvements of 3.91% and 6.17% relative gains over the strongest baselines on Qwen2.5-Math-7B and Llama-3.2-3B respectively across mathematical benchmarks and out-of-domain tasks including general reasoning and code generation. Further analysis reveals an implicit advantage: SKPO generates trajectories with higher intermediate-step quality even when matched for final correctness.

[NLP-54] EXAONE 4.5 Technical Report

Summary: This report addresses the gaps of current multimodal models in document understanding and language-specific scenarios such as Korean contextual reasoning, with an eye toward industrial deployment and long-context needs. The key ingredients are: integrating a dedicated visual encoder into the EXAONE 4.0 framework to enable native joint pretraining over visual and textual modalities; careful curation of document-centric corpora, which yields substantial gains on document understanding tasks; and extending the context length to 256K tokens to support long-range dependencies and enterprise-scale use cases. The resulting model stays competitive on general benchmarks while outperforming similar-scale state-of-the-art models on document understanding and Korean contextual reasoning.

Link: https://arxiv.org/abs/2604.08644
Authors: Eunbi Choi,Kibong Choi,Sehyun Chun,Seokhee Hong,Junwon Hwang,Hyojin Jeon,Ahra Jo,Hyunjik Jo,Yeonsik Jo,Joonkee Kim,Seonghwan Kim,Soyeon Kim,Sunkyoung Kim,Yireun Kim,Yongil Kim,Changhun Lee,Haeju Lee,Jinsik Lee,Kyungmin Lee,Sangha Park,Kwangrok Ryoo,Minju Seo,Sejong Yang,Heuiyeen Yeen,Hwan Chang,Stanley Jungkyu Choi,Yejin Choi,Kyubeen Han,Joonwon Jang,Kijeong Jeon,Geunyeong Jeong,Gerrard Jeongwon Jo,Jiyeon Jung,Daeseong Kim,Dohoon Kim,Dohyun Kim,Hyunseo Kim,Minu Kim,Myoungshin Kim,Youchul Kim,Byungoh Ko,Christopher Lee,Edward Hwayoung Lee,Honglak Lee,Jiyoung Lee,Sangeun Lee,Seungwon Lim,Woohyung Lim,Jueun Mun,Jaewoo Park,Jimin Park,Jinho Park,Yongmin Park,Wooseok Seo,Yongwoo Song,Sihyuk Yi,Kyungjae Yoo,Sangyeon Yoon
Affiliations: LG AI Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG’s strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG’s ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.

[NLP-55] From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

Summary: This paper targets a shared architectural failure of existing LLM-based agent systems: they answer from an unrestricted knowledge space without first simulating how business events reshape that space, producing decisions that are fluent but ungrounded and unauditable. The key idea of LOM-action is event-driven ontology simulation: business events trigger scenario conditions encoded in the enterprise ontology (EO), which drive deterministic mutations of a subgraph working copy inside an isolated sandbox, evolving it into the scenario-valid simulation graph G_sim from which all decisions are exclusively derived. This realizes a closed event → simulation → decision pipeline through a dual-mode architecture (skill mode and reasoning mode), with every decision producing a fully traceable audit log, enabling trustworthy enterprise decision intelligence.

Link: https://arxiv.org/abs/2604.08603
Authors: Hongyin Zhu, Jinming Liang, Mengjun Hou, Ruifan Tang, Xianbin Zhu, Jingyuan Yang, Yuanman Mao, Feng Wu
Affiliations: Not listed
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Existing LLM-based agent systems share a common architectural failure: they answer from the unrestricted knowledge space without first simulating how active business scenarios reshape that space for the event at hand – producing decisions that are fluent but ungrounded and carrying no audit trail. We present LOM-action, which equips enterprise AI with event-driven ontology simulation: business events trigger scenario conditions encoded in the enterprise ontology (EO), which drive deterministic graph mutations in an isolated sandbox, evolving a working copy of the subgraph into the scenario-valid simulation graph G_sim; all decisions are derived exclusively from this evolved graph. The core pipeline is event → simulation → decision, realized through a dual-mode architecture – skill mode and reasoning mode. Every decision produces a fully traceable audit log. LOM-action achieves 93.82% accuracy and 98.74% tool-chain F1 against frontier baselines Doubao-1.8 and DeepSeek-V3.2, which reach only 24–36% F1 despite 80% accuracy – exposing the "illusive accuracy" phenomenon. The four-fold F1 advantage confirms that ontology-governed, event-driven simulation, not model scale, is the architectural prerequisite for trustworthy enterprise decision intelligence.

[NLP-56] Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean

Summary: This paper addresses the inability of current LLM evaluation methods (LLM-as-a-Judge, verdict systems, and NLI) to align with human assessment across application domains, which stems from their fixed evaluation strictness. The key solution, Temperature-Controlled Verdict Aggregation (TCVA), combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter T ∈ [0.1, 1.0] that controls rigor: low temperatures yield pessimistic scores suited to safety-critical domains, while high temperatures produce lenient scores appropriate for conversational AI. On three benchmark datasets TCVA consistently outperforms DeepEval and matches RAGAS on faithfulness (Spearman correlation 0.667 vs. 0.676), and adjusting the temperature parameter requires no additional LLM calls.

Link: https://arxiv.org/abs/2604.08595
Authors: Aleksandr Meshkov
Affiliations: Not listed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and NLI, do not always align well with human assessment because they cannot adapt their strictness to the application domain. This paper presents Temperature-Controlled Verdict Aggregation (TCVA), a method that combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter T ∈ [0.1, 1.0] to control evaluation rigor. Low temperatures yield pessimistic scores suited for safety-critical domains; high temperatures produce lenient scores appropriate for conversational AI. Experimental evaluation on three benchmark datasets with human Likert-scale annotations (SummEval and USR) shows that TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman correlation 0.667 vs. 0.676) while consistently outperforming DeepEval. The method requires no additional LLM calls when adjusting the temperature parameter.
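The generalized power mean at the heart of the method is easy to sketch. The mapping from temperature to the power-mean exponent below is an illustrative assumption (the paper defines its own parameterization), and the verdict scores are made up.

```python
import math

def power_mean(scores, p):
    """Generalized power mean M_p: p -> -inf approaches min, p -> +inf approaches max."""
    if p == 0:  # geometric-mean limit
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

def aggregate(verdict_scores, temperature):
    """Map T in [0.1, 1.0] to an exponent: low T -> strict (near-min),
    T = 1.0 -> arithmetic mean.  This linear mapping is an assumption."""
    p = 1.0 - 10.0 * (1.0 - temperature)  # T=1.0 -> p=1, T=0.1 -> p=-8
    return power_mean(verdict_scores, p)

scores = [0.9, 0.8, 0.3]                 # five-level verdicts mapped into [0, 1]
print(round(aggregate(scores, 0.1), 3))  # strict: pulled toward the worst verdict
print(round(aggregate(scores, 1.0), 3))  # lenient: arithmetic mean, 0.667
```

The design point this illustrates is why a single knob suffices: one exponent continuously interpolates between min-like (pessimistic) and mean-like (lenient) aggregation, with no extra model calls.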

[NLP-57] Robust Reasoning Benchmark

Summary: This paper addresses the overfitting of large language models (LLMs) to standard textual formatting in mathematical reasoning, and the resulting fragility of their reasoning: despite strong scores on standard benchmarks, small input-format perturbations can collapse performance. The key contribution is an evaluation pipeline of 14 perturbation techniques applied to the AIME 2024 dataset to stress-test 8 state-of-the-art models; frontier models prove resilient while open-weight reasoning models suffer catastrophic drops. Further, by forcing models to solve multiple unperturbed problems sequentially within a single context window, the study isolates working-memory limits from downstream reasoning failures, revealing accuracy decay on later problems caused by intermediate reasoning steps polluting dense attention. The authors argue that reliable future reasoning architectures will need explicit contextual resets within the Chain-of-Thought, raising open questions about the optimal granularity of atomic reasoning tasks.

Link: https://arxiv.org/abs/2604.08571
Authors: Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey
Affiliations: Cranberry-Lemon University; Vector Institute; Compute Canada Alliance
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open weights reasoning models suffer catastrophic collapses (up to 55% average accuracy drops across perturbations and up to 100% on some), exposing structural fragility. To further disentangle mechanical parsing failures from downstream reasoning failures, we strictly isolate the models’ working memory capacity by forcing models to solve multiple unperturbed mathematical problems sequentially within a single context window. Our results indicate that open weight models ranging from 7B to 120B parameters and Claude Opus 4.6 exhibit accuracy decay on subsequent problems. This degradation demonstrates that intermediate reasoning steps permanently pollute standard dense attention mechanisms. We argue that to achieve reliable reasoning, future reasoning architectures must integrate explicit contextual resets within a model’s own Chain-of-Thought, leading to fundamental open questions regarding the optimal granularity of atomic reasoning tasks.
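The paper's 14 techniques are not reproduced here, but the flavor of a meaning-preserving formatting perturbation can be sketched with two hypothetical transforms of a math problem statement:

```python
import random

def insert_noise_whitespace(text, rate=0.1, seed=0):
    """Randomly double some spaces: the layout changes, the content does not."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        if ch == " " and rng.random() < rate:
            out.append(" ")
    return "".join(out)

def uppercase_random_words(text, rate=0.3, seed=0):
    """Uppercase a random subset of words, another meaning-preserving shift."""
    rng = random.Random(seed)
    return " ".join(w.upper() if rng.random() < rate else w
                    for w in text.split(" "))

problem = "Find the remainder when 2024 is divided by 7."
print(insert_noise_whitespace(problem))
print(uppercase_random_words(problem))
```

A robust reasoner should be invariant to both transforms, since neither changes a single token of mathematical content; the benchmark's premise is that many models are not.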

[NLP-58] Can We Still Hear the Accent? Investigating the Resilience of Native Language Signals in the LLM Era

Summary: This paper asks whether the evolution of writing assistance tools from machine translation to large language models (LLMs) is homogenizing the linguistic style of research papers. To answer this, the authors build a semi-automatically labeled dataset and fine-tune a classifier to detect linguistic fingerprints of author backgrounds. The key analysis tracks native language identification (NLI) performance on ACL Anthology papers across three eras (pre-neural-network, pre-LLM, and post-LLM), showing a consistent decline over time along with anomalies in the post-LLM era: declines are sharper than expected for Japanese and Korean papers, while Chinese and French show resistance or divergent trends.

Link: https://arxiv.org/abs/2604.08568
Authors: Nabelanita Utami, Sasano Ryohei
Affiliations: Nagoya University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The evolution of writing assistance tools from machine translation to large language models (LLMs) has changed how researchers write. This study investigates whether this shift is homogenizing research papers by analyzing native language identification (NLI) trends in ACL Anthology papers across three eras: pre-neural network (NN), pre-LLM, and post-LLM. We construct a labeled dataset using a semi-automated framework and fine-tune a classifier to detect linguistic fingerprints of author backgrounds. Our analysis shows a consistent decline in NLI performance over time. Interestingly, the post-LLM era reveals anomalies: while Chinese and French show unexpected resistance or divergent trends, Japanese and Korean exhibit sharper-than-expected declines.

[NLP-59] Sentiment Classification of Gaza War Headlines: A Comparative Analysis of Large Language Models and Arabic Fine-Tuned BERT Models

Summary: This paper examines algorithmic divergence in sentiment interpretation of conflict-related media discourse, asking whether different AI architectures differ systematically in war contexts and what those differences mean hermeneutically. The key move is to abandon accuracy evaluation against a single human-annotated gold standard in favor of an epistemological framing that treats sentiment classification as an interpretive act shaped by model architecture, quantifying distributional differences across models with information-theoretic metrics (Shannon entropy, Jensen-Shannon distance, and a variance score). Results show that fine-tuned BERT models, especially MARBERT, skew toward neutral classifications, while LLMs consistently amplify negative sentiment; GPT-4.1 adjusts its judgments with narrative framing, whereas other LLMs show little contextual sensitivity. The choice of model is thus itself a choice of interpretive lens, cautioning against treating automated sentiment outputs as neutral or interchangeable in contexts of war and crisis.

Link: https://arxiv.org/abs/2604.08566
Authors: Amr Eleraqi, Hager H. Mustafa, Abdul Hadi N. Ahmed
Affiliations: Not listed
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 45 pages, 6 figures (including diagrams), 8 tables. Dataset available at this https URL. Previously posted at this https URL

Abstract:This study examines how different artificial intelligence architectures interpret sentiment in conflict-related media discourse, using the 2023 Gaza War as a case study. Drawing on a corpus of 10,990 Arabic news headlines (Eleraqi 2026), the research conducts a comparative analysis between three large language models and six fine-tuned Arabic BERT models. Rather than evaluating accuracy against a single human-annotated gold standard, the study adopts an epistemological approach that treats sentiment classification as an interpretive act produced by model architectures. To quantify systematic differences across models, the analysis employs information-theoretic and distributional metrics, including Shannon Entropy, Jensen-Shannon Distance, and a Variance Score measuring deviation from aggregate model behavior. The results reveal pronounced and non-random divergence in sentiment distributions. Fine-tuned BERT models, particularly MARBERT, exhibit a strong bias toward neutral classifications, while LLMs consistently amplify negative sentiment, with LLaMA-3.1-8B showing near-total collapse into negativity. Frame-conditioned analysis further demonstrates that GPT-4.1 adjusts sentiment judgments in line with narrative frames (e.g., humanitarian, legal, security), whereas other LLMs display limited contextual modulation. These findings suggest that the choice of model constitutes a choice of interpretive lens, shaping how conflict narratives are algorithmically framed and emotionally evaluated. The study contributes to media studies and computational social science by foregrounding algorithmic discrepancy as an object of analysis and by highlighting the risks of treating automated sentiment outputs as neutral or interchangeable measures of media tone in contexts of war and crisis.
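The information-theoretic metrics named above are standard and can be computed directly. The two sentiment distributions below are hypothetical stand-ins shaped like the reported model behaviors (neutral-heavy vs. negativity-amplifying); they are not the study's data.

```python
import math

def shannon_entropy(p):
    """Entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl(p, q):
    """Kullback-Leibler divergence in bits (assumes q > 0 wherever p > 0)."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def jensen_shannon_distance(p, q):
    """Square root of JS divergence: symmetric and bounded by 1 with base-2 logs."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Hypothetical sentiment distributions over (negative, neutral, positive).
bert_like = [0.25, 0.60, 0.15]  # neutral-heavy, as reported for MARBERT
llm_like = [0.85, 0.10, 0.05]   # negativity-amplifying, as reported for LLMs
print(round(shannon_entropy(bert_like), 3))
print(round(jensen_shannon_distance(bert_like, llm_like), 3))
```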

[NLP-60] Dynamic sparsity in tree-structured feed-forward layers at scale

Summary: This paper addresses the large share of compute that the feed-forward (FFN) block consumes in Transformers at typical context lengths, proposing sparse drop-in replacements for dense FFN blocks. The key idea is tree-structured feed-forward layers with hard hierarchical routing, enabling conditional computation without a separate router network. Experiments show that even with fewer than 5% of FFN units active per token, the models match dense baselines on autoregressive language modeling and downstream question answering (including zero- and few-shot settings), scaling beyond 1B parameters. The authors also identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, partially converting dynamic routing into static structural sparsity, and show that simple architectural choices can modulate this behavior to recover balanced trees without auxiliary losses.

Link: https://arxiv.org/abs/2604.08565
Authors: Reza Sedghi, Robin Schiewer, Anand Subramoney, David Kappel
Affiliations: Not listed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer’s compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied for autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and its scalability beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block’s units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, yielding partial conversion of dynamic routing into static structural sparsity. We show that simple architectural choices can modulate this behavior and recover balanced trees without auxiliary losses. Overall, our work demonstrates that tree-structured feed-forward layers provide a scalable and controllable mechanism for sparsifying large transformer models.
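A hard hierarchical router with no separate router network can be sketched as follows. The routing rule (sign of a dot product at each internal node) and the tiny sizes are illustrative assumptions, not the paper's architecture.

```python
import random

random.seed(0)
DIM, DEPTH = 8, 3  # tiny model: 2**DEPTH = 8 leaf "expert" units

def rand_vec(n):
    return [random.gauss(0, 1) for _ in range(n)]

# Each internal node holds a routing vector; each leaf a small linear expert.
routers = [rand_vec(DIM) for _ in range(2 ** DEPTH - 1)]
leaves = [[rand_vec(DIM) for _ in range(DIM)] for _ in range(2 ** DEPTH)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tree_ffn(x):
    """Hard hierarchical routing: one root-to-leaf path per token,
    with no separate router network (routing vectors live in the tree)."""
    node = 0
    for _ in range(DEPTH):
        go_right = dot(routers[node], x) > 0   # hard (non-differentiable) decision
        node = 2 * node + (2 if go_right else 1)
    leaf = node - (2 ** DEPTH - 1)             # heap index -> leaf index
    W = leaves[leaf]
    return [dot(row, x) for row in W]          # only 1 of 8 experts is touched

x = rand_vec(DIM)
y = tree_ffn(x)
print(len(y))  # 8: output dim matches input; 1/8 of expert units active
```

Note how the active-unit fraction falls geometrically with depth (1 of 2**DEPTH leaves), which is the mechanism behind the "fewer than 5% of units per token" figure in the abstract.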

[NLP-61] Attention-Based Sampler for Diffusion Language Models

Summary: This paper targets the inference inefficiency and limited modeling flexibility caused by the strictly sequential decoding of auto-regressive language models (ARMs). Diffusion-based large language models (dLLMs) offer parallel decoding, but mainstream decoding strategies rely only on token-level information and ignore global sequence structure, degrading generation quality. The key contribution revisits decoding-order selection from a log-likelihood maximization perspective, proving that decoding tokens in descending order of the attention matrix's column sums approximately achieves optimal sequence likelihood. This theoretically grounds attention-guided decoding and yields the training-free Attn-Sampler algorithm, combined with a block attention approximation and dynamic attention thresholding for practical acceleration, improving both generation quality and decoding parallelism.

Link: https://arxiv.org/abs/2604.08564
Authors: Yuyan Zhou, Kai Syun Hou, Weiyu Chen, James Kwok
Affiliations: Not listed
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential decoding paradigm imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel decoding and flexible language modeling. Despite these advantages, current dLLMs decoding strategies rely primarily on token level information, which fails to account for global sequence structure and often yields suboptimal results. In this paper, we study the decoding order selection problem from the perspective of log-likelihood maximization. We theoretically demonstrate that optimal sequence likelihood can be approximately achieved by decoding tokens in descending order of their attention matrix column sums. This finding provides a principled justification for attention-guided decoding and offers a theoretically grounded alternative to greedy search. We instantiate this theoretical insight in a new training-free decoding algorithm, termed Attn-Sampler, and further propose a block attention approximation and dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of our proposed method, demonstrating that it achieves superior generation quality while enhancing the decoding parallelism.
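The ordering rule the abstract analyzes — positions ranked by descending column sums of the attention matrix — is simple to state in code. The 4×4 matrix below is a made-up example, not output from a dLLM.

```python
def decode_order_from_attention(attn):
    """Order positions by descending column sum of the attention matrix."""
    n = len(attn[0])
    col_sums = [sum(row[j] for row in attn) for j in range(n)]
    return sorted(range(n), key=lambda j: -col_sums[j])

# Hypothetical 4x4 attention matrix (rows = queries, columns = keys).
attn = [
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.3, 0.4, 0.2, 0.1],
]
print(decode_order_from_attention(attn))  # [1, 0, 2, 3]
```

Here column 1 draws the most total attention (sum 2.2), so position 1 would be decoded first; ties (columns 0 and 2, both 0.7) fall back to index order because Python's sort is stable.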

[NLP-62] Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models

【速读】: 该论文旨在解决扩展推理模型(Extended Reasoning Models)在复杂问题求解中,采样温度(sampling temperature)与提示策略(prompting strategy)配置优化不足的问题。其关键解决方案在于系统性地评估零样本提示(zero-shot prompting)与思维链提示(chain-of-thought prompting)在四种温度设置(0.0、0.4、0.7、1.0)下的表现,发现提示策略与温度之间存在显著交互效应:零样本提示在中等温度(T=0.4 和 T=0.7)时达到最优准确率(59%),而思维链提示则在极端温度下表现更优;更重要的是,扩展推理带来的性能增益随温度升高从6倍提升至14.3倍,表明温度应与提示策略联合优化,而非默认固定为T=0,从而挑战了当前推理任务中普遍采用的低温采样惯例。

链接: https://arxiv.org/abs/2604.08563
作者: Mousa Salah,Amgad Muneer
机构: Gujarat Technological University(古吉拉特技术大学); The University of Texas MD Anderson Cancer Center(德克萨斯大学MD安德森癌症中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 3 Figures, 2 Tables

点击查看摘要

Abstract:Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.
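该文研究的采样温度本质上是对 softmax 的缩放。下面给出温度缩放 softmax 的极简示意(logits 为虚构数值,仅用于说明 T 越小分布越尖锐、T 越大越平缓):

```python
import math

def temperature_softmax(logits, T):
    """Temperature-scaled softmax over raw logits. Small T sharpens the
    distribution toward argmax (greedy decoding); large T flattens it."""
    if T <= 0:
        raise ValueError("use greedy argmax for T = 0 instead of dividing by zero")
    scaled = [l / T for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

probs_sharp = temperature_softmax([2.0, 1.0, 0.0], T=0.4)
probs_flat = temperature_softmax([2.0, 1.0, 0.0], T=1.0)
```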

[NLP-63] Neural networks for Text-to-Speech evaluation

【速读】: 该论文旨在解决文本到语音(Text-to-Speech, TTS)系统在大规模部署时难以确保人类感知质量的问题。当前主流的主观评估方法如平均意见得分(Mean Opinion Score, MOS)和对比评估(Side-by-Side, SBS)虽为金标准,但存在成本高、效率低及评估者偏倚敏感等缺陷。解决方案的关键在于构建一套新型神经模型,用于逼近专家判断:针对相对评估提出基于HuBERT特征的NeuralSBS模型,在SOMOS数据集上达到73.7%准确率;针对绝对评估改进MOSNet并引入WhisperBert——一种融合Whisper音频特征与BERT文本嵌入的多模态堆叠集成模型,其RMSE降至约0.40,显著优于人类评分者间RMSE基准(0.62)。此外,实验表明直接通过交叉注意力融合文本信息会损害性能,凸显了基于弱学习器的堆叠集成策略的有效性,同时验证了专用度量学习框架对TTS质量评估的重要性。

链接: https://arxiv.org/abs/2604.08562
作者: Ilya Trofimenko,David Kocharyan,Aleksandr Zaitsev,Pavel Repnikov,Mark Levin,Nikita Shevtsov
机构: HSE University(高等经济大学); Institute for System Programming, Russian Academy of Sciences(俄罗斯科学院系统编程研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy (on the SOMOS dataset). For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak learners. Our best MOS models achieve a Root Mean Square Error (RMSE) of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62. Furthermore, our ablation studies reveal that naively fusing text via cross-attention can degrade performance, highlighting the effectiveness of ensemble-based stacking over direct latent fusion. We additionally report negative results with SpeechLM-based architectures and zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 flash preview), reinforcing the necessity of dedicated metric learning frameworks.

[NLP-64] A Representation-Level Assessment of Bias Mitigation in Foundation Models ECML-PKDD2025

【速读】: 该论文旨在解决基础模型中性别偏见的内在表征问题,特别是探究偏见缓解技术如何重塑编码器-only(如BERT)和解码器-only(如Llama2)模型的嵌入空间,从而实现对模型行为的内部审计。其解决方案的关键在于通过对比基准模型与偏见缓解版本的嵌入空间中性别与职业术语之间的关联变化,发现偏见缓解能有效降低性别-职业差异,使内部表示更加中立和均衡;这种可解释且几何化的表征迁移在两类模型中具有一致性,表明公平性改进可通过嵌入空间的结构化变换来验证。此外,为促进对解码器-only模型的评估,作者还构建了WinoDec数据集(含4,000条包含性别和职业术语的序列),并公开发布以支持后续研究。

链接: https://arxiv.org/abs/2604.08561
作者: Svetoslav Nizhnichenkov,Rahul Nair,Elizabeth Daly,Brian Mac Namee
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ECML-PKDD 2025 (5th Workshop on Bias and Fairness in AI)

点击查看摘要

Abstract:We investigate how successful bias mitigation reshapes the embedding space of encoder-only and decoder-only foundation models, offering an internal audit of model behaviour through representational analysis. Using BERT and Llama2 as representative architectures, we assess the shifts in associations between gender and occupation terms by comparing baseline and bias-mitigated variants of the models. Our findings show that bias mitigation reduces gender-occupation disparities in the embedding space, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, suggesting that fairness improvements can manifest as interpretable and geometric transformations. These results position embedding analysis as a valuable tool for understanding and validating the effectiveness of debiasing methods in foundation models. To further promote the assessment of decoder-only models, we introduce WinoDec, a dataset consisting of 4,000 sequences with gender and occupation terms, and release it to the general public. (this https URL)
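论文考察嵌入空间中性别与职业词关联的几何变化。下面用余弦相似度差值给出一个度量该关联的玩具示例:三维向量均为虚构,仅示意"去偏后职业词与两个性别方向等距"这一直觉,并非论文的度量定义:

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(a * a for a in v)))

def occupation_gender_gap(occupation, he, she):
    """Signed gender-association gap for one occupation embedding:
    positive means closer to the 'he' direction, negative to 'she'."""
    return cos(occupation, he) - cos(occupation, she)

# Toy 3-d stand-ins for real model embeddings (hypothetical values).
he = (1.0, 0.0, 0.0)
she = (0.0, 1.0, 0.0)
nurse_biased = (0.1, 0.9, 0.1)     # baseline model: skewed toward 'she'
nurse_debiased = (0.5, 0.5, 0.1)   # mitigated model: balanced

gap_before = occupation_gender_gap(nurse_biased, he, she)
gap_after = occupation_gender_gap(nurse_debiased, he, she)
```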

[NLP-65] Uncertainty Estimation for the Open-Set Text Classification systems

【速读】: 该论文旨在解决开放集文本分类(Open-Set Text Classification, OSTC)中预测不确定性估计不准确的问题,即如何有效识别模型在面对未知类别时的置信度不足情况,从而提升系统的鲁棒性和可信度。其解决方案的关键在于引入并适配了Holistic Uncertainty Estimation (HolUE) 方法至文本领域,通过同时建模两类主要误差来源:文本不确定性(由查询表述不清引起)和画廊不确定性(由数据分布模糊导致),从而更全面地捕捉预测错误的可能性。实验表明,该方法在多个基准数据集上显著优于基于置信度的质量基线(SCF),预测拒绝比(Prediction Rejection Ratio, PRR)提升达40%–365%,证明了其在提升OSTC系统可靠性方面的有效性。

链接: https://arxiv.org/abs/2604.08560
作者: Leonid Erlygin,Alexey Zaytsev
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate uncertainty estimation is essential for building robust and trustworthy recognition systems. In this paper, we consider the open-set text classification (OSTC) task and uncertainty estimation for it. For OSTC, a text sample should be classified as one of the existing classes or rejected as unknown. To account for the different uncertainty types encountered in OSTC, we adapt the Holistic Uncertainty Estimation (HolUE) method for the text domain. Our approach addresses two major causes of prediction errors in text recognition systems: text uncertainty that stems from ill-formulated queries and gallery uncertainty that is related to the ambiguity of the data distribution. By capturing these sources, it becomes possible to predict when the system will make a recognition error. We propose a new OSTC benchmark and conduct extensive experiments on a wide range of data, utilizing authorship attribution, intent, and topic classification datasets. HolUE achieves a 40-365% improvement in Prediction Rejection Ratio (PRR) over the quality-based SCF baseline across datasets: 365% on Yahoo Answers (0.79 vs 0.17 at FPIR 0.1), 347% on DBPedia (0.85 vs 0.19), 240% on PAN authorship attribution (0.51 vs 0.15 at FPIR 0.5), and 40% on CLINC150 intent classification (0.73 vs 0.52). We make public our code and protocols at this https URL

[NLP-66] Medical Reasoning with Large Language Models: A Survey and MR-Bench

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在医疗场景中应用时面临的临床推理能力不足问题。尽管LLMs在医学考试类任务中表现优异,但真实临床决策具有安全敏感性、情境依赖性和证据动态演化的特征,单纯依赖事实记忆无法满足实际需求,因此亟需具备鲁棒医学推理能力的模型。解决方案的关键在于:首先,基于认知科学中的临床推理理论,将医疗推理建模为归纳(induction)、演绎(deduction)和溯因(abduction)的迭代过程;其次,系统梳理现有方法并将其归类为七种技术路径,涵盖训练型与免训练型策略;最后,通过统一实验设置开展跨基准评估,并引入源自真实医院数据的MR-Bench基准,揭示了当前模型在考试级性能与真实临床决策准确性之间存在的显著差距,从而为未来研究指明方向。

链接: https://arxiv.org/abs/2604.08559
作者: Xiaohan Ren,Chenxiao Fan,Wenyin Ma,Hongliang He,Chongming Gao,Xiaoyan Zhao,Fuli Feng
机构: University of Science and Technology of China (中国科学技术大学); The First Affiliated Hospital of USTC (中国科学技术大学第一附属医院); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.

[NLP-67] WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models INTERSPEECH2026

【速读】: 该论文旨在解决自回归文本到语音(AR-TTS)模型在生成长序列时面临的计算和内存开销随序列长度呈二次增长的问题,这主要源于其全自注意力机制(full self-attention)。解决方案的关键在于提出WAND框架,通过将注意力机制解耦为两类:对条件token采用持久性全局注意力(persistent global attention),对生成token采用局部滑动窗口注意力(local sliding-window attention),从而实现常数级的计算与内存复杂度;同时结合课程学习策略逐步收紧注意力窗口以稳定微调,并利用知识蒸馏从全注意力教师模型中迁移高质量语音合成能力,显著提升数据效率并保持原始音质。

链接: https://arxiv.org/abs/2604.08558
作者: Hanna Lee,Tan Dat Nguyen,Jaehoon Kang,Kyuhong Shim
机构: Korea Advanced Institute of Science and Technology (韩国科学技术院); Sungkyunkwan University (成均馆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to Interspeech 2026

点击查看摘要

Abstract:Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent global attention over conditioning tokens and local sliding-window attention over generated tokens. To stabilize fine-tuning, we employ a curriculum learning strategy that progressively tightens the attention window. We further utilize knowledge distillation from a full-attention teacher to recover high-fidelity synthesis quality with high data efficiency. Evaluated on three modern AR-TTS models, WAND preserves the original quality while achieving up to 66.2% KV cache memory reduction and length-invariant, near-constant per-step latency.
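WAND 的"条件 token 全局注意力 + 生成 token 局部滑窗注意力"可以用一个布尔掩码矩阵示意。以下为假设性草图(非论文代码):每个生成 token 最多关注 n_cond + window 个键,因此单步计算与 KV 缓存开销不随生成长度增长:

```python
def wand_mask(n_cond, n_gen, window):
    """Boolean attention mask in the spirit of WAND: every query attends
    to all conditioning tokens (persistent global attention), while
    generated tokens attend only to generated tokens within a causal
    sliding window. Layout and names are illustrative."""
    n = n_cond + n_gen
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < n_cond:
                mask[q][k] = True   # global: conditioning tokens always visible
            elif q >= n_cond and k <= q and q - k < window:
                mask[q][k] = True   # local: causal window over generated tokens
    return mask

m = wand_mask(n_cond=2, n_gen=4, window=2)
```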

[NLP-68] Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

【速读】: 该论文旨在解决扩散语言模型(Diffusion-based Language Models, dLLMs)在安全对齐方面的脆弱性问题,即这些模型的安全机制依赖于一个单一且易被破坏的假设:去噪过程中的调度是单调的,且一旦承诺的拒绝令牌(refusal tokens)被固定后就不会再被重新评估。解决方案的关键在于揭示了这种安全机制的本质缺陷——其并非源于复杂的对抗鲁棒性,而是建立在一个结构上极为浅层的假设之上。研究通过一个简单的两步干预即可实现高达76.1%的攻击成功率(ASR),表明当前dLLM的安全性仅在调度未被违反时才成立,从而证明其安全性本质上是架构层面的非鲁棒性。

链接: https://arxiv.org/abs/2604.08557
作者: Arth Singh
机构: AIM Intelligence; National Institute of Technology Agartala
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure, 6 tables

点击查看摘要

Abstract:Diffusion-based language models (dLLMs) generate text by iteratively denoising masked token sequences. We show that their safety alignment rests on a single fragile assumption: that the denoising schedule is monotonic and committed tokens are never re-evaluated. Safety-aligned dLLMs commit refusal tokens within the first 8-16 of 64 denoising steps, and the schedule treats these commitments as permanent. A trivial two-step intervention - re-masking these tokens and injecting a 12-token affirmative prefix - achieves 76.1% ASR on HarmBench (n=159, Lg=128) against LLaDA-8B-Instruct and 81.8% ASR (n=159) against Dream-7B-Instruct, without any gradient computation or adversarial search. The simplicity of this exploit is itself the central finding: augmenting with gradient-optimized perturbation via a differentiable Gumbel-softmax chain consistently degrades ASR (e.g., 41.5% vs. 76.1% at Lg=128), confirming that the vulnerability is structural rather than requiring sophisticated exploitation. These findings reveal that dLLM safety is not adversarially robust but architecturally shallow - it holds only because the denoising schedule is never violated. We discuss defenses including safety-aware unmasking schedules, step-conditional prefix detection, and post-commitment re-verification.

[NLP-69] EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context

【速读】: 该论文旨在探究高效序列模型相较于简单时间平均方法在表征能力上的本质差异,核心问题是:固定系数的累积机制(如时间或深度方向上的线性加权)是否足以捕捉序列中的关键信息,还是必须依赖输入相关的动态选择机制才能实现性能突破。解决方案的关键在于引入指数移动平均(Exponential Moving Average, EMA)作为最简化的递归上下文结构(无门控、无内容检索),用以系统性地评估固定系数积累的表达边界。研究发现,EMA能够编码多尺度时间结构,在无需标签的情况下达到监督型双向循环神经网络(BiGRU)96%的语法角色标注性能,并在结构依赖任务上超越后者;同时,EMA丢失了token身份信息,且其数据无关的有损压缩特性导致下游预测器无法恢复被丢弃的信息——这说明固定系数积累存在不可逆的信息稀释问题,唯有通过学习得到的、输入依赖的选择机制(如注意力机制)才能有效解决这一瓶颈。

链接: https://arxiv.org/abs/2604.08556
作者: Arth Singh
机构: AIM Intelligence; National Institute of Technology Agartala
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 1 figure, 7 tables

点击查看摘要

Abstract:What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.
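作为参照,摘要所说"固定系数、与数据无关"的 EMA 轨迹本身只需几行代码即可实现。下面是多时间尺度 EMA 在 one-hot 词元序列上的示意(词表与衰减率均为虚构,不涉及门控或内容检索):

```python
def ema_traces(tokens, vocab, alphas):
    """Maintain one EMA trace per decay rate over one-hot token vectors:
    fixed-coefficient, data-independent accumulation. Several alphas give
    the multi-timescale context the paper probes."""
    traces = [[0.0] * len(vocab) for _ in alphas]
    index = {t: i for i, t in enumerate(vocab)}
    for tok in tokens:
        onehot = [0.0] * len(vocab)
        onehot[index[tok]] = 1.0
        for ti, a in enumerate(alphas):
            # trace <- a * trace + (1 - a) * onehot
            traces[ti] = [a * old + (1 - a) * new
                          for old, new in zip(traces[ti], onehot)]
    return traces

fast, slow = ema_traces("abab", vocab="ab", alphas=[0.5, 0.9])
```

快衰减轨迹偏重最近词元,慢衰减轨迹近似均匀——但两者都只保留计数式统计,丢弃了词元顺序与身份,这正是摘要所指的不可逆信息稀释。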

[NLP-70] SynDocDis: A Metadata-Driven Framework for Generating Synthetic Physician Discussions Using Large Language Models

【速读】: 该论文旨在解决医疗领域中医生间病例讨论(physician-physician discussions)数据难以获取的问题,这类对话蕴含丰富的临床推理知识,但受隐私法规和伦理限制难以直接用于训练或增强生成式 AI(Generative AI)模型。现有合成数据方法多集中于患者-医生交互或结构化病历,缺乏对医生间专业交流的高质量模拟。其解决方案的关键在于提出 SynDocDis 框架,该框架结合结构化提示(structured prompting)技术与去标识化的病例元数据(de-identified case metadata),生成在临床有效性(平均评分 4.4/5)和医学内容质量(平均评分 4.1/5)上均表现优异的医生间对话,同时确保隐私合规性(91% 临床相关性评分),为医学教育与临床决策支持提供可信赖的合成数据来源。

链接: https://arxiv.org/abs/2604.08555
作者: Beny Rubinstein,Sergio Matos
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Physician-physician discussions of patient cases represent a rich source of clinical knowledge and reasoning that could feed AI agents to enrich and even participate in subsequent interactions. However, privacy regulations and ethical considerations severely restrict access to such data. While synthetic data generation using Large Language Models offers a promising alternative, existing approaches primarily focus on patient-physician interactions or structured medical records, leaving a significant gap in physician-to-physician communication synthesis. We present SynDocDis, a novel framework that combines structured prompting techniques with privacy-preserving de-identified case metadata to generate clinically accurate physician-to-physician dialogues. Evaluation by five practicing physicians in nine oncology and hepatology scenarios demonstrated exceptional communication effectiveness (mean 4.4/5) and strong medical content quality (mean 4.1/5), with substantial interrater reliability (kappa = 0.70, 95% CI: 0.67-0.73). The framework achieved 91% clinical relevance ratings while maintaining doctors’ and patients’ privacy. These results place SynDocDis as a promising framework for advancing medical AI research ethically and responsibly through privacy-compliant synthetic physician dialogue generation with direct applications in medical education and clinical decision support.

[NLP-71] Drift and selection in LLM text ecosystems

【速读】: 该论文试图解决的问题是:在生成式 AI (Generative AI) 系统不断生成文本并反馈至公共文本记录(public text record)的循环过程中,这种递归机制如何影响语料库的结构演化及其稳定性。解决方案的关键在于构建了一个可精确求解的数学框架,基于变阶 n-gram 代理模型(variable-order n-gram agents),将作用于公共语料库的两种力量明确分离:一是“漂移”(drift)——即未经筛选的重复使用会逐步消除罕见语言形式,在无限语料极限下可精确刻画其稳定分布;二是“选择”(selection)——即发布、排序与验证等过滤机制决定哪些内容进入记录,其结果取决于所选标准。当出版仅反映统计现状时,语料库趋于浅层状态;而当出版具有规范性(如奖励质量、正确性或新颖性)时,深层结构得以维持,并可建立其偏离浅层平衡的最优上界。该框架揭示了递归发布何时压缩文本结构、何时通过选择性过滤保持丰富性,为 AI 训练语料库的设计提供了理论依据。

链接: https://arxiv.org/abs/2604.08554
作者: Søren Riis
机构: Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The public text record – the material from which both people and AI systems now learn – is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order n -gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative – rewarding quality, correctness or novelty – deeper structure persists, and we establish an optimal upper bound on the resulting divergence from shallow equilibria. The framework therefore identifies when recursive publication compresses public text and when selective filtering sustains richer structure, with implications for the design of AI training corpora.
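论文的"漂移"力可以用一个玩具过程体会:反复按经验一元分布重采样语料,词型只会被保留或丢失、不会再生,稀有形式因此逐代消失。以下模拟为示意性假设,并非论文的变阶 n-gram 框架本身:

```python
import random

def drift_generation(corpus, rng):
    """One generation of unfiltered reuse: resample a same-size corpus
    from the current empirical unigram distribution. Types can only be
    kept or lost, never reintroduced -- a toy model of drift."""
    return rng.choices(corpus, k=len(corpus))

rng = random.Random(0)
# 90 tokens of one common type plus 10 singleton rare types (toy data).
corpus = ["common"] * 90 + ["rare%d" % i for i in range(10)]
for _ in range(20):
    corpus = drift_generation(corpus, rng)
survivors = set(corpus)
```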

[NLP-72] GNN-as-Judge: Unleashing the Power of LLM s for Graph Learning with GNN Feedback ICLR2026

【速读】: 该论文旨在解决在低资源场景下(即标注节点严重稀缺)大型语言模型(Large Language Models, LLMs)在文本属性图(Text-Attributed Graphs, TAGs)上进行少样本半监督学习时性能受限的问题,核心挑战在于:(i) 如何生成并筛选可靠的伪标签(pseudo labels),以及 (ii) 在使用伪标签微调LLMs时如何缓解潜在的标签噪声。解决方案的关键是提出GNN-as-Judge框架,其通过引入图神经网络(Graph Neural Networks, GNNs)的结构归纳偏置(structural inductive bias),设计了一种协同伪标签策略——首先识别受已标注节点影响最显著的未标注节点,再利用LLMs与GNNs之间的共识与分歧模式生成高质量伪标签;同时开发了一种弱监督的LLM微调算法,能够从信息丰富的伪标签中蒸馏知识,同时抑制标签噪声。实验表明,该方法在多个TAG数据集上显著优于现有方法,尤其在标注数据稀缺的场景下表现突出。

链接: https://arxiv.org/abs/2604.08553
作者: Ruiyao Xu,Kaize Ding
机构: Northwestern University (西北大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ICLR 2026

点击查看摘要

Abstract:Large Language Models (LLMs) have shown strong performance on text-attributed graphs (TAGs) due to their superior semantic understanding ability on textual node features. However, their effectiveness as predictors in the low-resource setting, where labeled nodes are severely limited and scarce, remains constrained since fine-tuning LLMs usually requires sufficient labeled data, especially when the TAG shows complex structural patterns. In essence, this paper targets two key challenges: (i) the difficulty of generating and selecting reliable pseudo labels on TAGs for LLMs, and (ii) the need to mitigate potential label noise when fine-tuning LLMs with pseudo labels. To counter the challenges, we propose a new framework, GNN-as-Judge, which can unleash the power of LLMs for few-shot semi-supervised learning on TAGs by incorporating the structural inductive bias of Graph Neural Networks (GNNs). Specifically, GNN-as-Judge introduces a collaborative pseudo-labeling strategy that first identifies the most influenced unlabeled nodes from labeled nodes, then exploits both the agreement and disagreement patterns between LLMs and GNNs to generate reliable labels. Furthermore, we develop a weakly-supervised LLM fine-tuning algorithm that can distill the knowledge from informative pseudo labels while mitigating the potential label noise. Experiments on multiple TAG datasets demonstrate that GNN-as-Judge significantly outperforms existing methods, particularly in low-resource regimes where labeled data are scarce.

[NLP-73] SUPERNOVA: Eliciting General Reasoning in LLM s with Reinforcement Learning on Natural Instructions

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在通用推理任务中表现不足的问题,尤其是因果推理和时间理解等复杂能力的欠缺。当前基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)虽在数学与代码等形式化领域取得进展,但其扩展受限于高质量、多样化的可验证训练数据稀缺。解决方案的关键在于提出 SUPERNOVA 数据整理框架,其核心思想是利用包含专家标注答案的指令微调数据集,从中系统性提取并适配出适用于 RLVR 的高质推理模式;通过100余组受控强化学习实验发现,源任务选择策略对下游推理性能影响显著,尤其以针对目标任务个体表现优化的选择方式优于整体平均性能策略,从而证明了结构化数据设计对提升通用推理能力的有效性。

链接: https://arxiv.org/abs/2604.08477
作者: Ashima Suvarna,Kendrick Phan,Mehrab Beikzadeh,Hritik Bansal,Saadia Gabriel
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 23 Pages, 4 figures

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground-truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data is available at this https URL.

[NLP-74] Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon Cross-scenario Heterogeneous Behavior Traces

【速读】: 该论文旨在解决现有用户模拟基准在真实人类行为建模上的局限性问题,即当前基准多局限于孤立场景、窄动作空间或合成数据,无法全面捕捉真实世界中长期、跨场景和异构的行为模式。其解决方案的关键在于提出OmniBehavior——首个完全基于真实世界数据构建的用户模拟基准,该框架整合了长时程、跨场景与多样化行为模式,并通过实证分析揭示了以往孤立场景数据导致的“隧道视野”问题,同时发现大语言模型(LLM)在模拟复杂行为时存在结构性偏差:趋向于生成平均化、同质化的“理想用户”,丧失个体差异与长尾行为特征,从而指明未来高保真用户模拟研究的核心方向。

链接: https://arxiv.org/abs/2604.08362
作者: Jiawei Chen,Ruoxi Xu,Boxi Cao,Ruotong Pan,Yunfei Zhang,Yifei Hu,Yong Du,Tingting Gao,Yaojie Lu,Yingfei Sun,Xianpei Han,Le Sun,Xiangyu Wu,Hongyu Lin
机构: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Kuaishou Technology
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

[NLP-75] Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

【速读】: 该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在多模态任务中常见的幻觉问题,即模型在生成过程中会自信地描述图像中并不存在的对象或属性。针对现有无训练干预方法在开放式和长文本生成场景下难以保持准确性的局限,作者提出了一种基于置信度感知的注意力校准框架(Confidence-Aware Attention Calibration, CAAC)。其解决方案的关键在于识别并修正两个核心偏差:空间感知偏差(spatial perception bias),即注意力在图像token间分布不均;以及模态偏差(modality bias),即模型随生成过程逐渐从视觉输入转向文本输入。CAAC通过两步策略实现:首先采用视觉token校准(Visual-Token Calibration, VTC)均衡视觉token间的注意力分配,其次引入自适应注意力重缩放(Adaptive Attention Re-Scaling, AAR),依据模型置信度动态调整注意力权重,从而强化视觉锚定,确保生成过程中的视觉一致性,尤其在长文本生成中显著减少幻觉现象。

链接: https://arxiv.org/abs/2505.21472
作者: Mehrdad Fazli,Bowen Wei,Ahmet Sari,Ziwei Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, and confidently describe objects or attributes not present in the image. Current training-free interventions struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding guided by the model’s confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.
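CAAC 的注意力再缩放思想可以用"令视觉 token 共同占据固定注意力份额"的玩具版本示意。论文中该份额由模型置信度动态决定;以下实现与参数均为假设,仅展示按原权重比例重新分配注意力质量的机制:

```python
def calibrate_visual_attention(attn, is_visual, visual_share):
    """Re-scale one attention row so visual tokens jointly receive
    `visual_share` of the mass, distributed proportionally to their
    original weights; textual tokens share the remainder."""
    v_mass = sum(a for a, v in zip(attn, is_visual) if v)
    t_mass = sum(a for a, v in zip(attn, is_visual) if not v)
    out = []
    for a, v in zip(attn, is_visual):
        if v:
            out.append(a / v_mass * visual_share)
        else:
            out.append(a / t_mass * (1 - visual_share))
    return out

# Toy row: two visual tokens hold only 20% of attention before calibration.
row = calibrate_visual_attention([0.1, 0.1, 0.8], [True, True, False], 0.5)
```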

信息检索

[IR-0] Trans-RAG: Query-Centric Vector Transformation for Secure Cross-Organizational Retrieval DASFAA2026

【速读】:该论文旨在解决跨组织部署的检索增强生成(Retrieval Augmented Generation, RAG)系统在安全性、准确性和效率之间存在的根本性矛盾问题。现有加密方法在解密过程中暴露明文,而联邦架构则阻碍资源集成并带来显著开销。其解决方案的核心是提出Trans-RAG框架,采用一种新颖的向量空间语言范式,使每个组织的知识存在于数学上隔离的语义空间中;关键创新在于向量2变换(vector2Trans)技术,该技术通过多阶段查询中心转换机制,使查询能够动态“翻译”为各组织向量空间的“语言”,从而在无需解密的前提下实现高效检索,既保障了安全又维持了原生检索效率。

链接: https://arxiv.org/abs/2604.09541
作者: Yu Liu,Kun Peng,Wenxiao Zhang,Fangfang Yuan,Cong Cao,Wenxuan Lu,Yanbing Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注: Accepted by DASFAA 2026

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) systems deployed across organizational boundaries face fundamental tensions between security, accuracy, and efficiency. Current encryption methods expose plaintext during decryption, while federated architectures prevent resource integration and incur substantial overhead. We introduce Trans-RAG, implementing a novel vector space language paradigm where each organization’s knowledge exists in a mathematically isolated semantic space. At the core lies vector2Trans, a multi-stage transformation technique that enables queries to dynamically “speak” each organization’s vector space “language” through query-centric transformations, eliminating decryption overhead while maintaining native retrieval efficiency. Security evaluations demonstrate near-orthogonal vector spaces with 89.90° angular separation and 99.81% isolation rates. Experiments across 8 retrievers, 3 datasets, and 3 LLMs show minimal accuracy degradation (3.5% decrease in nDCG@10) and significant efficiency improvements over homomorphic encryption.
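"数学上隔离的语义空间"可以借正交变换直观理解:对每个组织施加一个私有旋转,空间内部的相似度保持不变,而未经变换的查询无法正确匹配该空间的向量。以下二维示例纯属示意,并非 vector2Trans 的实际算法:

```python
import math

def rotate(vec, theta):
    """2-D rotation -- a stand-in for an organization-specific orthogonal
    transform mapping shared embeddings into an isolated space."""
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

doc = (1.0, 0.0)
query = (0.8, 0.6)
theta = math.pi / 2                      # org B's private transform (toy secret)
doc_b, query_b = rotate(doc, theta), rotate(query, theta)

within = cos_sim(query_b, doc_b)         # similarity preserved inside org B's space
across = cos_sim(query, doc_b)           # raw query mismatches org B's vectors
```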

[IR-1] Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision

【速读】:该论文旨在解决生成式 AI (Generative AI) 在证据依赖推理中的关键瓶颈问题,即模型在做出决策时未能真正依赖所提供的证据,而是可能仅基于案例上下文或弱监督信号进行预测。这一问题源于监督信号不足、证据与主张之间的关联松散,以及评估未直接测试证据依赖性。解决方案的关键在于提出一种“案例驱动的证据验证”框架(case-grounded evidence verification),其核心创新是设计了一种无需人工标注证据即可自动构建显式支持样本和语义受控的非支持样本(包括反事实错误状态和主题相关负例)的监督构造方法,从而有效引导模型学习证据与主张之间的因果关系,实现真正的证据依赖性。

链接: https://arxiv.org/abs/2604.09537
作者: Soroosh Tayebi Arasteh,Mehdi Joodaki,Mahshad Lotfinia,Sven Nebelung,Daniel Truhn
机构: RWTH Aachen University (亚琛工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.

[IR-2] RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval

【速读】:该论文旨在解决长上下文推理中“lost-in-thought”问题,即在推理过程中,随着 reasoning 步骤的推进,模型对上下文中相关信息的检索能力显著下降,导致后续推理难以有效利用已有信息。其解决方案的关键在于提出 RecaLLM,通过在推理与显式上下文检索之间交替进行(interleaving reasoning with explicit in-context retrieval),使模型能够动态地识别并获取解决中间子问题所需的证据;同时引入一种低开销的约束解码机制,支持原文复制证据片段(verbatim copying of evidence spans),从而增强生成内容的可追溯性和准确性。该方法在 RULER 和 HELMET 两个长上下文基准上显著优于基线模型,且仅需最多 10K tokens 的训练样本即可实现高达 128K tokens 上下文窗口的性能提升,突破了传统长上下文训练依赖昂贵长文本数据的限制。

链接: https://arxiv.org/abs/2604.09494
作者: Kyle Whitecross,Negin Rahimi
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Code, data, and models available at this https URL

点击查看摘要

Abstract:We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open-source LLMs, we observe that in-context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test-time scaling that we refer to as lost-in-thought: reasoning steps that improve performance also make subsequent in-context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible-overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long-context approaches, highlighting a promising path toward improving long-context performance without expensive long-context training data.
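摘要中"原文复制证据片段"的约束解码思想可用如下玩具代码示意(非论文实现,分词与示例句均为假设):一旦开始逐字复制,下一个 token 只能从上下文中能够延续当前已复制前缀的 token 中选取,从而保证复制内容与上下文逐字一致。

```python
def allowed_next_tokens(context_tokens, copied_so_far):
    """返回能使 copied_so_far 仍是上下文某片段逐字前缀的候选下一 token 集合."""
    n = len(copied_so_far)
    allowed = set()
    for start in range(len(context_tokens) - n):
        if context_tokens[start:start + n] == list(copied_so_far):
            allowed.add(context_tokens[start + n])
    return allowed

# 玩具上下文: 以空格分词代替真实 tokenizer.
ctx = "the special magic number is 4711 , remember 4711 well".split()
```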

[IR-3] Dynamic Ranked List Truncation for Reranking Pipelines via LLM-generated Reference-Documents

【速读】:该论文旨在解决大语言模型(Large Language Models, LLM)在排序重排(reranking)任务中面临的计算开销高和上下文长度过长的问题。其核心挑战在于如何在保证重排效果的同时提升效率,尤其是在 ranked list truncation (RLT) 和 listwise reranking 过程中依赖人工设定的超参数与主题无关的启发式策略。解决方案的关键在于利用 LLM 生成语义可控的参考文档(reference documents),这些文档作为相关与不相关文档之间的桥梁,用于指导 RLT 的截断决策,并支持更高效的并行或自适应步长窗口处理机制,从而实现对原始排序列表的高效、精准重排。实验表明,该方法显著提升了 LLM-based listwise reranking 的效率(最高加速达 66%),同时在 TREC Deep Learning 等基准上优于现有基于 RLT 的方法。

链接: https://arxiv.org/abs/2604.09492
作者: Nilanjan Sinhababu,Soumedhik Bharati,Debasis Ganguly,Pabitra Mitra
机构: IIT Kharagpur(印度理工学院克哈格普尔分校); Sister Nivedita University(姐妹妮维达大学); University of Glasgow(格拉斯哥大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models (LLM) have been widely used in reranking. Computational overhead and large context lengths remain a challenging issue for LLM rerankers. Efficient reranking usually involves selecting a subset of the ranked list from the first stage, known as ranked list truncation (RLT). The truncated list is processed further by a reranker. For LLM rerankers, the ranked list is often partitioned and processed sequentially in batches to reduce the context length. Both these steps involve hyperparameters and topic-agnostic heuristics. Recently, LLMs have been shown to be effective for relevance judgment. Equivalently, we propose that LLMs can be used to generate reference documents that can act as a pivot between relevant and non-relevant documents in a ranked list. We propose methods to use these generated reference documents for RLT as well as for efficient listwise reranking. While reranking, we process the ranked list in either parallel batches of non-overlapping windows or overlapping windows with adaptive strides, improving the existing fixed stride setup. The generated reference documents are also shown to improve existing efficient listwise reranking frameworks. Experiments on TREC Deep Learning benchmarks show that our approach outperforms existing RLT-based approaches. In-domain and out-of-domain benchmarks demonstrate that our proposed methods accelerate LLM-based listwise reranking by up to 66% compared to existing approaches. This work not only establishes a practical paradigm for efficient LLM-based reranking but also provides insight into the capability of LLMs to generate semantically controlled documents using relevance signals.
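参考文档充当相关/不相关"枢轴"的截断思路可示意如下(打分函数与分数值均为笔者假设,非论文实现):将 LLM 生成的参考文档与候选文档同样打分,在候选得分首次低于参考得分处截断排序列表。

```python
# 假设的查询-文档得分; 真实系统中由检索模型给出.
scores = {"d1": 0.91, "d2": 0.74, "d3": 0.40, "d4": 0.12}

def truncate_at_pivot(ranked, doc_score, reference_score):
    """保留得分不低于参考文档得分的前缀, 其余全部截断."""
    kept = []
    for doc in ranked:
        if doc_score(doc) < reference_score:
            break  # 枢轴之后的文档视为不相关
        kept.append(doc)
    return kept

# 参考文档得分 0.60 作为枢轴.
kept = truncate_at_pivot(["d1", "d2", "d3", "d4"], scores.get, 0.60)
```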

[IR-4] ME-PSR: Time-aware Multi-interest and Explanation Personalization for Sequential Recommendation

【速读】:该论文旨在解决个性化序列推荐(Personalized Sequential Recommendation, PSR)中用户在时间节奏偏好、多细粒度潜在兴趣以及推荐与解释之间的语义对齐方面存在的个性化差异问题。解决方案的关键在于提出TME-PSR模型,其核心创新包括:(1)采用双视角门控时间编码器(dual-view gated time encoder)捕捉用户的个性化时间节奏;(2)设计轻量级多头线性循环单元架构(lightweight multihead Linear Recurrent Unit),实现高效且细粒度的子兴趣建模;(3)引入动态双分支互信息加权机制(dynamic dual-branch mutual information weighting mechanism),以实现推荐结果与解释之间的个性化语义对齐。实验表明,该方法在提升推荐准确率和解释质量的同时,显著降低了计算成本。

链接: https://arxiv.org/abs/2604.09439
作者: Qingzhuo Wang,Leilei Wen,Juntao Chen,Kunyu Peng,Ruiyang Qin,Zhihua Wei,Wen Shen
机构: Tongji University (同济大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we propose a sequential recommendation model that integrates Time-aware personalization, Multi-interest personalization, and Explanation personalization for Personalized Sequential Recommendation (TME-PSR). That is, we consider the differences across different users in temporal rhythm preference, multiple fine-grained latent interests, and the personalized semantic alignment between recommendations and explanations. Specifically, the proposed TME-PSR model employs a dual-view gated time encoder to capture personalized temporal rhythms, a lightweight multihead Linear Recurrent Unit architecture that enables fine-grained sub-interest modeling with improved efficiency, and a dynamic dual-branch mutual information weighting mechanism to achieve personalized alignment between recommendations and explanations. Extensive experiments on real-world datasets demonstrate that our method consistently improves recommendation accuracy and explanation quality, at a lower computational cost.
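摘要中的 Linear Recurrent Unit 属于对角线性循环结构,逐通道递推 h_t = a⊙h_{t-1} + b⊙x_t;下面给出单个头扫描过程的最小示意(参数取示例常数,与论文的多头门控设计无关,仅作说明):

```python
# 对角线性循环的最小示意: 每个通道独立递推;
# 多头版本即在交互序列上并行运行多份此类扫描. a, b 为示例常数.
def lru_scan(xs, a, b):
    h = [0.0] * len(a)
    out = []
    for x in xs:
        h = [ai * hi + bi * xi for ai, bi, hi, xi in zip(a, b, h, x)]
        out.append(h)
    return out
```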

[IR-5] On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework

【速读】:该论文旨在解决量子启发式(quantum-inspired)1024维文档嵌入在信息检索中的表示局限性问题,特别是其在语义结构编码能力与排名稳定性方面的不足。解决方案的关键在于构建一个实验框架,通过重叠窗口和多尺度聚合生成嵌入,并结合语义投影(如EigAngle)、电路启发特征映射、教师-学生蒸馏机制以及指纹识别技术以确保可复现性和可控评估。此外,引入静态与动态插值、候选集合并策略及概念性的alpha-oracle作为混合检索诊断工具,从而系统性地对比BM25与嵌入方法的融合效果。实验表明,尽管教师嵌入提供稳定的语义结构,但独立的量子启发嵌入表现弱且不稳定,蒸馏效果不一致,而混合检索可通过整合词法与嵌入信号恢复竞争力,揭示了此类嵌入作为辅助组件而非独立检索表示的本质限制。

链接: https://arxiv.org/abs/2604.09430
作者: Dario Maio
机构: University of Bologna (博洛尼亚大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 44 pages, 6 figures

点击查看摘要

Abstract:Text embeddings are central to modern information retrieval and Retrieval-Augmented Generation (RAG). While dense models derived from Large Language Models (LLMs) dominate current practice, recent work has explored quantum-inspired alternatives motivated by the geometric properties of Hilbert-like spaces and their potential to encode richer semantic structure. This paper presents an experimental framework for constructing quantum-inspired 1024-dimensional document embeddings based on overlapping windows and multi-scale aggregation. The pipeline combines semantic projections (e.g., EigAngle), circuit-inspired feature mappings, and optional teacher-student distillation, together with a fingerprinting mechanism for reproducibility and controlled evaluation. We introduce a set of diagnostic tools for hybrid retrieval, including static and dynamic interpolation between BM25 and embedding-based scores, candidate union strategies, and a conceptual alpha-oracle that provides an upper bound for score-level fusion. Experiments on controlled corpora of Italian and English documents across technical, narrative, and legal domains, using synthetic queries, show that BM25 remains a strong baseline, teacher embeddings provide stable semantic structure, and standalone quantum-inspired embeddings exhibit weak and unstable ranking signals. Distillation yields mixed effects, improving alignment in some cases but not consistently enhancing retrieval performance, while hybrid retrieval can recover competitive results when lexical and embedding-based signals are combined. Overall, the results highlight structural limitations in the geometry of quantum-inspired embeddings, including distance compression and ranking instability, and clarify their role as auxiliary components rather than standalone retrieval representations. 
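摘要提到的 BM25 与嵌入得分的静态插值(含候选集合并)可示意如下;逐查询 min-max 归一化为笔者假设,并非论文的具体实现:

```python
def fuse(bm25, dense, alpha):
    """逐查询 min-max 归一化后按 alpha 线性插值; 候选取两路得分的并集."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (v - lo) / (hi - lo) if hi > lo else 0.0
                for d, v in scores.items()}
    nb, nd = norm(bm25), norm(dense)
    return {d: alpha * nb.get(d, 0.0) + (1 - alpha) * nd.get(d, 0.0)
            for d in set(bm25) | set(dense)}

# 玩具得分: 文档 c 仅出现在嵌入检索的候选中.
fused = fuse({"a": 2.0, "b": 1.0}, {"a": 0.2, "b": 0.8, "c": 0.5}, alpha=0.5)
```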

[IR-6] Three Modalities, Two Design Probes, One Prototype and No Vision: Experience-Based Co-Design of a Multi-modal 3D Data Visualization Tool

【速读】:该论文旨在解决三维(3D)数据可视化在盲人及低视力(BLV)人群中的可访问性问题,尤其是在STEM领域如生物医学成像和光谱学中,传统基于视觉的表面图等工具对BLV用户不友好。解决方案的关键在于采用以体验为基础的共同设计(Experience-Based Co-Design)方法,联合BLV专家与非BLV研究人员组成跨学科团队,通过多阶段迭代流程开发出一款多模态、原生Web的可视化原型工具。该工具整合了参考声学化(reference sonification)、立体声与体音频(stereo and volumetric audio)以及可配置缓冲区聚合(configurable buffer aggregation)等功能,经BLV合作者验证显著提升了分析准确性与学习效率,从而为非视觉环境下的3D数据探索任务(如方向判断、特征定位、梯度追踪等)提供了可操作的设计范式与技术路径。

链接: https://arxiv.org/abs/2604.09426
作者: Sanchita S. Kamath,Aziz N Zeidieh,Venkatesh Potluri,Sile O’Modhrain,Kenneth Perry,JooYoung Seo
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Three-dimensional (3D) data visualizations, such as surface plots, are vital in STEM fields from biomedical imaging to spectroscopy, yet remain largely inaccessible to blind and low-vision (BLV) people. To address this gap, we conducted an Experience-Based Co-Design with BLV co-designers with expertise in non-visual data representations to create an accessible, multi-modal, web-native visualization tool. Using a multi-phase methodology, our team of five BLV and one non-BLV researcher(s) participated in two iterative sessions, comparing a low-fidelity tactile probe with a high-fidelity digital prototype. This process produced a prototype with empirically grounded features, including reference sonification, stereo and volumetric audio, and configurable buffer aggregation, which our co-designers validated as improving analytic accuracy and learnability. In this study, we target core analytic tasks essential for non-visual 3D data exploration: orientation, landmark and peak finding, comparing local maxima versus global trends, gradient tracing, and identifying occluded or partially hidden features. Our work offers accessibility researchers and developers a co-design protocol for translating tactile knowledge to digital interfaces, concrete design guidance for future systems, and opportunities to extend accessible 3D visualization into embodied data environments.

[IR-7] FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding

【速读】:该论文旨在解决现有时尚数据集在服装理解任务中碎片化、任务单一的问题,无法支持对穿搭的全面理解和专家级推理(如风格、场合适配性、搭配合理性等)。其解决方案的关键在于构建了一个由专业时尚专家标注的基准数据集 FashionStylist,通过专门设计的专家标注流程,在单品和穿搭两个层级提供专业注释,并支持三项代表性任务:穿搭到单品的定位(outfit-to-item grounding)、穿搭补全(outfit completion)以及穿搭评估(outfit evaluation),从而实现对复杂穿搭中单品识别、兼容性驱动的组合生成及专家级语义评价的统一建模与训练。

链接: https://arxiv.org/abs/2604.09249
作者: Kaidong Feng,Zhuoxuan Huang,Huizhong Guo,Yuting Jin,Xinyu Chen,Yue Liang,Yifei Gai,Li Zhou,Yunshan Ma,Zhu Sun
机构: Yanshan University (燕山大学); Central South University (中南大学); Zhejiang University (浙江大学); Southwest University (西南大学); Singapore Management University (新加坡管理大学); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Fashion understanding requires both visual perception and expert-level reasoning about style, occasion, compatibility, and outfit rationale. However, existing fashion datasets remain fragmented and task-specific, often focusing on item attributes, outfit co-occurrence, or weak textual supervision, and thus provide limited support for holistic outfit understanding. In this paper, we introduce FashionStylist, an expert-annotated benchmark for holistic and expert-level fashion understanding. Constructed through a dedicated fashion-expert annotation pipeline, FashionStylist provides professionally grounded annotations at both the item and outfit levels. It supports three representative tasks: outfit-to-item grounding, outfit completion, and outfit evaluation. These tasks cover realistic item recovery from complex outfits with layering and accessories, compatibility-aware composition beyond co-occurrence matching, and expert-level assessment of style, season, occasion, and overall coherence. Experimental results show that FashionStylist serves not only as a unified benchmark for multiple fashion tasks, but also as an effective training resource for improving grounding, completion, and outfit-level semantic evaluation in MLLM-based fashion systems.

[IR-8] Hybrid Cold-Start Recommender System for Closure Model Selection in Multiphase Flow Simulations

【速读】:该论文旨在解决多相流计算流体力学(Multiphase Computational Fluid Dynamics, CFD)中闭合模型(closure model)选择困难的问题,即在不同流动条件下,大量闭合模型组合的性能差异显著,不当选择会导致预测失真、模拟失败及计算资源浪费。其解决方案的关键在于将闭合模型选择建模为一个高成本科学领域中的冷启动推荐系统问题,并提出一种混合推荐框架:该框架融合基于元数据的案例相似性分析与通过矩阵补全实现的协同推理机制,从而在无历史数据的新案例上利用其描述性特征进行个性化推荐,同时借助相似案例的历史仿真结果提升推荐准确性。实验表明,该方法在13,600次仿真和多种数据稀疏场景下均优于基于流行度或专家设计的基准模型,并显著降低决策 regret(性能损失)。

链接: https://arxiv.org/abs/2604.09112
作者: S. Hänsch,A. Sajdoková,A. Rębowski,F. Miškařík,K. Ramakrishna,F. Schlegel,V. Rybář,R. Alves,P. Kordík
机构: Helmholtz-Zentrum Dresden-Rossendorf (亥姆霍兹德累斯顿-罗森多夫中心); Czech Technical University in Prague (捷克技术大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Selecting appropriate physical models is a critical yet difficult step in many areas of computational science and engineering. In multiphase Computational Fluid Dynamics (CFD), practitioners must choose among numerous closure model combinations whose performance varies strongly across flow conditions. Sub-optimal choices can lead to inaccurate predictions, simulation failures, and wasted computational resources, making model selection a prime candidate for data-driven decision support. This work formulates closure model selection as a cold-start recommender system problem in a high-cost scientific domain. We propose a hybrid recommendation framework that combines (i) metadata-driven case similarity and (ii) collaborative inference via matrix completion. The approach enables case-specific model recommendations for entirely new CFD cases using their descriptive features, while leveraging historical simulation results from similar cases. The methodology is evaluated on 13,600 simulations across 136 validation cases and 100 model combinations. A nested cross-validation protocol with experiment-level holdout is employed to rigorously assess generalisation to unseen flow scenarios under varying levels of data sparsity. Recommendation quality is measured using ranking-based metrics and a domain-specific regret measure capturing performance loss relative to the per-case optimum. Results show that the proposed hybrid recommender consistently outperforms popularity-based and expert-designed reference models and reduces regret across the investigated sparsities. These findings demonstrate that recommender system methodology can effectively support complex scientific decision-making tasks characterised by expensive evaluations, structured metadata, and limited prior observations.
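冷启动场景下"按元数据相似度加权聚合相似案例历史得分"的思路可用如下草图示意(Jaccard 相似度与数据结构均为笔者假设;论文的完整方案还结合了矩阵补全):

```python
def jaccard(a, b):
    """集合型元数据的简单相似度, 仅作演示."""
    return len(a & b) / len(a | b)

def predict(new_meta, cases, similarity=jaccard):
    """cases: {case_id: (meta_set, {model: score})} -> {model: 预测得分}."""
    num, den = {}, {}
    for meta, scores in cases.values():
        w = similarity(new_meta, meta)
        for model, s in scores.items():
            num[model] = num.get(model, 0.0) + w * s
            den[model] = den.get(model, 0.0) + w
    return {m: num[m] / den[m] for m in num if den[m] > 0}

cases = {
    "c1": ({"bubbly", "vertical"}, {"m1": 1.0, "m2": 0.0}),
    "c2": ({"annular", "horizontal"}, {"m1": 0.0, "m2": 1.0}),
}
# 新案例的元数据与 c1 完全相同, 预测应复现 c1 的模型表现.
pred = predict({"bubbly", "vertical"}, cases)
```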

[IR-9] DIAURec: Dual-Intent Space Representation Optimization for Recommendation

【速读】:该论文旨在解决推荐系统中用户表示学习的局限性问题,即如何更有效地捕捉用户的潜在偏好。现有方法通常依赖稀疏交互数据生成用户表示,难以全面刻画用户行为,且多数研究侧重于模型可解释性而非表示优化,导致推荐效果提升有限。解决方案的关键在于提出DIAURec框架,该框架通过统一意图建模(intent modeling)与语言建模(language modeling),在协同信号和语言信号共同构建的原型空间(prototype space)与分布意图空间(distribution intent space)基础上进行表示重构,并设计了一套综合的表示优化策略:以对齐(alignment)与均匀性(uniformity)为核心目标,结合粗粒度与细粒度匹配实现跨空间的有效对齐,同时引入空间内正则化与交互正则化机制,增强模型鲁棒性并防止重构空间中的表示坍塌(representation collapse)。

链接: https://arxiv.org/abs/2604.09087
作者: Yu Zhang,Yiwen Zhang,Yi Zhang,Lei Sang
机构: Anhui University (安徽大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:General recommender systems deliver personalized services by learning user and item representations, with the central challenge being how to capture latent user preferences. However, representations derived from sparse interactions often fail to comprehensively characterize user behaviors, thereby limiting recommendation effectiveness. Recent studies attempt to enhance user representations through sophisticated modeling strategies ( e.g., intent or language modeling). Nevertheless, most works primarily concentrate on model interpretability instead of representation optimization. This imbalance has led to limited progress, as representation optimization is crucial for recommendation quality by promoting the affinity between users and their interacted items in the feature space, yet remains largely overlooked. To overcome these limitations, we propose DIAURec, a novel representation learning framework that unifies intent and language modeling for recommendation. DIAURec reconstructs representations based on the prototype and distribution intent spaces formed by collaborative and language signals. Furthermore, we design a comprehensive representation optimization strategy. Specifically, we adopts alignment and uniformity as the primary optimization objectives, and incorporates both coarse- and fine-grained matching to achieve effective alignment across different spaces, thereby enhancing representational consistency. Additionally, we further introduce intra-space and interaction regularization to enhance model robustness and prevent representation collapse in reconstructed space representation. Experiments on three public datasets against fifteen baseline methods show that DIAURec consistently outperforms state-of-the-art baselines, fully validating its effectiveness and superiority.
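摘要所指的对齐(alignment)与均匀性(uniformity)目标通常采用 Wang & Isola (2020) 的形式:对齐为正样本对在单位球面上的平均平方距离,均匀性为全体嵌入两两对的对数平均高斯势。纯 Python 示意如下(非论文实现):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(pairs):
    """正样本对归一化后的平均平方距离, 越小越对齐."""
    return sum(sq_dist(normalize(u), normalize(v)) for u, v in pairs) / len(pairs)

def uniformity(embs, t=2.0):
    """全体两两对的对数平均高斯势, 越小分布越均匀."""
    embs = [normalize(e) for e in embs]
    vals = [math.exp(-t * sq_dist(u, v))
            for i, u in enumerate(embs) for v in embs[i + 1:]]
    return math.log(sum(vals) / len(vals))
```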

[IR-10] aming the Black Swan: A Momentum-Gated Hierarchical Optimisation Framework for Asymmetric Alpha Generation

【速读】:该论文旨在解决传统动量策略(momentum strategies)中存在的“赢家诅咒”(Winner’s Curse)问题,即高收益资产在市场反转时易出现集聚波动和严重回撤的结构性脆弱性。其解决方案的关键在于提出自适应股权生成与免疫系统(Adaptive Equity Generation and Immunisation System, AEGIS),通过两个核心机制实现:一是采用波动率调整的动量滤波器识别趋势强度,二是利用极小极大相关性算法(minimax correlation algorithm)强制结构分散化;进而结合序列最小二乘规划(SLSQP)优化资本配置以最大化索提诺比率(Sortino ratio)。该架构使组合能够动态适应不同市场状态,在熊市中降低崩溃强度、解耦相关风险,同时在牛市中保留非对称上行参与度,从而在控制下行风险的同时实现超额收益。

链接: https://arxiv.org/abs/2604.09060
作者: Arya Chakraborty,Randhir Singh
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)
备注: 18 pages, 17 figures, 6 tables, 3 algorithms

点击查看摘要

Abstract:Conventional momentum strategies, despite their proven efficacy in generating alpha, frequently suffer from the “Winner’s Curse”, a structural vulnerability in which high performing assets exhibit clustered volatility and severe drawdowns during market reversals. To counteract this propensity for momentum crashes, this study presents the Adaptive Equity Generation and Immunisation System (AEGIS), a novel framework that fundamentally reengineers the trade-off between growth and stability. By leveraging a volatility-adjusted momentum filter to identify trend strength and employing a minimax correlation algorithm to enforce structural diversification, the model utilises sequential least squares programming (SLSQP) to optimise capital allocation for the sortino ratio. This architecture allows the portfolio to dynamically adapt to distinct market regimes: explicitly lowering the intensity of crashes during bear markets by decoupling correlated risks, while retaining asymmetric upside participation during bull runs. Empirical validation via a comprehensive 20-year walk-forward backtest (2006-2025), which covers significant stress events like the 2008 Global Financial Crisis, confirms that the framework produces substantial excess alpha relative to the standard SP 500 benchmark. Notably, the strategy successfully matched the capital appreciation of the high-beta NASDAQ-100 index while achieving significantly reduced downside volatility and improved structural resilience. These results suggest that synthetic beta can be effectively engineered through mathematical regularisation, enabling investors to capture the high-growth characteristics of concentrated portfolios while preserving the defensive stability typically associated with broad-market diversification.
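优化目标索提诺比率(Sortino ratio)只惩罚低于目标收益的下行波动,定义可示意如下(目标收益取 0、不做年化均为演示假设,非论文的 SLSQP 优化本身):

```python
import math

def sortino(returns, target=0.0):
    """超额收益均值 / 下行偏差; 仅低于目标的收益计入分母."""
    mean_excess = sum(r - target for r in returns) / len(returns)
    downside = [min(0.0, r - target) ** 2 for r in returns]
    dd = math.sqrt(sum(downside) / len(returns))
    return mean_excess / dd if dd > 0 else float("inf")
```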

[IR-11] Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA

【速读】:该论文旨在解决两跳问答(Two-hop QA)中检索阶段的路由决策问题,即如何根据问题类型自动选择最优的检索策略:是仅基于问题文本进行检索(Q-dominant),还是需要结合桥接段落中的关系句进行检索(B-dominant)。其核心贡献在于通过三个理论定理揭示了检索性能与查询特征之间的内在联系(如AUC与余弦分离边界单调相关、表面文本谓词可有效区分两种情形、桥接优势依赖于关系句而非实体名本身),并据此提出轻量级二分类路由模型 RegimeRouter。该方案的关键创新在于利用从谓词定义直接提取的五个文本特征,实现对不同检索路径的精准判别,在零样本迁移至MuSiQue和HotpotQA等数据集时显著提升R@5指标(分别+5.6 pp 和 +5.3 pp,p < 0.001),且无负向影响。

链接: https://arxiv.org/abs/2604.09019
作者: Andre Bacellar
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 5 figures. Theory and empirical validation of regime-conditional multi-hop retrieval routing

点击查看摘要

Abstract:Two-hop QA retrieval splits queries into two regimes determined by whether the hop-2 entity is explicitly named in the question (Q-dominant) or only in the bridge passage (B-dominant). We formalize this split with three theorems: (T1) per-query AUC is a monotone function of the cosine separation margin, with R^2 = 0.90 for six of eight type-encoder pairs; (T2) regime is characterized by two surface-text predicates, with P1 decisive for routing and P2 qualifying the B-dominant case, holding across three encoders and three datasets; and (T3) bridge advantage requires the relation-bearing sentence, not entity name alone, with removal causing an 8.6-14.1 pp performance drop (p 0.001). Building on this theory, we propose RegimeRouter, a lightweight binary router that selects between question-only and question-plus-relation-sentence retrieval using five text features derived directly from the predicate definitions. Trained on 2WikiMultiHopQA (n = 881, 5-fold cross-fitted) and applied zero-shot to MuSiQue and HotpotQA, RegimeRouter achieves +5.6 pp (p 0.001), +5.3 pp (p = 0.002), and +1.1 pp (non-significant, no-regret) R@5 improvement, respectively, with artifact-driven.
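按摘要对两种 regime 的定义(hop-2 实体是否在问题文本中显式出现),路由逻辑可极简示意如下;字符串包含判断是对论文五个文本特征的简化替代,仅作说明:

```python
def route(question, hop2_entity, relation_sentence):
    """Q-dominant 时仅用问题检索, B-dominant 时拼接桥接段落的关系句."""
    q_dominant = hop2_entity.lower() in question.lower()
    return question if q_dominant else question + " " + relation_sentence
```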

[IR-12] MAB-DQA: Addressing Query Aspect Importance in Document Question Answering with Multi-Armed Bandits ACL2026

【速读】:该论文旨在解决多模态检索增强生成(Multimodal Retrieval-Augmented Generation, RAG)在文档问答(Document Question Answering, DQA)任务中因检索阶段仅保留少量候选页面(如Top-4)而导致的视觉信息利用不充分问题,即忽视了那些虽不具视觉显著性但信息丰富的内容。解决方案的关键在于提出一种基于多臂老虎机(Multi-Armed Bandit, MAB)的DQA框架(MAB-DQA),其通过将查询分解为多个面向特定语义方面的子查询(aspect-aware subqueries),为每个子查询独立检索对应的候选页面集,并将每个子查询视为一个“臂”,利用少量代表性页面的初步推理结果作为奖励信号来估计各方面的效用值。在此基础上,MAB-DQA采用探索-利用策略动态调整检索预算分配,优先聚焦于高价值方面,从而有效整合最具信息量的页面及其关联关系,提升答案生成质量。

链接: https://arxiv.org/abs/2604.08952
作者: Yixin Xiang,Yunshan Ma,Xiaoyu Du,Yibing Chen,Yanxin Zhang,Jinhui Tang
机构: Nanjing University of Science and Technology(南京理工大学); Singapore Management University(新加坡管理大学); Nanjing Pami Intelligent Technology Co., Ltd.(南京帕米智能科技有限公司); University of Wisconsin - Madison(威斯康星大学麦迪逊分校); Nanjing Forestry University(南京林业大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted by ACL 2026. 19 pages, 9 figures, 6 tables

点击查看摘要

Abstract:Document Question Answering (DQA) involves generating answers from a document based on a user’s query, representing a key task in document understanding. This task requires interpreting visual layouts, which has prompted recent studies to adopt multimodal Retrieval-Augmented Generation (RAG) that processes page images for answer generation. However, in multimodal RAG, visual DQA struggles to utilize a large number of images effectively, as the retrieval stage often retains only a few candidate pages (e.g., Top-4), causing informative but less visually salient content to be overlooked in favor of common yet low-information pages. To address this issue, we propose a Multi-Armed Bandit-based DQA framework (MAB-DQA) to explicitly model the varying importance of multiple implicit aspects in a query. Specifically, MAB-DQA decomposes a query into aspect-aware subqueries and retrieves an aspect-specific candidate set for each. It treats each subquery as an arm and uses preliminary reasoning results from a small number of representative pages as reward signals to estimate aspect utility. Guided by an exploration-exploitation policy, MAB-DQA dynamically reallocates retrieval budgets toward high-value aspects. With the most informative pages and their correlations, MAB-DQA generates the expected results. On four benchmarks, MAB-DQA shows an average improvement of 5%-18% over the state-of-the-art method, consistently enhancing document understanding. Code at this https URL.
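将子查询视为"臂"并按效用动态分配检索预算的探索-利用策略,可用标准 UCB 风格的玩具实现示意(非论文算法;此处以固定均值回报模拟初步推理产生的奖励信号):

```python
import math

def allocate(probe_means, budget, c=1.0):
    """probe_means: 各子查询(臂)初探得到的平均奖励; 返回各臂分到的额外预算."""
    k = len(probe_means)
    counts = [1] * k                 # 每个臂先用少量代表性页面探测一次
    totals = list(probe_means)
    pulls = [0] * k
    for _ in range(budget):
        n = sum(counts)
        ucb = [totals[i] / counts[i] + c * math.sqrt(math.log(n) / counts[i])
               for i in range(k)]
        i = ucb.index(max(ucb))
        pulls[i] += 1
        counts[i] += 1
        totals[i] += probe_means[i]  # 以固定均值模拟新一轮奖励信号
    return pulls
```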

[IR-13] IAT: Instance-As-Token Compression for Historical User Sequence Modeling in Industrial Recommender Systems

【速读】:该论文旨在解决传统推荐系统中手工设计的序列特征信息容量受限的问题,从而限制了模型性能上限。其核心解决方案是提出一种两阶段序列建模框架——实例作为令牌(Instance-As-Token, IAT),第一阶段将每个历史交互实例的所有特征压缩为统一的实例嵌入(instance embedding),生成紧凑且信息丰富的“令牌”;第二阶段通过时间戳获取固定长度的压缩实例令牌,并采用标准序列建模方法学习长程偏好模式。关键创新在于引入用户顺序压缩机制,更贴合下游序列建模需求,显著提升模型在域内和跨域场景下的表现,并已在电商广告、商场营销及直播电商等工业场景中成功部署。

链接: https://arxiv.org/abs/2604.08933
作者: Xinchun Li,Ning Zhang,Qianqian Yang,Fei Teng,Wenlin Zhao,Huizhi Yang,Heng Shi,Linlan Chen,Yixin Wu,Zhen Wang,Daiye Hou,Fei Qin,Lele Yu,Yaocheng Tan
机构: ByteDance(字节跳动)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Although sophisticated sequence modeling paradigms have achieved remarkable success in recommender systems, the information capacity of hand-crafted sequential features constrains the performance upper bound. To better enhance user experience by encoding historical interaction patterns, this paper presents a novel two-stage sequence modeling framework termed Instance-As-Token (IAT). The first stage of IAT compresses all features of each historical interaction instance into a unified instance embedding, which encodes the interaction characteristics in a compact yet informative token. Both temporal-order and user-order compression schemes are proposed, with the latter better aligning with the demands of downstream sequence modeling. The second stage involves the downstream task fetching fixed-length compressed instance tokens via timestamps and adopting standard sequence modeling approaches to learn long-range preferences patterns. Extensive experiments demonstrate that IAT significantly outperforms state-of-the-art methods and exhibits superior in-domain and cross-domain transferability. IAT has been successfully deployed in real-world industrial recommender systems, including e-commerce advertising, shopping mall marketing, and live-streaming e-commerce, delivering substantial improvements in key business metrics.

[IR-14] Beyond Relevance: Utility-Centric Retrieval in the LLM Era SIGIR2026

【速读】:该论文旨在解决传统信息检索系统过度关注主题相关性(topical relevance)而忽视最终用户任务完成度的问题,即检索效果的评估应从单纯依赖相关性指标转向衡量其对大语言模型(Large Language Models, LLMs)生成质量的实际贡献。其解决方案的关键在于提出一个以LLM为中心的统一框架,将检索目标从“相关性优化”转变为“效用优化”,涵盖LLM无关与LLM特定效用、上下文独立与上下文依赖效用,并连接LLM的信息需求与代理式检索增强生成(agentic RAG),从而为面向LLM的信息获取系统设计提供理论基础与实践指导。

链接: https://arxiv.org/abs/2604.08920
作者: Hengran Zhang,Minghao Tang,Keping Bi,Jiafeng Guo
机构: State Key Laboratory of AI Safety, ICT, CAS (中国科学院自动化研究所人工智能安全重点实验室); University of Chinese Academy of Sciences (中国科学院大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by SIGIR2026

点击查看摘要

Abstract:Information retrieval systems have traditionally optimized for topical relevance-the degree to which retrieved documents match a query. However, relevance only approximates a deeper goal: utility, namely, whether retrieved information helps accomplish a user’s underlying task. The emergence of retrieval-augmented generation (RAG) fundamentally changes this paradigm. Retrieved documents are no longer consumed directly by users but instead serve as evidence for large language models (LLMs) that produce answers. As a result, retrieval effectiveness must be evaluated by its contribution to generation quality rather than by relevance-based ranking metrics alone. This tutorial argues that retrieval objectives are evolving from relevance-centric optimization toward LLM-centric utility. We present a unified framework covering LLM-agnostic versus LLM-specific utility, context-independent versus context-dependent utility, and the connection with LLM information needs and agentic RAG. By synthesizing recent advances, the tutorial provides conceptual foundations and practical guidance for designing retrieval systems aligned with the requirements of LLM-based information access.

[IR-15] BracketRank: Large Language Model Document Ranking via Reasoning-based Competitive Elimination ACL

Quick Read: This paper addresses the challenge of document reranking for tasks requiring complex semantic reasoning, where existing LLM-based rerankers are constrained by context-length limits and order sensitivity and thus struggle to perform deep semantic inference. The core of its solution is the BracketRank framework, which casts document reranking as a reasoning-driven competitive elimination tournament with three key innovations: (1) adaptive grouping based on the model's context limits, (2) reasoning-enhanced prompting that mandates step-by-step relevance explanations, and (3) a bracket-style elimination structure with winner and loser tracks. This design ensures robust document advancement while enabling parallel processing across tournament stages, significantly improving performance on complex multi-step retrieval tasks.

Link: https://arxiv.org/abs/2604.08834
Authors: Abdelrahman Abdallah, Mohammed Ali, Bhawna Piryani, Adam Jatowt
Affiliations: University of Innsbruck
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at ACL main 2026

Abstract:Reasoning-intensive retrieval requires deep semantic inference beyond surface-level keyword matching, posing a challenge for current LLM-based rerankers limited by context constraints and order sensitivity. We propose BracketRank, a framework that treats document reranking as a reasoning-driven competitive tournament. Our approach introduces three key innovations: (1) adaptive grouping based on model context limits, (2) reasoning-enhanced prompts that mandate step-by-step relevance explanations, and (3) a bracket-style elimination structure with winner and loser tracks. This design ensures robust document advancement while enabling parallel processing across competition stages. Evaluation on the BRIGHT reasoning benchmark shows that BracketRank achieves 26.56 nDCG@10, significantly outperforming state-of-the-art baselines including RankGPT-4 (17.0) and Rank-R1-14B (20.5). On TREC datasets, BracketRank achieves 77.90 nDCG@5 on DL 19 and 75.85 nDCG@5 on DL 20, exceeding all baselines, establishing that explicit reasoning within competitive elimination is a powerful paradigm for complex, multi-step retrieval tasks. this https URL
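As a rough illustration of the bracket-style elimination idea described above, the sketch below runs a tournament in which a `judge` scoring function stands in for the paper's LLM relevance judge; the winner/loser-track bookkeeping and prompting details are simplified away, and all names are illustrative rather than taken from the paper's code.

```python
def bracket_rank(docs, judge, group_size=4):
    """Rank docs by repeated group-wise elimination (toy sketch).

    Each round splits the pool into groups of at most group_size,
    the judge orders each group, and group winners advance until a
    single champion remains; the champion is appended to the final
    ranking and the tournament reruns on the remaining documents.
    """
    remaining = list(docs)
    ranking = []
    while remaining:
        pool = remaining
        while len(pool) > 1:
            # adaptive grouping: one "prompt" per group of documents
            pool = [
                max(pool[i:i + group_size], key=judge)
                for i in range(0, len(pool), group_size)
            ]
        champion = pool[0]
        ranking.append(champion)
        remaining = [d for d in remaining if d != champion]
    return ranking
```

With a trivial numeric judge, the tournament reproduces a full descending sort, at the cost of re-running rounds per extracted champion; the appeal in the LLM setting is that each group fits within one context window.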

[IR-16] Towards Generalizable Representations of Mathematical Strategies

Quick Read: This paper addresses the limitation that existing pretrained encoders for mathematical text struggle to represent students' entire solution pathways, in particular failing to capture the diversity and abstraction of student strategies. Prior approaches either relied on manual labeling (costly and unscalable) or were tied to platform-specific action representations (poor generalizability). The key to its solution is constructing transition-based sequence embeddings: consecutive algebraic states are first encoded with high-capacity pretrained models, and transition embeddings are obtained via vector differences, focusing on the transformations between states rather than problem-specific features; sequence-level embeddings are then learned with the SimCSE framework, using contrastive objectives to pull semantically similar solution pathways together and push dissimilar ones apart. This enables cross-problem, platform-agnostic analysis of student problem-solving behavior, and the embeddings yield measures of strategy uniqueness, diversity, and conformity that correlate significantly with both short-term and distal learning outcomes, offering a scalable new paradigm for educational data mining and automated assessment.

Link: https://arxiv.org/abs/2604.08693
Authors: Siddhartha Pradhan, Ethan Prihar, Erin Ottmar
Affiliations: Worcester Polytechnic Institute
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 10 pages

Abstract:Pretrained encoders for mathematical texts have achieved significant improvements on various tasks such as formula classification and information retrieval. Yet they remain limited in representing and capturing student strategies for entire solution pathways. Previously, this has been accomplished either through labor-intensive manual labeling, which does not scale, or by learning representations tied to platform-specific actions, which limits generalizability. In this work, we present a novel approach for learning problem-invariant representations of entire algebraic solution pathways. We first construct transition embeddings by computing vector differences between consecutive algebraic states encoded by high-capacity pretrained models, emphasizing transformations rather than problem-specific features. Sequence-level embeddings are then learned via SimCSE, using contrastive objectives to position semantically similar solution pathways close in embedding space while separating dissimilar strategies. We evaluate these embeddings through multiple tasks, including multi-label action classification, solution efficiency prediction, and sequence reconstruction, and demonstrate their capacity to encode meaningful strategy information. Furthermore, we derive embedding-based measures of strategy uniqueness, diversity, and conformity that correlate with both short-term and distal learning outcomes, providing scalable proxies for mathematical creativity and divergent thinking. This approach facilitates platform-agnostic and cross-problem analyses of student problem-solving behaviors, demonstrating the effectiveness of transition-based sequence embeddings for educational data mining and automated assessment.
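The core transition-embedding construction above (vector differences between consecutive state encodings) can be sketched in a few lines. In this toy version, plain numeric lists stand in for the embeddings a high-capacity pretrained encoder would produce; the function names and the cosine-similarity comparison are illustrative, not the paper's code.

```python
import math

def transition_embeddings(states):
    """Element-wise differences of consecutive state encodings,
    so each vector captures the transformation applied rather
    than the problem-specific content of either state."""
    return [[b - a for a, b in zip(s0, s1)]
            for s0, s1 in zip(states, states[1:])]

def pathway_similarity(pa, pb):
    """Cosine similarity of mean transition vectors: pathways that
    apply the same algebraic moves to different problems should
    score near 1 even though their raw states differ."""
    def mean_transition(pathway):
        trans = transition_embeddings(pathway)
        dims = range(len(trans[0]))
        return [sum(t[d] for t in trans) / len(trans) for d in dims]
    ma, mb = mean_transition(pa), mean_transition(pb)
    dot = sum(x * y for x, y in zip(ma, mb))
    return dot / (math.hypot(*ma) * math.hypot(*mb))
```

Shifting a whole pathway by a constant (a different "problem," same moves) leaves its transitions unchanged, which is exactly the problem-invariance the abstract describes.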

[IR-17] PRAGMA: Revolut Foundation Model

Quick Read: This paper addresses general-purpose representation learning for multi-source banking event sequences in finance, i.e., how to extract universally useful features from massive, heterogeneous, discrete transaction and event data to support downstream tasks such as credit scoring, fraud detection, and lifetime-value prediction. The key to its solution is PRAGMA, a Transformer-based foundation model pretrained on a large-scale banking event corpus with a self-supervised masked-modelling objective tailored to discrete, variable-length financial records, yielding high-quality embeddings; strong downstream performance can then be achieved by training only a lightweight linear model on these embeddings, with further gains from fine-tuning, providing a reusable general-purpose representation layer for financial applications.

Link: https://arxiv.org/abs/2604.08649
Authors: Maxim Ostroukhov, Ruslan Mikhailov, Vladimir Iashin, Artem Sokolov, Andrei Akshonov, Vitaly Protasov, Dmitrii Beloborodov, Vince Mullin, Roman Yokunda Enzmann, Georgios Kolovos, Jason Renders, Pavel Nesterov, Anton Repushko
Affiliations: Revolut Research; NVIDIA
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Information Retrieval (cs.IR); Computational Finance (q-fin.CP)
Comments:

Abstract:Modern financial systems generate vast quantities of transactional and event-level data that encode rich economic signals. This paper presents PRAGMA, a family of foundation models for multi-source banking event sequences. Our approach pre-trains a Transformer-based architecture with masked modelling on a large-scale, heterogeneous banking event corpus using a self-supervised objective tailored to the discrete, variable-length nature of financial records. The resulting model supports a wide range of downstream tasks such as credit scoring, fraud detection, and lifetime value prediction: strong performance can be achieved by training a simple linear model on top of the extracted embeddings and can be further improved with lightweight fine-tuning. Through extensive evaluation on downstream tasks, we demonstrate that PRAGMA achieves superior performance across multiple domains directly from raw event sequences, providing a general-purpose representation layer for financial applications.
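The masked-modelling objective described above can be illustrated with a minimal data-preparation sketch: a fraction of events in a sequence is hidden behind a mask token, and the (hypothetical) model is trained to reconstruct exactly those positions. The event names, mask token, and masking policy here are illustrative assumptions, not PRAGMA's actual pipeline.

```python
import random

MASK = "[MASK]"

def mask_events(events, mask_prob=0.3, seed=0):
    """Build (input, target) pairs for masked event modelling:
    masked positions carry the original event as the prediction
    target; unmasked positions contribute no loss (target None)."""
    rng = random.Random(seed)  # seeded for reproducibility
    inputs, targets = [], []
    for ev in events:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(ev)      # model must predict this event
        else:
            inputs.append(ev)
            targets.append(None)    # no loss at unmasked positions
    return inputs, targets
```

Because banking events are discrete and sequences vary in length, this per-position masking applies uniformly regardless of sequence length, which is the property the abstract's "tailored" objective relies on.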

[IR-18] Retrieval Augmented Classification for Confidential Documents

Quick Read: This paper addresses the unstable performance and security risks in confidential-document classification caused by imbalanced data distributions and parameter-level leakage. The key to its solution is a Retrieval Augmented Classification (RAC) approach that grounds decisions in similarity matching against an external vector store, keeping sensitive content out of model weights and achieving low-leakage, robust classification. RAC is more stable under class imbalance and can incorporate new data immediately without retraining, making it well suited to changing real-world conditions and governance requirements.

Link: https://arxiv.org/abs/2604.08628
Authors: Yeseul E. Chang, Rahul Kailasa, Simon Shim, Byunghoon Oh, Jaewoo Lee
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Appears in: KSII The 17th International Conference on Internet (ICONI) 2025, Dec 2025. 7 pages (48-54)

Abstract:Unauthorized disclosure of confidential documents demands robust, low-leakage classification. In real work environments, there is a lot of inflow and outflow of documents. To continuously update knowledge, we propose a methodology for classifying confidential documents using Retrieval Augmented Classification (RAC). To confirm this effectiveness, we compare RAC and supervised fine tuning (FT) on the WikiLeaks US Diplomacy corpus under realistic sequence-length constraints. On balanced data, RAC matches FT. On unbalanced data, RAC is more stable while delivering comparable performance–about 96% Accuracy on both the original (unbalanced) and augmented (balanced) sets, and up to 94% F1 with proper prompting–whereas FT attains 90% F1 trained on the augmented, balanced set but drops to 88% F1 trained on the original, unbalanced set. When robust augmentation is infeasible, RAC provides a practical, security-preserving path to strong classification by keeping sensitive content out of model weights and under your control, and it remains robust as real-world conditions change in class balance, data, context length, or governance requirements. Because RAC grounds decisions in an external vector store with similarity matching, it is less sensitive to label skew, reduces parameter-level leakage, and can incorporate new data immediately via reindexing–a difficult step for FT, which typically requires retraining. The contributions of this paper are threefold: first, a RAC-based classification pipeline and evaluation recipe; second, a controlled study that isolates class imbalance and context-length effects for FT versus RAC in confidential-document grading; and third, actionable guidance on RAC design patterns for governed deployments.

[IR-19] SkillForge: Forging Domain-Specific Self-Evolving Agent Skills in Cloud Technical Support SIGIR2026

Quick Read: This paper addresses the difficulty of improving skill quality after deployment for LLM-powered agents in enterprise settings: existing skill creators lack domain grounding, producing skills misaligned with real task requirements, and there is no systematic mechanism to trace execution failures back to skill deficiencies and drive targeted refinement, leaving skill quality stagnant. The key to its solution is SkillForge, a framework built around an end-to-end create-evaluate-refine loop: a Domain-Contextualized Skill Creator first generates high-quality initial skills from knowledge bases and historical tickets, and a three-stage automated diagnostic pipeline (Failure Analyzer, Skill Diagnostician, Skill Optimizer) then analyzes execution failures in batch, pinpoints skill deficiencies, and rewrites skills, enabling continuous self-evolution. Experiments show this mechanism progressively improves skill quality from diverse starting points (including expert-authored, domain-created, and generic skills), even surpassing manually curated expert knowledge.

Link: https://arxiv.org/abs/2604.08618
Authors: Xingyan Liu, Xiyue Luo, Linyu Li, Ganghong Huang, Jianfeng Liu, Honglin Qiao
Affiliations: Alibaba Cloud Computing, Alibaba Group
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted at ACM SIGIR 2026 Industry Track. 18 pages, 5 figures, 3 tables

Abstract:Deploying LLM-powered agents in enterprise scenarios such as cloud technical support demands high-quality, domain-specific skills. However, existing skill creators lack domain grounding, producing skills poorly aligned with real-world task requirements. Moreover, once deployed, there is no systematic mechanism to trace execution failures back to skill deficiencies and drive targeted refinements, leaving skill quality stagnant despite accumulating operational evidence. We introduce SkillForge, a self-evolving framework that closes an end-to-end creation-evaluation-refinement loop. To produce well-aligned initial skills, a Domain-Contextualized Skill Creator grounds skill synthesis in knowledge bases and historical support tickets. To enable continuous self-optimization, a three-stage pipeline – Failure Analyzer, Skill Diagnostician, and Skill Optimizer – automatically diagnoses execution failures in batch, pinpoints the underlying skill deficiencies, and rewrites the skill to eliminate them. This cycle runs iteratively, allowing skills to self-improve with every round of deployment feedback. Evaluated on five real-world cloud support scenarios spanning 1,883 tickets and 3,737 tasks, experiments show that: (1) the Domain-Contextualized Skill Creator produces substantially better initial skills than the generic skill creator, as measured by consistency with expert-authored reference responses from historical tickets; and (2) the self-evolution loop progressively improves skill quality from diverse starting points (including expert-authored, domain-created, and generic skills) across successive rounds, demonstrating that automated evolution can surpass manually curated expert knowledge.

[IR-20] Pretrain-then-Adapt: Uncertainty-Aware Test-Time Adaptation for Text-based Person Search SIGIR2026

Quick Read: This paper addresses the performance degradation of text-based person search in real-world deployment, where labeled target-domain data is scarce. Existing methods follow a pretrain-then-finetune paradigm requiring large amounts of labeled target-domain data, which is impractical. The key to its solution is a new pretrain-then-adapt paradigm that mitigates domain shift via offline test-time adaptation (TTA) using only unlabeled test data. Its core innovation is the Uncertainty-Aware Test-Time Adaptation (UATTA) framework, which estimates sample uncertainty from bidirectional retrieval disagreement: an image-text pair that ranks highly in both image-to-text and text-to-image retrieval is treated as well aligned and low-uncertainty, and otherwise as high-uncertainty. This label-free indicator drives model recalibration, substantially improving cross-domain generalization, with effectiveness validated on multiple benchmark datasets.

Link: https://arxiv.org/abs/2604.08598
Authors: Jiahao Zhang, Shaofei Huang, Yaxiong Wang, Zhedong Zheng
Affiliations: University of Macau; Hefei University of Technology
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM SIGIR 2026

Abstract:Text-based person search faces inherent limitations due to data scarcity, driven by stringent privacy constraints and the high cost of manual annotation. To mitigate this, existing methods usually rely on a Pretrain-then-Finetune paradigm, where models are first pretrained on synthetic person-caption data to establish cross-modal alignment, followed by fine-tuning on labeled real-world datasets. However, this paradigm lacks practicality in real-world deployment scenarios, where large-scale annotated target-domain data is typically inaccessible. In this work, we propose a new Pretrain-then-Adapt paradigm that eliminates reliance on extensive target-domain supervision through an offline test-time adaptation manner, enabling dynamic model adaptation using only unlabeled test data with minimal post-train time cost. To mitigate overconfidence with false positives of previous entropy-based test-time adaptation, we propose an Uncertainty-Aware Test-Time Adaptation (UATTA) framework, which introduces a bidirectional retrieval disagreement mechanism to estimate uncertainty, i.e., low uncertainty is assigned when an image-text pair ranks highly in both image-to-text and text-to-image retrieval, indicating high alignment; otherwise, high uncertainty is detected. This indicator drives offline test-time model recalibration without labels, effectively mitigating domain shift. We validate UATTA on four benchmarks, i.e., CUHK-PEDES, ICFG-PEDES, RSTPReid, and PAB, showing consistent improvements across both CLIP-based (one-stage) and XVLM-based (two-stage) frameworks. Ablation studies confirm that UATTA outperforms existing offline test-time adaptation strategies, establishing a new benchmark for label-efficient, deployable person search systems. Our code is available at this https URL.
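The bidirectional retrieval disagreement mechanism above can be sketched with a toy uncertainty score over a similarity matrix. The exact scoring rule here (worst-direction rank of the candidate pair) is an illustrative choice for exposition, not necessarily the paper's formulation.

```python
def bidirectional_uncertainty(sim):
    """Uncertainty from bidirectional retrieval disagreement (sketch).

    sim[i][j] is the similarity of image i and text j, with (i, i)
    the candidate match. A pair ranked at the top in BOTH
    image-to-text and text-to-image retrieval gets uncertainty 0;
    the worse its rank in either direction, the higher its score.
    """
    n = len(sim)
    scores = []
    for i in range(n):
        # number of texts that beat the paired text for image i
        r_i2t = sum(sim[i][j] > sim[i][i] for j in range(n))
        # number of images that beat the paired image for text i
        r_t2i = sum(sim[j][i] > sim[i][i] for j in range(n))
        scores.append(max(r_i2t, r_t2i))  # worst-direction rank
    return scores
```

Pairs with score 0 agree in both directions and would be trusted during adaptation; higher scores flag the likely false positives that entropy-based test-time adaptation tends to overtrust.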

[IR-21] Unbiased Rectification for Sequential Recommender Systems Under Fake Orders

Quick Read: This paper addresses the threat fake orders pose to sequential recommender systems: artificially manipulated interactions (e.g., click farming, context-irrelevant substitutions, and sequential perturbations) mislead recommendation results, distorting genuine user preferences and inflating the exposure of specific items. To avoid the heavy computational and time cost of retraining, the paper proposes Dual-view Identification and Targeted Rectification (DITaR), whose core idea is to extract differentiated representations from a collaborative view and a semantic view to precisely identify suspicious fake samples, and then use gradient ascent to rectify only the truly harmful fake orders, preserving the potentially useful information in fake orders while eliminating residual bias. The method keeps the original data volume and sequence structure intact, achieving efficient, unbiased system rectification and outperforming state-of-the-art methods in recommendation quality, computational efficiency, and system robustness.

Link: https://arxiv.org/abs/2604.08550
Authors: Qiyu Qin, Yichen Li, Haozhao Wang, Cheng Wang, Rui Zhang, Ruixuan Li
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fake orders pose increasing threats to sequential recommender systems by misleading recommendation results through artificially manipulated interactions, including click farming, context-irrelevant substitutions, and sequential perturbations. Unlike injecting carefully designed fake users to influence recommendation performance, fake orders embedded within genuine user sequences aim to disrupt user preferences and mislead recommendation results, thereby manipulating exposure rates of specific items to gain competitive advantages. To protect users’ authentic interest preferences and eliminate misleading information, this paper aims to perform precise and efficient rectification on compromised sequential recommender systems while avoiding the enormous computational and time costs of retraining existing models. Specifically, we identify that fake orders are not absolutely harmful - in certain cases, partial fake orders can even have a data augmentation effect. Based on this insight, we propose Dual-view Identification and Targeted Rectification (DITaR), which primarily identifies harmful samples to achieve unbiased rectification of the system. The core idea of this method is to obtain differentiated representations from collaborative and semantic views for precise detection, and then filters detected suspicious fake orders to select truly harmful ones for targeted rectification with gradient ascent. This ensures that useful information in fake orders is not removed while preventing bias residue. Moreover, it maintains the original data volume and sequence structure, thus protecting system performance and trustworthiness to achieve optimal unbiased rectification. Extensive experiments on three datasets demonstrate that DITaR achieves superior performance compared to state-of-the-art methods in terms of recommendation quality, computational efficiency, and system robustness.
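The targeted-rectification step above, "gradient ascent on truly harmful samples," can be illustrated with a minimal scalar sketch: instead of descending the loss gradient (learning), the update ascends it on the harmful batches, unlearning the patterns they induced. The function names, the scalar parameterization, and the full-batch update are all illustrative assumptions, not DITaR's implementation.

```python
def gradient_ascent_rectify(params, grad_fn, harmful_batches, lr=0.1):
    """Toy rectification sketch: for batches judged truly harmful,
    move parameters ALONG the loss gradient ('+' instead of '-'),
    pushing the model away from what those fake orders taught it."""
    for batch in harmful_batches:
        grads = grad_fn(params, batch)
        params = [p + lr * g for p, g in zip(params, grads)]  # ascent
    return params
```

With a loss of (p - 1)^2 (minimized at p = 1, the "fake" target), one ascent step from p = 0 moves the parameter further from 1, i.e., away from the manipulated optimum.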

[IR-22] VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering

Quick Read: This paper addresses factual inconsistency caused by hallucination when generative AI answers biomedical questions, i.e., generated answers may contain unsupported statements or incorrect citations. The key to its solution, an open-source expert system named VerifAI, is combining retrieval-augmented generation (RAG) with a novel post-hoc claim verification mechanism: generated answers are decomposed into atomic claims, and a fine-tuned natural language inference (NLI) engine verifies each claim against the retrieved evidence, substantially reducing hallucinations while providing a traceable evidence lineage. The approach outperforms GPT-4 on the HealthVer benchmark, and the overall system features a modular design, high retrieval precision (MAP@10 of 42.7%), and citation-aware generation, making it suitable for high-stakes applications.

Link: https://arxiv.org/abs/2604.08549
Authors: Miloš Košprdić, Adela Ljajić, Bojana Bašaragin, Darija Medvecki, Lorenzo Cassano, Nikola Milošević
Affiliations: The Institute for Artificial Intelligence Research and Development of Serbia; Bayer A.G.
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We introduce VerifAI, an open-source expert system for biomedical question answering that integrates retrieval-augmented generation (RAG) with a novel post-hoc claim verification mechanism. Unlike standard RAG systems, VerifAI ensures factual consistency by decomposing generated answers into atomic claims and validating them against retrieved evidence using a fine-tuned natural language inference (NLI) engine. The system comprises three modular components: (1) a hybrid Information Retrieval (IR) module optimized for biomedical queries (MAP@10 of 42.7%), (2) a citation-aware Generative Component fine-tuned on a custom dataset to produce referenced answers, and (3) a Verification Component that detects hallucinations with state-of-the-art accuracy, outperforming GPT-4 on the HealthVer benchmark. Evaluations demonstrate that VerifAI significantly reduces hallucinated citations compared to zero-shot baselines and provides a transparent, verifiable lineage for every claim. The full pipeline, including code, models, and datasets, is open-sourced to facilitate reliable AI deployment in high-stakes domains.
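The decompose-then-verify loop above can be sketched in a few lines: each atomic claim is checked against every retrieved passage by an NLI callable. Here `nli` is a toy stand-in for the fine-tuned NLI engine, and the three labels and the supported/contradicted/unverified policy are illustrative assumptions.

```python
def verify_answer(claims, evidence, nli):
    """Post-hoc claim verification sketch. `nli(passage, claim)`
    returns 'entail', 'contradict', or 'neutral'; a claim is
    supported if any passage entails it, contradicted if none
    entails it but some passage contradicts it, else unverified."""
    report = {}
    for claim in claims:
        labels = [nli(passage, claim) for passage in evidence]
        if "entail" in labels:
            report[claim] = "supported"       # evidence backs the claim
        elif "contradict" in labels:
            report[claim] = "contradicted"    # evidence disputes it
        else:
            report[claim] = "unverified"      # possible hallucination
    return report
```

Claims flagged "unverified" or "contradicted" are exactly the candidates a system like VerifAI would surface to the user instead of presenting the answer as fully grounded.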

Human-Computer Interaction

[HC-0] Demonstrably Informed Consent in Privacy Policy Flows: Evidence from a Randomized Experiment

Quick Read: This paper addresses the gap between pro-forma consent and substantively informed consent in privacy-policy consent flows: most privacy policies obtain agreement with a single click, without verifying whether users actually understand the terms. The key to its solution is a design framing of "pedagogical friction": embedding lightweight interventions into the consent flow, such as sectioned presentation, pacing, and a second-review mechanism, to improve comprehension of consequential terms without substantially increasing user burden, and to generate verifiable evidence of informedness. Experimental results show that the slide-based condition (G3) and the paced, sectioned condition (G4) significantly increased first-attempt pass rates, and that the retake mechanism helped participants who initially fell short improve their comprehension, providing empirical support for demonstrably informed consent.

Link: https://arxiv.org/abs/2604.09518
Authors: Qian Ma, Aditya Majumdar, Sarah Rajtmajer, Brett Frischmann
Affiliations: The Pennsylvania State University; Villanova University
Subjects: Human-Computer Interaction (cs.HC)
Comments: 15 pages, 7 tables, 6 figures

Abstract:Privacy policies govern how personal data is collected, used, and shared. Yet, in most privacy-policy consent flows, agreement is operationalized as a single click at the end of a long, opaque policy document. Recent privacy-law scholarship has argued for a standard of demonstrably informed consent. That is, the party drafting and designing privacy-policy consent mechanisms must generate reliable evidence that a person demonstrates comprehension of the consequential terms to which they agree. To this end, we study pedagogical friction as a design framing: minimal interventions embedded within a privacy-policy consent flow that aim to support demonstrated comprehension while keeping burden on the user low. In a randomized experiment, we tested pedagogical friction for demonstrably informed consent in the context of a privacy policy for an edtech app for young children. We recruited 293 parents of kids ages 3-8 to review the app’s privacy policy under one of six conditions that varied presentation format and pacing, then complete a six-question comprehension quiz. Three conditions offered a second policy review and quiz retake for participants who did not pass this quiz on their first attempt. We find that the slide-based condition (G3) achieved the highest first-attempt threshold attainment (≥80%) (41.7%), followed by the paced, sectioned condition (G4) (30.6%). In the retake conditions, 64.9% of participants who completed a second attempt improved their score. Notably, in conditions that did not gate consent on demonstrated comprehension, 97.3% of participants who scored below the threshold still chose to consent, suggesting that ungated consent flows can record agreement without demonstrated comprehension. Our results suggest that pedagogical friction can strengthen the evidentiary basis of consent and clarify what it costs in time and burden.

[HC-1] Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation

Quick Read: This paper addresses the underrepresentation of "mixed-truth" fake news in existing detection benchmarks: such fake news arises through human-AI collaboration, with falsehoods subtly embedded in otherwise accurate and credible narratives, posing a more covert and consequential threat. The key to its solution is MANYFAKE, a synthetic benchmark of 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that systematically capture the diverse ways fake news can be constructed and refined. Evaluations on this benchmark show that although advanced reasoning-enabled models approach saturation on fully fabricated stories, they remain notably brittle against carefully crafted falsehoods interwoven with accurate information.

Link: https://arxiv.org/abs/2604.09514
Authors: Xinyu Wang, Sai Koneru, Wenbo Zhang, Wenliang Zheng, Saksham Ranjan, Sarah Rajtmajer
Affiliations: Pennsylvania State University
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Recent advances in large language models (LLMs) have enabled the large-scale generation of highly fluent and deceptive news-like content. While prior work has often treated fake news detection as a binary classification problem, modern fake news increasingly arises through human-AI collaboration, where strategic inaccuracies are embedded within otherwise accurate and credible narratives. These mixed-truth cases represent a realistic and consequential threat, yet they remain underrepresented in existing benchmarks. To address this gap, we introduce MANYFAKE, a synthetic benchmark containing 6,798 fake news articles generated through multiple strategy-driven prompting pipelines that capture many ways fake news can be constructed and refined. Using this benchmark, we evaluate a range of state-of-the-art fake news detectors. Our results show that even advanced reasoning-enabled models approach saturation on fully fabricated stories, but remain brittle when falsehoods are subtle, optimized, and interwoven with accurate information.

[HC-2] Silence and Noise: Self-censorship and Opinion Expression on Social Media

Quick Read: This paper addresses the comparatively neglected phenomenon of online self-censorship, focusing on its impact on participation in public discourse in polarized contexts. Using a mixed-methods analysis of 390 survey responses and 20 semi-structured interviews, the study reveals gaps between users' publicly expressed opinions and privately held beliefs, and highlights the strong influence of community context on self-expression. The key findings are that users embedded in larger audiences, with lower posting frequency and weaker perceived support, are more likely to suppress expression, and that those who do speak often adjust their views to align with perceived group norms. These results complement research on echo chambers and opinion reinforcement, underscoring the silence within the noise and its potential harm to public discourse.

Link: https://arxiv.org/abs/2604.09465
Authors: Xinyu Wang, Emma Carpenetti, Bruce Desmarais, Sarah Rajtmajer
Affiliations: Pennsylvania State University
Subjects: Social and Information Networks (cs.SI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Unlike the more observable phenomenon of group opinion reinforcement, self-censorship online has received comparatively less attention. Our goal in this work is to dissect the phenomena of self-censorship and to examine the implications of restrained expression for participation in public discourse, particularly in polarized contexts. We explore how social media users express their opinions online through analyses of 390 survey responses and 20 semi-structured interviews using a mixed-methods approach. We ask social media users about the differences between their publicly shared opinions and privately held beliefs, highlighting the influence of contextual factors on self-expression. Our findings show that self-censorship is associated with community context; social media users embedded within larger audiences, with lower posting frequency and perceived support, are less likely to express their opinions, and those who do speak often adjust their expressed views to align with perceived group norms. The study complements the rich literature on echo chambers and opinion reinforcement on social media platforms, highlighting the silence within the noise and its potential consequences for public discourse, which have become increasingly pertinent in an era where online platforms are pivotal to social and political narratives.

[HC-3] Confidence Without Competence in AI-Assisted Knowledge Work

Quick Read: This paper addresses the problem that large language models (LLMs), by providing fast and complete answers, may suppress deep thinking and foster overconfidence in student learning. Its central challenge is how to promote deeper learning and reflection without substantially increasing cognitive load. The key to its solution is an interactive system named Deep3 with three novel interaction modes: future-self explanations, contrastive learning, and guided hints. An empirical study shows these modes differentially shape the relationship among cognitive load, perceived understanding, and actual learning: future-self explanations impose a higher cognitive burden but best align perceived with actual understanding, while guided hints achieve the largest learning gains with minimal frustration, revealing a systematic divergence among effort, confidence, and learning.

Link: https://arxiv.org/abs/2604.09444
Authors: Elena Eleftheriou, George Pallis, Marios Constantinides
Affiliations: University of Cyprus; CYENS Centre of Excellence
Subjects: Human-Computer Interaction (cs.HC)
Comments: 25 pages, 13 figures, 5 tables

Abstract:Large Language Models (LLMs) are widely used by students, yet their tendency to provide fast and complete answers may discourage reflection and foster overconfidence. We examined how alternative LLM interaction designs support deeper thinking without excessively increasing cognitive burden. We conducted a two-phase mixed-methods study. In Phase 1, interviews with 16 Gen Z students informed the design of Deep3, a web-based system with three interaction modes: (a) future-self explanations, (b) contrastive learning, and (c) guided hints. In Phase 2, we evaluated Deep3 with 85 participants across two learning tasks. We found that a standard single-agent baseline produced high perceived understanding despite the lowest objective learning. In contrast, future-self explanations imposed higher cognitive workload yet yielded the closest alignment between perceived and actual understanding, while guided hints achieved the largest learning gains without a proportional increase in frustration. These findings show that effort, confidence, and learning systematically diverge in LLM-supported work.

[HC-4] Intent Lenses: Inferring Capture-Time Intent to Transform Opportunistic Photo Captures into Structured Visual Notes

Quick Read: This paper addresses the problem that photos opportunistically captured in information-rich environments (e.g., slides, exhibits, or artifacts) rarely become meaningful notes. Existing automatic note-generation approaches offer some support but often produce generic summaries that fail to reflect the user's actual capture intent. The key to its solution is "Intent Lenses," a conceptual primitive that infers users' capture-time intent from the captured content and reifies it as reusable interactive objects encoding the function to perform, the information sources to focus on, and how results are represented at an appropriate level of detail. These lenses are dynamically generated using the reasoning capabilities of large language models and instantiated in an interactive system that lets users add, link, and arrange lenses across multiple photos, enabling structured visual notes and deeper sensemaking.

Link: https://arxiv.org/abs/2604.09438
Authors: Ashwin Ram, Aeneas Leon Sommer, Martin Schmitz, Jürgen Steimle
Affiliations: Saarland University; University of Koblenz
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Opportunistic photo capture (e.g., slides, exhibits, or artifacts) is a common strategy for preserving information encountered in information-rich environments for later revisitation. While fast and minimally disruptive, such photo collections rarely become meaningful notes. Existing automatic note-generation approaches provide some support but often produce generic summaries that fail to reflect what users intended to capture. We introduce Intent Lenses, a conceptual primitive for intent-mediated note generation and sensemaking. Intent Lenses reify users’ capture-time intent inferred from captured information into reusable interactive objects that encode the function to perform, the information sources to focus on, and how results are represented at an appropriate level of detail. These lenses are dynamically generated using the reasoning capabilities of large language models. To investigate this concept, we instantiate Intent Lenses in the context of academic conference photos and present an interactive system that infers lenses from presentation captures to generate structured visual notes on a spatial canvas. Users can further add, link, and arrange lenses across captures to support exploration and sensemaking. A study with nine academics showed that intent-mediated notes aligned with users’ expectations, providing effective overviews of their captures while facilitating deeper sensemaking.

[HC-5] Insights from Farmer-Managed Decentralized Solar Irrigation Systems

Quick Read: This paper addresses the maintenance difficulties of distributed solar irrigation systems deployed in rural areas, which stem from their remote and decentralized nature. The key to its solution is the use of messaging platforms such as WhatsApp as informal digital infrastructure: farmers share daily generation data, compare performance across installations, and identify potential system anomalies, enabling collective sensemaking and collaborative maintenance. These social-network-based practices constitute a community-driven socio-technical platform, suggesting new directions for designing agricultural energy technologies that support peer comparison, contextual interpretation, and community-driven maintenance.

Link: https://arxiv.org/abs/2604.09395
Authors: Arnab Paul Choudhury, Rahul Rathod, Aryan Yadav
Affiliations: Viksit Labs Foundation; Sustain Plus Energy Foundation
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 6 pages, 2 figures, Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems

Abstract:Solar irrigation systems are increasingly deployed in rural regions, yet their distributed and remote deployment makes maintenance challenging for farmers. While formal monitoring processes and applications exist, they often fall short in practice. We present insights from grid-connected solar irrigation schemes that incentivize farmers to feed energy to the grid, focusing on how farmers maintain their systems. We found that farmers face multiple challenges but are also devising strategies, including the appropriation of WhatsApp to share daily generation data with peers and compare performance across installations to identify potential system anomalies. Our findings highlight how messaging platforms function as informal digital infrastructures enabling collective sensemaking around distributed energy systems. We discuss implications for designing agricultural energy technologies that support peer comparison, contextual interpretation, and community-driven maintenance, framing these as a socio-technical platform. Finally, we outline directions for future work integrating such practices with formal monitoring tools and explore their potential to support citizen science initiatives in environmental sensing.

[HC-6] 3D-Printing Water-Soluble Channels Filled with Liquid Metal for Recyclable and Cuttable Wireless Power Sheet MICRO

Quick Read: This paper addresses the problem that conventional 2D wireless power transfer (WPT) systems lose functionality after physical damage or modification. The key to its solution is a design combining H-tree wiring with water-soluble channels filled with liquid metal (LM): the H-tree topology keeps the remaining coils operational even after the outer region of the sheet is cut away, while the LM, encapsulated in 3D-printed polyvinyl alcohol (PVA) channels, can be recovered by dissolving the channels in water, enabling material recycling. Experiments show the WPT sheet achieves a Q-factor above 55 at 6.78 MHz, maintains stable mechanical and electrical properties over 100 bending cycles, and recovers 98% of the LM with stable conductivity after four dissolution-refabrication cycles, offering a practical path to integrating long-term, continuously operating electronics into IoT and ambient-computing scenarios.

Link: https://arxiv.org/abs/2604.09299
Authors: Takashi Sato, Ryo Takahashi, Kento Yamagishi, Takao Someya, Michinao Hashimoto, Eiji Iwase, Yoshihiro Kawahara, Junya Kurumida, Wataru Iwasaki
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 4 pages, 9 figures. In Proceedings of the IEEE 39th International Conference on Micro Electro Mechanical Systems (MEMS 2026)

Abstract:A recyclable and cuttable wireless power transfer (WPT) sheet is proposed, enabled by H-tree wiring and water-soluble channels filled with liquid metal (LM). Conventional 2D WPT systems lose their functionality when physically damaged or modified. The H-tree wiring pattern maintains the operation of the remaining coils even after the outer region of the sheet is cut away. The LM can be recovered by dissolving 3D-printed polyvinyl alcohol (PVA) channels in water. The sheet dimensions were experimentally optimized, and a Q-factor over 55 was achieved at 6.78 MHz. The sheet maintained its bending stiffness and electrical resistance during 100 bending cycles. After four dissolution-refabrication cycles, 98 percent of the LM was recovered with stable electrical properties. The WPT sheet can be integrated into everyday objects and enables long-term, continuous operation of surrounding electronic devices, contributing to IoT applications and ambient computing.

[HC-7] LandSAR: Visceralizing Landslide Data for Enhanced Situational Awareness in Immersive Analytics

【Quick Read】: This paper tackles the cognitive gap between analysts and real-world terrain created by the vast, abstract datasets that computational landslide simulation produces; traditional methods make it hard to intuitively grasp landslide dynamics, limiting accurate hazard assessment and response planning. The key to the solution is LandSAR, a system that integrates Immersive Analytics (IA) with data visceralization: landslide data are rendered as 3D visualizations, and 3D-printed terrain models serve as tangible interfaces for gesture-driven, intuitive geographic perception and real-time simulation. The design markedly improves users' Situational Awareness (SA), supports multi-perspective "what-if" analyses, and strengthens experts' understanding of and engagement with landslide dynamics.

Link: https://arxiv.org/abs/2604.09241
Authors: Wong Kam-Kwai, Yi-Lin Ye, Wai Tong, Haobo Li, Kentaro Takahira, Aastha Bhatta, Sunil Poudyal, Charles Wang Wai Ng, Huamin Qu, Leni Yang
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 11 pages. A preprint version accepted to PacificVis'26

Click to view abstract

Abstract:Landslides pose a significant threat to public safety, but their dynamic processes are difficult to analyze from post-event observation alone. Computational simulation is therefore essential, but it generates vast, abstract datasets that create a cognitive gap between the analyst and the real-world, physical terrain. While Immersive Analytics (IA) begins to bridge this gap by visualizing data in 3D, we explore how these systems evolve beyond abstract data and integrate data visceralization to enhance Situational Awareness (SA). We present LandSAR, an immersive analytics system that enhances SA for landslide analysis by visceralizing landslide data through integrated simulations and visualizations. LandSAR supports real-time simulations of landslide dynamics, prevention strategies, and climate impacts, enabling multi-perspective what-if analyses. The system uses 3D-printed terrain models as tangible interfaces to facilitate haptic feedback and enable gesture-based exploration, allowing for intuitive geographical perception. Expert interviews and workshops demonstrate that LandSAR effectively improves SA and engagement.

[HC-8] Artificial intelligence can persuade people to take political actions

【Quick Read】: This paper examines a key question: most existing work measures the persuasive power of generative AI through attitude change, but whether such attitude-based findings extrapolate to real-world behavior change has been unclear. To test this, the authors ran two large preregistered experiments (N=17,950) in which conversational AI models tried to persuade participants toward consequential actions such as signing real petitions and donating to charity. The key to the solution is directly measuring behavioral effects and comparing them with attitudinal ones: AI produced sizable behavioral effects (e.g., +19.7 percentage points on petition signing), yet attitude change and behavior change were uncorrelated; behavioral persuasion strategies generally outperformed the best attitudinal strategy, with only small differences among the behavioral strategies themselves. The results suggest that prior attitude-based studies may substantially over- or under-estimate AI's real behavioral influence, motivating a shift toward behavior-oriented evaluation.

Link: https://arxiv.org/abs/2604.09200
Authors: Kobi Hackenburg, Luke Hewitt, Caroline Wagner, Ben M. Tappin, Christopher Summerfield
Affiliations: University of Oxford; UK AI Security Institute; Stanford University; London School of Economics and Political Science
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 13 pages, 4 figures

Click to view abstract

Abstract:There is substantial concern about the ability of advanced artificial intelligence to influence people's behaviour. A rapidly growing body of research has found that AI can produce large persuasive effects on people's attitudes, but whether AI can persuade people to take consequential real-world actions has remained unclear. In two large preregistered experiments (N=17,950 responses from 14,779 people), we used conversational AI models to persuade participants on a range of attitudinal and behavioural outcomes, including signing real petitions and donating money to charity. We found sizable AI persuasion effects on these behavioural outcomes (e.g. +19.7 percentage points on petition signing). However, we observed no evidence of a correlation between AI persuasion effects on attitudes and behaviour. Moreover, we replicated prior findings that information provision drove effects on attitudes, but found no such evidence for our behavioural outcomes. In a test of eight behavioural persuasion strategies, all outperformed the most effective attitudinal persuasion strategy, but differences among the eight were small. Taken together, these results suggest that previous findings relying on attitudinal outcomes may generalize poorly to behaviour, and therefore risk substantially mischaracterizing the real-world behavioural impact of AI persuasion.

[HC-9] Persona-E2: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events ACL2026

【Quick Read】: This paper addresses a long-standing "missing reader perspective" in affective computing: emotion is usually treated as a static property of text centered on the writer's sentiment, ignoring how personality differences lead different readers to appraise the same event differently. The core challenge is the lack of ground-truth human data linking personality traits to emotional shifts. The key to the solution is Persona-E^2 (Persona-Event2Emotion), a large-scale, richly annotated dataset grounded in structured MBTI (Myers-Briggs Type Indicator) and Big Five personality labels that systematically captures reader-dependent emotional reactions across news, social media, and life narratives. Experiments show the dataset substantially improves LLMs' ability to model appraisal shifts and reveal the key role of Big Five traits in alleviating the "personality illusion."

Link: https://arxiv.org/abs/2604.09162
Authors: Yuqin Yang, Haowu Zhou, Haoran Tu, Zhiwen Hui, Shiqi Yan, HaoYang Li, Dong She, Xianrong Yao, Yang Gao, Zhanpeng Jin
Affiliations: South China University of Technology, Guangzhou, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted by ACL 2026 Main

Click to view abstract

Abstract:Most affective computing research treats emotion as a static property of text, focusing on the writer's sentiment while overlooking the reader's perspective. This approach ignores how individual personalities lead to diverse emotional appraisals of the same event. Although role-playing Large Language Models (LLMs) attempt to simulate such nuanced reactions, they often suffer from "personality illusion" – relying on surface-level stereotypes rather than authentic cognitive logic. A critical bottleneck is the absence of ground-truth human data to link personality traits to emotional shifts. To bridge the gap, we introduce Persona-E^2 (Persona-Event2Emotion), a large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. Extensive experiments reveal that state-of-the-art LLMs struggle to capture precise appraisal shifts, particularly in social media domains. Crucially, we find that personality information significantly improves comprehension, with the Big Five traits alleviating the "personality illusion."

[HC-10] Structuring versus Problematizing: How LLM-based Agents Scaffold Learning in Diagnostic Reasoning

【Quick Read】: This paper addresses the challenges novices face in developing diagnostic reasoning, such as premature closure, over-reliance on heuristics, and difficulty transferring diagnostic strategies to new situations. The key to the solution is PharmaSim Switch, a scenario-based learning (SBL) environment augmented with Learning Analytics (LA) and large language models (LLMs) that implements two theory-driven scaffolding approaches, structuring and problematizing, personalized through a student learning trajectory. Results show that both scaffolding approaches effectively support the use of diagnostic strategies while eliciting different types of engagement, underscoring the value of combining multiple scaffolding approaches when designing LA- and LLM-powered systems.

Link: https://arxiv.org/abs/2604.09158
Authors: Fatma Betül Güreş, Tanya Nazaretsky, Seyed Parsa Neshaei, Tanja Käser
Affiliations: ETH Zürich; EPFL
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures. Accepted at LAK 2026

Click to view abstract

Abstract:Supporting students in developing diagnostic reasoning is a key challenge across educational domains. Novices often face cognitive biases such as premature closure and over-reliance on heuristics, and they struggle to transfer diagnostic strategies to new cases. Scenario-based learning (SBL) enhanced by Learning Analytics (LA) and large language models (LLM) offers a promising approach by combining realistic case experiences with personalized scaffolding. Yet, how different scaffolding approaches shape reasoning processes remains insufficiently explored. This study introduces PharmaSim Switch, an SBL environment for pharmacy technician training, extended with an LA- and LLM-powered pharmacist agent that implements pedagogical conversations rooted in two theory-driven scaffolding approaches: "structuring" and "problematizing", as well as a student learning trajectory. In a between-groups experiment, 63 vocational students completed a learning scenario, a near-transfer scenario, and a far-transfer scenario under one of the two scaffolding conditions. Results indicate that both scaffolding approaches were effective in supporting the use of diagnostic strategies. Performance outcomes were primarily influenced by scenario complexity rather than students' prior knowledge or the scaffolding approach used. The structuring approach was associated with more accurate Active and Interactive participation, whereas problematizing elicited more Constructive engagement. These findings underscore the value of combining scaffolding approaches when designing LA- and LLM-based systems to effectively foster diagnostic reasoning.

[HC-11] What's in a BIP? Exploring the Lived Experiences of Breaks In Presence

【Quick Read】: This paper investigates the poorly understood mechanisms and lived experience of breaks in presence (BIPs) in virtual reality (VR), with the goal of informing more immersive VR design. The key to the solution is a micro-phenomenology approach: 57 BIP episodes experienced by 14 users of a height-exposure VR application are precisely modeled, revealing four generic diachronic patterns — reflected-upon, discarded, self-preservation, and contradictory-mediation BIPs. Building on the PI/Psi model of presence, the paper proposes an awareness-based definition of BIPs and distills three BIP-related design opportunities for user experience.

Link: https://arxiv.org/abs/2604.09146
Authors: Jean-Philippe Rivière, Roman Malo, Sarah Varlin Grassi, Yannick Prié
Affiliations: University of Nantes
Subjects: Human-Computer Interaction (cs.HC)
Comments: To appear in Journal of Virtual Reality, Springer-Nature, 2026

Click to view abstract

Abstract:Occasionally, individuals immersed in a Virtual Reality (VR) environment may experience distractions that disrupt their sense of presence, a phenomenon referred to as a break in presence (BIP). Better understanding BIPs is crucial to designing VR applications that keep their users present. BIPs have been studied using a variety of methods, exploring their origins or trying to detect them from physiological or behavioral measurements. However, despite the importance of understanding how they are actually lived and managed by VR users, very few studies focused on their phenomenological characterization. We employed micro-phenomenology to collect the descriptions of BIPs experienced by users (n=14) of a height exposure VR application. We precisely modeled 57 BIP episodes, bringing to light a variety of experiences and behaviors. Four generic diachronic patterns of BIP episodes emerge: reflected-upon, discarded, self-preservation, and contradictory mediation BIPs. We discuss these in light of the PI/Psi model of presence, propose an awareness-based definition of BIPs, as well as three BIP-related design opportunities.

[HC-12] Enhance Comprehension of Over-the-Counter Drug Instructions for the General Public and Medical Professionals through Visualization Design

【Quick Read】: This paper addresses the shortcomings of over-the-counter (OTC) drug instructions in conveying information, particularly the comprehension barriers faced by two user groups: the general public and medical professionals. The key to the solution is an iterative visualization design process that yields two instruction versions tailored to the respective audiences; a controlled user study shows the new designs significantly outperform traditional text in response time and usability, and that offering both versions is beneficial. The study also contributes a taxonomy of OTC instructions built from an official drug database and distills a generalizable visualization design workflow, providing a systematic approach to making drug instructions more readable and audience-appropriate.

Link: https://arxiv.org/abs/2604.09134
Authors: Mengjie Fan, Katrin Angerbauer, Yinchu Cheng, Yingying Yan, Xiaohan Xu, Tianfu Wang, Michael Sedlmair, Yu Yang, Liang Zhou
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:Drug instructions are crucial for guiding the rational use of medication. We conduct a visualization design study to enhance the comprehension of over-the-counter (OTC) drug instructions, targeting both the general public and medical professionals. We devise two tailored drug instruction designs for different audience groups through an iterative design process. A controlled user study reveals that our design outperforms traditional text-based instructions in terms of response time and usability, and the availability of two versions is also found to be beneficial. This study also motivates a taxonomy based on a systematic classification of OTC drug instructions sampled from an official drug database, which received positive expert feedback. Finally, this study summarizes a workflow for a visualization design strategy based on our design exploration and user study feedback, which can be generalized to other OTC drug instructions.

[HC-13] The Speculative Future of Conversational AI for Neurocognitive Disorder Screening: a Multi-Stakeholder Perspective

【Quick Read】: This paper addresses the limitations of current neurocognitive disorder (NCD) screening in social acceptability, user engagement, and incentives for early medical consultation, especially the question of how to design widely deployable and engaging conversational AI (CAI) solutions. The key to the solution is a human-centered design approach: through in-depth interviews with 36 stakeholders (clinicians, individuals at risk of NCDs, and their caregivers), the study identifies shared expectations — such as deploying CAI in home or community settings — and potential conflicts across groups, for example between users' need for emotional support and clinicians' preference for professional standardization, and derives actionable design implications for integrating CAI into real screening workflows effectively and sustainably.

Link: https://arxiv.org/abs/2604.09070
Authors: Jiaxiong Hu, Ruowen Niu, Qiuxin Du, Chenzhuo Xiang, Yirui Zuo, Jihong Jeung, Xiaojuan Ma
Affiliations: The University of Hong Kong; The Hong Kong University of Science and Technology (Guangzhou); University College London; Tsinghua University; Harbin Institute of Technology; The Hong Kong University of Science and Technology
Subjects: Human-Computer Interaction (cs.HC)
Comments: Under minor revision, 2026

Click to view abstract

Abstract:Neurocognitive disorders (NCDs), such as Alzheimer’s disease, are globally prevalent and require scalable screening methods for proactive management. Prior research has explored the potential of technologies like conversational AI (CAI) to administer NCD screening tests. However, challenges remain in designing CAI-based solutions that make routine NCD screening socially acceptable, engaging, and capable of encouraging early medical consultation. In this study, we conducted interviews with 36 participants, including clinicians, individuals at risk of NCDs, and their caregivers, to explore the speculative future of adopting CAI for NCD screening. Our findings reveal shared expectations, such as deploying CAI in home or community settings to reduce social stress. Nonetheless, conflicts emerged among stakeholders, for example, users’ need for emotional support may conflict with clinicians’ preference for CAI’s professional and standardized administration. Then, we look into the user journey of NCD screening based on the current practice of manual screening and the expected CAI-supported screening. Finally, leveraging the human-centered approach, we provide actionable implications for future CAI design in NCD screening.

[HC-14] TriDeliver: Cooperative Air-Ground Instant Delivery with UAVs, Couriers and Crowdsourced Ground Vehicles

【Quick Read】: This paper addresses the problem that single delivery agents — human couriers, UAVs, or crowdsourced ground vehicles (GVs) — cannot meet surging instant-delivery demand alone due to inherent limitations (low efficiency, labor shortages, flight-control complexity, and limited dynamic adaptability). The key to the solution is TriDeliver, the first hierarchical cooperative framework integrating couriers, UAVs, and crowdsourced GVs for efficient cooperative delivery, together with a Transfer Learning (TL)-based algorithm that extracts scheduling knowledge from couriers' behavioral history and transfers it to UAVs and GVs with fine-tuning. Evaluated on real-world trajectory and delivery datasets, the framework cuts delivery cost by 65.8% versus state-of-the-art UAV-courier cooperation, while further improving delivery time (-17.7%), cost (-9.8%), and the impact on crowdsourced GVs' original tasks (-43.6%).

Link: https://arxiv.org/abs/2604.09049
Authors: Junhui Gao, Yan Pan, Qianru Wang, Wenzhe Hou, Yiqin Deng, Liangliang Jiang, Yuguang Fang
Affiliations: Hong Kong JC STEM Lab of Smart City and Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China; State Key Laboratory of Complex Critical Software Environment, and College of Computer Science and Technology, National University of Defense Technology, Changsha, China; School of Computer Science and Technology, Xidian University, Xi'an, China; National Key Laboratory of Big Data and Decision, National University of Defense Technology, China; School of Data Science, Lingnan University, Hong Kong, China; School of Accounting and Finance, Hong Kong Polytechnic University, Kowloon, Hong Kong, China
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Instant delivery, shipping items before critical deadlines, is essential in daily life. While multiple delivery agents, such as couriers, Unmanned Aerial Vehicles (UAVs), and crowdsourced agents, have been widely employed, each of them faces inherent limitations (e.g., low efficiency/labor shortages, flight control, and dynamic capabilities, respectively), preventing them from meeting the surging demands alone. This paper proposes TriDeliver, the first hierarchical cooperative framework, integrating human couriers, UAVs, and crowdsourced ground vehicles (GVs) for efficient instant delivery. To obtain the initial scheduling knowledge for GVs and UAVs as well as improve the cooperative delivery performance, we design a Transfer Learning (TL)-based algorithm to extract delivery knowledge from couriers' behavioral history and transfer their knowledge to UAVs and GVs with fine-tunings, which is then used to dispatch parcels for efficient delivery. Evaluated on one-month real-world trajectory and delivery datasets, it has been demonstrated that 1) by integrating couriers, UAVs, and crowdsourced GVs, TriDeliver reduces the delivery cost by 65.8% versus state-of-the-art cooperative delivery by UAVs and couriers; 2) TriDeliver achieves further improvements in terms of delivery time (-17.7%), delivery cost (-9.8%), and impacts on original tasks of crowdsourced GVs (-43.6%), even with the representation of the transferred knowledge by simple neural networks, respectively.

[HC-15] SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

【Quick Read】: This paper addresses the inability of current multimodal large language models (MLLMs) to model how ongoing interactions update procedural state in clinical skill videos — that is, to track state changes over time and judge whether later actions remain procedurally correct. The key to the solution is SiMing-Bench, the first benchmark targeting this capability: rubric-grounded, process-level judgment over full-length clinical procedure videos, instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos (cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation). Fine-grained modeling of procedural state evolution exposes substantial weaknesses of existing models at intermediate steps, showing that coarse workflow-level assessment alone overestimates their actual procedural reasoning ability.

Link: https://arxiv.org/abs/2604.09037
Authors: Xiyang Huang, Jiawei Lin, Keying Wu, Jiaxin Huang, Kailai Yang, Renxiong Wei, Cheng Zeng, Jiayi Xiang, Ziyan Kuang, Min Peng, Qianqian Xie, Sophia Ananiadou
Affiliations: Wuhan University; Southwest Jiaotong University; MBZUAI; The University of Manchester; Zhongnan Hospital of Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models’ procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.

[HC-16] Accessible Fine-grained Data Representation via Spatial Audio

【Quick Read】: This paper addresses the accessibility of quantitative data visualizations for blind and low-vision (BLV) individuals. Existing pitch-based sonification conveys data trends and coarse comparisons but struggles with fine-grained features such as the sign and exact value of individual data points. Informed by sound-perception research, the key to the solution is a spatial-audio approach that maps data values to sound direction in the azimuth plane, yielding an accessible fine-grained data representation. A user study shows the approach significantly outperforms pitch representation on fine-grained tasks such as recognizing data signs and exact values while performing comparably on trend identification, validating its effectiveness and practicality.

Link: https://arxiv.org/abs/2604.08979
Authors: Can Liu, Wenjie Jiang, Shaolun Ruan, Kotaro Hara, Yong Wang
Affiliations: Nanyang Technological University, Singapore; Singapore Management University, Singapore
Subjects: Human-Computer Interaction (cs.HC); Sound (cs.SD)
Comments: Accepted by IEEE Computer Graphics and Applications (IEEE CGA)

Click to view abstract

Abstract:Pitch-based sonification of quantitative data increases the accessibility of data visualizations that are otherwise inaccessible for blind and low-vision (BLV) individuals. We argue that, although pitch representations can reveal the coarse-grained information of data, such as data trend and value comparison, they cannot effectively convey the fine-grained details like the sign and exact value of individual data points. Informed by existing sound perception research, we propose a spatial audio-based approach by representing data values as the sound direction in the azimuth plane to achieve accessible fine-grained data representation. We conducted a user study with 26 participants (including 10 BLV participants) on four data perception tasks. The results show our approach significantly outperforms pitch representation on fine-grained data perception tasks like recognizing data signs and exact values, and performs similarly on data trend identification, despite its inferior accuracy on data value comparison.
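As an illustration of the paper's core idea — mapping a data value to a sound direction in the azimuth plane — the sketch below pairs a linear value-to-azimuth mapping with a constant-power stereo panning law. This is a hypothetical minimal sketch, not the authors' implementation: the [-90°, +90°] azimuth range and the panning law are assumptions introduced here for illustration.

```python
import math

def value_to_azimuth(value, vmin, vmax, az_min=-90.0, az_max=90.0):
    """Linearly map a data value to an azimuth angle in degrees.

    The [-90, +90] range (hard left to hard right) is an assumed
    design choice, not taken from the paper.
    """
    t = (value - vmin) / (vmax - vmin)
    return az_min + t * (az_max - az_min)

def stereo_gains(azimuth_deg):
    """Constant-power panning: convert an azimuth to (left, right) gains."""
    # Map [-90, +90] degrees onto [0, pi/2] for the panning law.
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)

# A value at the midpoint of its range is rendered straight ahead.
az = value_to_azimuth(5.0, 0.0, 10.0)   # -> 0.0 degrees
left, right = stereo_gains(az)          # equal gains, ~0.707 each
```

Constant-power panning keeps `left² + right² = 1`, so perceived loudness stays roughly constant as the source sweeps across the azimuth plane — a common reason to prefer it over linear amplitude panning.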

[HC-17] How Do LLMs See Charts? A Comparative Study on High-Level Visualization Comprehension in Humans and LLMs

【Quick Read】: This paper addresses the gap between the cognitive mechanisms of generative AI in visualization comprehension and human interpretive strategies, focusing on how large language models (LLMs) interpret visualizations relative to designers' high-level communicative goals. The key to the solution is a qualitative study systematically comparing human and LLM interpretation across three common chart types (line graphs, bar graphs, and scatterplots). It finds that LLMs exhibit a stable but structural, number-enumerating reasoning pattern, whereas humans construct trend-centric narratives — revealing that LLMs rely on cognitive mechanisms distinct from human intuition and pointing to a new direction: designing visualizations with AI comprehension in mind.

Link: https://arxiv.org/abs/2604.08959
Authors: Hyotaek Jeon, Hyunwook Lee, Minjeong Shin, Tapendra Pandey, Joohee Kim, Shinwook Seon, Daeun Jeong, Sungahn Ko, Ghulam Jilani Quadri
Affiliations: Pohang University of Science and Technology (POSTECH), Soongsil University, Australian National University, University of Oklahoma, Ulsan National Institute of Science and Technology (UNIST)
Subjects: Human-Computer Interaction (cs.HC)
Comments: 15 pages, 8 figures, Accepted to EuroVis 2026

Click to view abstract

Abstract:Designers often create visualizations to achieve specific high-level analytical or communication goals. These goals require people to extract complex and interconnected data patterns. Prior perceptual studies of visualization effectiveness have focused on low-level tasks, such as estimating statistical quantities, and have recently explored high-level comprehension of visualization. Despite the growing use of Large Language Models (LLMs) as visualization interpreters, how their interpretations relate to human understanding or what reasoning processes underlie their responses remains insufficiently understood. In this work, we explore LLMs’ visualization comprehension, examining the alignment between designers’ communicative goals and what their audience sees in a visualization. We have conducted a qualitative study to investigate the gap between human interpretative strategies and the reasoning pathways of LLMs across three types of visualizations, line graphs, bar graphs, and scatterplots, to identify the high-level patterns generated by LLMs using three prompt conditions. Our analysis results indicate that LLMs exhibit a consistent interpretative strategy that remains unchanged across prompt constraints. Furthermore, we observe two distinct approaches: humans naturally synthesize data into trend-centric narratives, whereas LLMs persist with a structural enumeration of comparisons and numerical ranges. Lastly, we see LLMs achieve visualization comprehension through mechanisms distinct from human intuition, pointing to critical challenges and new opportunities for visualization design.

[HC-18] Lightweight and Generalizable Multi-Sensor Human Activity Recognition via Cascaded Fusion and Style-Augmented Decomposition

【Quick Read】: This paper addresses two problems in wearable human activity recognition (WHAR): the high computational complexity of attention-based feature-fusion modules, and the lack of robustness to data variation during feature extraction. The key to the solution is a lightweight, generalizable framework that retains the core "decomposition-extraction-fusion" structure with two innovations. First, a Cascaded Fusion Block (CFB) replaces the expensive Attention and Cross-Variable Fusion (CVF) modules, achieving efficient feature interaction via a "compression-recursion-concatenation-fusion" process without explicit attention weights. Second, a MixStyle-based augmentation module is placed before the Local Temporal Feature Extraction (LTFE) and Global Temporal Aggregation (GTA) stages, mixing the mean and variance of samples within a batch with random coefficients to perturb the data distribution and improve generalization without altering core information. While preserving sensor-, variable-, and channel-level independence during decomposition, the framework achieves efficient fusion and robust extraction, outperforming state-of-the-art methods on two benchmarks while cutting computational overhead by more than 30%.

Link: https://arxiv.org/abs/2604.08910
Authors: Wang Chenglong, Zhuo Yan, Ding Wenbo, Chen Xinlei
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 8 pages. arXiv admin note: text overlap with arXiv:2501.10917 by other authors

Abstract:Wearable Human Activity Recognition (WHAR) is a prominent research area within ubiquitous computing, whose core lies in effectively modeling intra- and inter-sensor spatio-temporal relationships from multi-modal time series data. Existing methods either suffer from high computational complexity due to attention-based fusion or lack robustness to data variations during feature extraction. To address these issues, we propose a lightweight and generalizable framework that retains the core “decomposition-extraction-fusion” paradigm while introducing two key innovations. First, we replace the computationally expensive Attention and Cross-Variable Fusion (CVF) modules with a Cascaded Fusion Block (CFB), which achieves efficient feature interaction without explicit attention weights through the operational process of “compression-recursion-concatenation-fusion”. Second, we integrate a MixStyle-based data augmentation module before the Local Temporal Feature Extraction (LTFE) and Global Temporal Aggregation (GTA) stages. By mixing the mean and variance of different samples within a batch and introducing random coefficients to perturb the data distribution, the model’s generalization ability is enhanced without altering the core information of the data. The proposed framework maintains sensor-level, variable-level, and channel-level independence during the decomposition phase, and achieves efficient feature fusion and robust feature extraction in subsequent processes. Experiments on two benchmark datasets (Realdisp, Skoda) demonstrate that our model outperforms state-of-the-art methods in both accuracy and macro-F1 score, while reducing computational overhead by more than 30% compared to attention-based baselines. This work provides a practical solution for WHAR applications on resource-constrained wearable devices.
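The augmentation step described above — mixing per-sample mean and variance within a batch under random coefficients — follows the general MixStyle recipe, which can be sketched in a few lines. This is a minimal NumPy sketch under assumed `(batch, channels, time)` input shapes; the `Beta(0.1, 0.1)` coefficient distribution comes from the original MixStyle method and may differ from this paper's exact choice.

```python
import numpy as np

def mixstyle(x, alpha=0.1, eps=1e-6, rng=None):
    """Mix per-instance feature statistics across a batch (MixStyle-like).

    x: array of shape (batch, channels, time). Each instance's mean and
    std over the time axis are blended with those of a randomly permuted
    partner, perturbing the data distribution while leaving the
    normalized temporal content untouched.
    """
    rng = rng or np.random.default_rng()
    mu = x.mean(axis=-1, keepdims=True)                 # (B, C, 1)
    sig = np.sqrt(x.var(axis=-1, keepdims=True) + eps)  # (B, C, 1)
    x_norm = (x - mu) / sig                             # statistics removed

    perm = rng.permutation(x.shape[0])                  # shuffle the batch
    lam = rng.beta(alpha, alpha, size=(x.shape[0], 1, 1))
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return x_norm * sig_mix + mu_mix
```

Because only first- and second-order statistics are perturbed, the activity-discriminative temporal shape of each window is preserved, which is what makes the augmentation safe to apply before the LTFE and GTA stages.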

[HC-19] Omakase: proactive assistance with actionable suggestions for evolving scientific research projects

【Quick Read】: This paper addresses the reactive nature of current research-assistance systems in long-running projects: lacking project context, AI agents struggle to proactively reason about a user's latent needs and offer timely, actionable suggestions. The key to the solution is Omakase, a research assistant that continuously monitors a user's project documents to infer timely queries to a deep research system, and distills the resulting lengthy reports into action suggestions tightly contextualized to the project's evolving state, substantially improving the actionability and usefulness of the suggestions.

Link: https://arxiv.org/abs/2604.08898
Authors: Pao Siangliulue, Jonathan Bragg, Doug Downey, Joseph Chee Chang, Daniel S. Weld
Affiliations: Allen Institute for AI
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:As AI agents become increasingly capable of complex knowledge tasks, the lack of context limits their capability to proactively reason about a user’s latent needs throughout a long evolving project. In scientific research, many researchers still manually query a deep research system and compress their rich project contexts into short, targeted queries. Further, a deep research system produces exhaustive reports, making it difficult to identify concrete actions. To explore the opportunities of research assistants that are proactive throughout a research project, we conducted several studies (N=42) with a technology probe and an iterative prototype. The latest iteration of our system, Omakase, is a research assistant that monitors a user’s project documents to infer timely queries to a deep research system. Omakase then distills long reports into suggestions contextualized to their evolving projects. Our evaluations showed that participants found the generated queries to be useful and timely, and rated Omakase’s suggestions as significantly more actionable than the original reports.

[HC-20] From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence

【Quick Read】: This paper addresses the unexplored question of how the collaborative development paradigm of open-source AI models (OSM) fundamentally differs from that of traditional open-source software (OSS). The key to the solution is large-scale data collection (1,428,792 GitHub and 1,440,527 Hugging Face Hub repositories) combined with statistical, social-network, and content analyses to quantify and characterize differences in collaboration intensity, collaboration openness, and user-innovation patterns, supplemented by semi-structured interviews that identify the underlying socio-technical factors. The study finds that OSM exhibits lower collaboration intensity and lower openness to direct contribution while maintaining relatively open knowledge exchange, and that user innovation leans toward adaptive utilization rather than collaborative improvement — revealing a three-dimensional paradigmatic divergence in open collaboration for AI development and its causes.

Link: https://arxiv.org/abs/2604.08888
Authors: Hengzhi Ye, Minghui Zhou
Affiliations: Peking University
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Comments: Accepted to CSCW26

Click to view abstract

Abstract:AI development is embracing open-source paradigm, but the fundamental distinction between AI models and traditional software artifacts may lead to a divergent open-source development paradigm with different collaborative practices, which remains unexplored. We therefore bridge the knowledge gap by quantifying and characterizing the differences in the collaborative development paradigms of traditional open source software (OSS) and open source AI models (OSM), and investigating the underlying factors that may drive these distinctions. We collect 1,428,792 OSS repositories from GitHub and 1,440,527 OSM repositories from HF Hub, and conduct comprehensive statistical, social network and content analyses to measure and understand the differences in collaboration intensity, collaboration openness, and user innovation across the two development paradigms, complementing these quantitative results with semi-structured interviews. In consequence, we find that compared to OSS development paradigm, the OSM development paradigm exhibits significantly lower collaboration intensity; lower collaboration openness regarding direct contribution while persisting relatively open knowledge exchange; and a divergence toward adaptive utilization user-innovation rather than collaborative improvement. Through semi-structured interviews, we further elucidate the socio-technical factors underlying these differences. These findings reveal the paradigmatic divergence in open source development between traditional OSS and OSM across three critical dimensions of open source collaboration and potential underlying factors, shedding light on how to improve collaborative work techniques and practices within the context of AI development.

[HC-21] AI-Induced Human Responsibility (AIHR) in AI-Human teams

【Quick Read】: This paper examines how people allocate responsibility in human-AI teamwork, where AI increasingly acts as a teammate rather than a standalone tool and morally consequential mistakes arise from joint workflows with ambiguous causality. The core finding is that when a human is paired with AI, participants consistently attribute more responsibility to the human decision maker (on average 10 points higher on a 0-100 scale) — a phenomenon the authors call AI-Induced Human Responsibility (AIHR). The key insight is the mechanism: the shift stems not from bias against AI or self-serving blame-shifting, but from perceiving AI as a constrained implementer, which makes the human the default locus of discretionary responsibility. This explains why AI-human teaming amplifies rather than dilutes human responsibility, with important implications for accountability design in AI-enabled organizations.

Link: https://arxiv.org/abs/2604.08866
Authors: Greg Nyilasy, Brock Bastian, Jennifer Overbeck, Abraham Ryan Ade Putra Hito
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As organizations increasingly deploy AI as a teammate rather than a standalone tool, morally consequential mistakes often arise from joint human-AI workflows in which causality is ambiguous. We ask how people allocate responsibility in these hybrid-agent settings. Across four experiments (N = 1,801) in an AI-assisted lending context (e.g., discriminatory rejection, irresponsible lending, and low-harm filing errors), participants consistently attributed more responsibility to the human decision maker when the human was paired with AI than when paired with another human (by an average of 10 points on a 0-100 scale across studies). This AI-Induced Human Responsibility (AIHR) effect held across high and low harm scenarios and persisted even where self-serving blame-shifting (when the human in question was the self) would be expected. Process evidence indicates that AIHR is explained by inferences of agent autonomy: AI is seen as a constrained implementer, which makes the human the default locus of discretionary responsibility. Alternative mechanisms (mind perception; self-threat) did not account for the effect. These findings extend research on algorithm aversion, hybrid AI-human organizational behavior and responsibility gaps in technology by showing that AI-human teaming can increase (rather than dilute) human responsibility, with implications for accountability design in AI-enabled organizations.

[HC-22] Semantic Zooming and Edge Bundling for Multi-Scale Supply Chain Flow Visualization

【速读】:该论文旨在解决现代供应链网络中空间分布的物流流在传统可视化技术下易产生视觉混乱(visual clutter)的问题,从而难以识别可操作的模式。其解决方案的关键在于提出一个多层次的可视化分析仪表板,融合语义缩放(Semantic Zooming)与基于骨架的边捆绑(Skeleton-Based Edge Bundling, SBEB),通过动态调整视图层级实现多尺度信息呈现:宏观层面采用捆绑聚合流、中观层面使用六边形密度热力图、微观层面则以层次化库存太阳图展示细节。此外,论文还创新性地改进SBEB算法,引入方向扇区聚类和自适应绕行约束机制,确保地理起源-目的地流向的制图合理性,显著提升了复杂供应链数据的可读性和洞察力。

链接: https://arxiv.org/abs/2604.08823
作者: Songmao Li,Kaixuan Qu,Keer Sun,Bhargav Limbasia,Luciano Nocera (University of Southern California)
机构: University of Southern California (南加州大学)
类目: Human-Computer Interaction (cs.HC)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Modern supply chain networks involve spatially distributed flows that become difficult to interpret using traditional visualization techniques, producing visual clutter that obscures actionable patterns. We present a multi-scale visual analytics dashboard that combines Semantic Zooming with Skeleton-Based Edge Bundling (SBEB). The system dynamically adapts its representation based on zoom level: bundled aggregate flows at the macro-scale, hexagonal density heatmaps at the meso-scale, and hierarchical inventory sunbursts at the micro-scale. Built on Vue3 and this http URL, it reduces raw orders to 202 warehouse-to-state flows. We contribute (1) a semantic zoom implementation with animated transitions that unifies edge bundling, hexagonal density aggregation, and hierarchical inventory views into a single interface; and (2) an algorithmic adaptation of SBEB for geographic origin-destination flows, introducing directional-sector clustering and adaptive detour constraints to preserve cartographic plausibility.
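
摘要中提到的中观尺度六边形密度热力图,其核心是将散点按六边形网格聚合计数。下面给出一个最小示意(纯 Python,采用常见的轴向坐标 + 立方坐标舍入法;函数名与参数均为示意性假设,与论文基于 Vue3 的实现无关):

```python
from collections import Counter

def cube_round(q, r):
    """将分数轴向坐标舍入到最近的六边形单元(经立方坐标约束 x+y+z=0)。"""
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)

def point_to_hex(x, y, size):
    """笛卡尔坐标 -> 轴向六边形坐标(尖顶朝上布局)。"""
    q = (3 ** 0.5 / 3 * x - 1.0 / 3 * y) / size
    r = (2.0 / 3 * y) / size
    return cube_round(q, r)

def hex_density(points, size):
    """将原始点聚合为六边形格计数,供中观尺度热力图着色。"""
    return Counter(point_to_hex(x, y, size) for x, y in points)
```

例如,原点附近的两个点会落入同一个 (0, 0) 六边形格,而远处的点落入另一格。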

[HC-23] Smartwatch-Based Sitting Time Estimation in Real-World Office Settings ICML

【速读】:该论文旨在解决现实办公环境中准确估算个体久坐时间的问题,以支持对肥胖、心血管疾病等慢性病风险的有效监测与干预。其解决方案的关键在于提出一种基于智能手表惯性测量单元(Inertial Measurement Unit, IMU)信号的新方法,通过引入由欧拉角导出的旋转矢量序列(rotation vector sequences)作为运动动态的新型表征方式,显著提升了算法在自然环境下的久坐时间估计性能。

链接: https://arxiv.org/abs/2604.08808
作者: Olivia Zhang,Zhilin Zhang
机构: 未知
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
备注: Accepted at the 18th International Conference on Machine Learning and Computing (ICMLC 2026), February 6-9, 2026

点击查看摘要

Abstract:Sedentary behavior poses a major public health risk, being strongly linked to obesity, cardiovascular disease, and other chronic conditions. Accurately estimating sitting time is therefore critical for monitoring and improving individual health. This work addresses the problem in real-world office settings, where signals from the inertial measurement units (IMU) on a smartwatch were collected from office workers during their daily routines. We propose a method that estimates sitting time from the IMU signals by introducing the use of rotation vector sequences, derived from Euler angles, as a novel representation of movement dynamics. Experiments on a 34-hour dataset demonstrate that exploiting rotation vector sequences improves algorithm performance, highlighting their potential for robust sitting time estimation in natural environments.
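
论文以欧拉角导出的旋转矢量序列表征运动动态。"ZYX 欧拉角 → 旋转矩阵 → 旋转矢量(轴 × 角)"的换算可示意如下(纯 Python 最小实现,未处理转角接近 π 的奇异情形;欧拉角顺序等具体约定以论文为准):

```python
import math

def euler_to_matrix(yaw, pitch, roll):
    """ZYX(偏航-俯仰-横滚)欧拉角 -> 3x3 旋转矩阵。"""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ]

def matrix_to_rotvec(R):
    """旋转矩阵 -> 旋转矢量(单位轴乘以转角,弧度)。"""
    trace = R[0][0] + R[1][1] + R[2][2]
    angle = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    if angle < 1e-8:          # 接近恒等旋转
        return (0.0, 0.0, 0.0)
    k = angle / (2.0 * math.sin(angle))
    return (k * (R[2][1] - R[1][2]),
            k * (R[0][2] - R[2][0]),
            k * (R[1][0] - R[0][1]))
```

对 IMU 采样序列逐帧做该换算,即可得到论文所用的旋转矢量序列这类表征。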

[HC-24] TeamLLM: Exploring the Capabilities of LLMs for Multimodal Group Interaction Prediction

【速读】:该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)从多模态传感器数据中预测群体协作模式的问题,特别是在混合现实(Mixed Reality, MR)环境中对团队动态进行实时建模与仿真。其解决方案的关键在于将层级化上下文——包括个体行为特征、群体结构属性和时间活动背景——编码为自然语言形式,并通过三种LLM适应范式(零样本、少样本及监督微调)进行评估。实验表明,微调后的LLM在对话预测任务上达到96%准确率且延迟低于35ms,显著优于LSTM基线(提升3.2倍),证明文本驱动的LLM在捕捉基于语言的行为模式(如轮替发言)方面具有优势;但同时也揭示了其在空间和视觉推理任务(如共享注意力)上的局限性,从而明确了LLM适用于团队动态感知的边界条件与设计启示。

链接: https://arxiv.org/abs/2604.08771
作者: Diana Romero,Xin Gao,Daniel Khalkhali,Salma Elmalaki
机构: 未知
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Predicting group behavior, how individuals coordinate, communicate, and interact during collaborative tasks, is essential for designing systems that can support team performance through real-time prediction and realistic simulation of collaborative scenarios. Large Language Models (LLMs) have shown promise for processing sensor data for human-activity recognition (HAR), yet their capabilities for team dynamics or group-level multimodal sensing remain unexplored. This paper investigates whether LLMs can predict group coordination patterns from multimodal sensor data in collaborative Mixed Reality (MR) environments. We encode hierarchical context – individual behavioral profiles, group structural properties, and temporal activity context – as natural language and evaluate three LLM adaptation paradigms (zero-shot, few-shot, and supervised fine-tuning) against statistical baselines. Our evaluation on 16 groups (64 participants, ~25 hours of sensor data) reveals that LLMs achieve a 3.2× improvement over LSTM baselines for linguistically-grounded behaviors, with fine-tuning reaching 96% accuracy for conversation prediction while maintaining sub-35ms latency. Beyond performance gains, we characterize the boundaries of text-based LLMs for multimodal sensing: conversation prediction succeeds because turn-taking maps to linguistic patterns, while shared or joint attention may require spatial and visual reasoning that text-only LLMs cannot capture. We further identify simulation-mode brittleness (83% degradation from cascading context errors) and minimal few-shot sensitivity to example selection strategy. These findings establish guidelines for when LLMs are appropriate for CPS/IoT sensing for team dynamics and inform the design of future multimodal foundation models.
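
论文将层级化上下文(个体行为画像、群体结构属性、时间活动背景)编码为自然语言后输入 LLM。该编码步骤可用如下示意说明(字段名与措辞均为假设,并非论文的原始提示模板):

```python
def encode_group_context(individuals, group, window):
    """把层级化多模态上下文渲染为自然语言提示:
    个体画像 -> 群体结构 -> 近期活动窗口。"""
    lines = ["Individual profiles:"]
    for name, profile in individuals.items():
        feats = ", ".join(f"{k}={v}" for k, v in profile.items())
        lines.append(f"- {name}: {feats}")
    lines.append(f"Group structure: {group}")
    lines.append("Recent activity: " + "; ".join(window))
    return "\n".join(lines)
```

生成的字符串即可作为零样本/少样本提示的上下文部分,拼接任务指令后送入模型。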

[HC-25] On Semiotic-Grounded Interpretive Evaluation of Generative Art

【速读】:该论文旨在解决当前生成式艺术(Generative Art, GenArt)评估体系过于关注图像表面质量或对提示词的字面匹配,而忽视创作者意图中深层符号性与抽象意义的问题。其核心解决方案是基于皮尔士(Peirce)的计算符号学理论,将人与生成艺术的交互(Human-GenArt Interaction, HGI)建模为级联符号过程(cascaded semiosis),并提出SemJudge评估框架,通过层次化符号图(Hierarchical Semiosis Graph, HSG)显式捕捉图像生成过程中图标型(iconic)、象征型(symbolic)和指示型(indexical)三种意义传递模式,从而突破现有评估方法在象征与指示层面的结构性盲区,实现更贴近人类艺术解读的深度评价。

链接: https://arxiv.org/abs/2604.08641
作者: Ruixiang Jiang,Changwen Chen
机构: The Hong Kong Polytechnic University(香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Interpretation is essential to deciphering the language of art: audiences communicate with artists by recovering meaning from visual artifacts. However, current Generative Art (GenArt) evaluators remain fixated on surface-level image quality or literal prompt adherence, failing to assess the deeper symbolic or abstract meaning intended by the creator. We address this gap by formalizing a Peircean computational semiotic theory that models Human-GenArt Interaction (HGI) as cascaded semiosis. This framework reveals that artistic meaning is conveyed through three modes - iconic, symbolic, and indexical - yet existing evaluators operate heavily within the iconic mode, remaining structurally blind to the latter two. To overcome this structural blindness, we propose SemJudge. This evaluator explicitly assesses symbolic and indexical meaning in HGI via a Hierarchical Semiosis Graph (HSG) that reconstructs the meaning-making process from prompt to generated artifact. Extensive quantitative experiments show that SemJudge aligns more closely with human judgments than prior evaluators on an interpretation-intensive fine-art benchmark. User studies further demonstrate that SemJudge produces deeper, more insightful artistic interpretations, thereby paving the way for GenArt to move beyond the generation of “pretty” images toward a medium capable of expressing complex human experience. Project page: this https URL.

[HC-26] Sustained Impact of Agentic Personalisation in Marketing: A Longitudinal Case Study

【速读】:该论文旨在解决在消费者应用中,如何平衡人工干预与自主学习系统在客户关系管理(CRM)中的作用,以实现可持续的个性化营销效果提升问题。其核心挑战在于:尽管自适应和自主学习系统具备规模化个性化潜力,但缺乏明确证据表明是否需要持续的人类监督来维持性能增益。解决方案的关键在于提出并验证一种“人机协同”模型——即由人类主导初始策略制定与内容优化(主动阶段),随后由基于代理(agentic)的自动化系统接管日常执行与维护(被动阶段)。实证结果显示,虽然人类干预能带来最高的相对参与度提升,但自主代理仍能有效保持正向性能增益,从而证明了该协同模式在长期运营中的可行性与价值。

链接: https://arxiv.org/abs/2604.08621
作者: Olivier Jeunen,Eleanor Hanna,Schaun Wheeler
机构: aampe, Antwerp, Belgium; aampe, Raleigh, NC, USA; aampe, Cary, NC, USA
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: To appear in the 34th ACM International Conference on User Modeling, Adaptation and Personalization (UMAP '26) Industry Track

点击查看摘要

Abstract:In consumer applications, Customer Relationship Management (CRM) has traditionally relied on the manual optimisation of static, rule-based messaging strategies. While adaptive and autonomous learning systems offer the promise of scalable personalisation, it remains unclear to what extent "human-in-the-loop" oversight is required to sustain performance uplift over time. This paper presents a longitudinal case study analysing a real-world consumer application that leverages agentic infrastructure to personalise marketing messaging for a large-scale user base over an 11-month period. We compare two distinct periods: an active phase where marketers directly curated content, audiences, and strategies – followed immediately by a passive phase where agents operated autonomously from a fixed library of components. Our results demonstrate that whilst active human management generates the highest relative lift in engagement metrics, the autonomous agents successfully sustained a positive lift during the passive period. These findings suggest a symbiotic model where human intervention drives strategic initialisation and discovery, yet autonomous agents can ensure the scalable retention and preservation of performance gains.

[HC-27] Scaffolding Human-AI Collaboration: A Field Experiment on Behavioral Protocols and Cognitive Reframing

【速读】:该论文旨在解决生成式 AI (Generative AI) 在组织中部署后生产力提升不均衡的问题,核心在于探究人类如何使用 AI 工具比单纯拥有访问权限更为关键。解决方案的关键在于设计两种“支架干预”(scaffolding interventions):一是行为支架干预(要求成对协作使用 AI 的结构化协议),二是认知支架干预(通过伙伴关系培训将 AI 重新定义为思维伙伴)。结果显示,认知支架干预显著提升了高绩效个体的文档质量,而行为支架干预则导致文档质量和产出量均下降,表明对 AI 使用方式的认知重构比强制性行为规范更能促进高效的人机协同。

链接: https://arxiv.org/abs/2604.08678
作者: Alex Farach,Alexia Cambon,Lev Tankelevitch,Connie Hsueh,Rebecca Janssen
机构: Microsoft Corporation(微软公司)
类目: General Economics (econ.GN); Human-Computer Interaction (cs.HC)
备注: Working paper. 45 pages including appendices

点击查看摘要

Abstract:Organizations have widely deployed generative AI tools, yet productivity gains remain uneven, suggesting that how people use AI matters as much as whether they have access. We conducted a field experiment with 388 employees at a Fortune 500 retailer to test two scaffolding interventions for human-AI collaboration. All participants had access to the same AI tool; we varied only the structure surrounding its use. A behavioral scaffolding intervention (a structured protocol requiring joint AI use within pairs) was associated with lower document quality relative to unstructured use and substantially lower document production. A cognitive scaffolding intervention (partnership training that reframed AI as a thought partner) was associated with higher individual document quality at the top of the distribution. Treatment participants also showed greater positive belief change across the session, though sensitivity analyses suggest this likely reflects recovery from carry-over effects rather than genuine training-induced shifts. Both findings are subject to design limitations including an AM/PM session confound, differential attrition, and LLM grading sensitivity to document length.

[HC-28] Mapping generative AI use in the human brain: divergent neural, academic, and mental health profiles of functional versus socio-emotional AI use

【速读】:该论文旨在解决生成式人工智能对话代理(AICAs)在大学生群体中的广泛使用所构成的新型认知社会环境对其大脑发育和心理健康影响尚不明确的问题。解决方案的关键在于结合问卷调查与高分辨率结构磁共振成像(structural MRI),系统分析了不同类型的AICA使用模式(一般性、功能性与社会情感性)与学业表现、心理健康及脑结构特征之间的关联。研究发现,不同使用动机对应不同的神经机制:高频的一般性和功能性使用与更好的学业成绩、前额叶皮层和海马网络效率增强相关;而高频的社会情感性使用则与抑郁、社交焦虑等心理问题加剧以及颞上回和杏仁核体积减小相关。这表明AICA的效应具有使用模式依赖性,揭示了其对认知功能和情绪调节系统的差异化影响,为设计既能发挥教育优势又能规避心理风险的AI使用环境提供了关键依据。

链接: https://arxiv.org/abs/2604.08594
作者: Junjie Wang,Xianyang Gan,Dan Liu,Jingxian He,Stefania Ferraro,Keith M. Kendrick,Weihua Zhao,Shuxia Yao,Christian Montag,Benjamin Becker
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 45 pages, 20 figures, 5 tables

点击查看摘要

Abstract:The widespread adoption of generative artificial intelligence conversational agents (AICAs) among university students constitutes a novel cognitive social environment whose impact on the maturing brain remains elusive. Combining surveys with high-resolution structural MRI, we examined patterns of general, functional, and socio-emotional AICA use, academic performance, mental health, and brain structural signatures in a comparatively large sample of 222 young individuals. Across computational anatomy, meta-analytic network-level, and behavioral decoding analyses, we observed use-specific associations. Higher general and functional AICA use frequencies were linked to better academic outcomes (GPA), larger dorsolateral prefrontal and calcarine gray matter volume, and enhanced hippocampal network clustering and local efficiency. In contrast, more frequent socio-emotional AICA use was associated with poorer mental health (depression, social anxiety) and lower volume of superior temporal and amygdalar regions central to social and affective processing. These findings indicate that the same class of AI tools exerts distinct effects depending on usage patterns and motivations, engaging prefrontal-hippocampal systems that support cognition versus socio-emotional systems that may track distress-linked usage. These heterogeneities are crucial for designing environments that harness the educational benefits of AI while mitigating mental health risks.

计算机视觉

[CV-0] Tango: Taming Visual Signals for Efficient Video Large Language Models

【速读】:该论文旨在解决现有视频大语言模型(Video LLMs)中基于注意力机制的token剪枝方法和基于相似性的聚类方法所存在的两个关键问题:一是传统top-k选择策略未能充分考虑空间多模态且长尾分布的注意力分布;二是直接基于相似性的聚类容易产生碎片化簇,导致池化后的表征失真。解决方案的关键在于提出Tango框架,其核心创新包括:引入多样性驱动策略以优化注意力引导的token选择,以及设计时空旋转向量位置编码(Spatio-temporal Rotary Position Embedding, ST-RoPE)以通过局部性先验保留几何结构,从而显著提升视觉信号利用效率与模型性能。

链接: https://arxiv.org/abs/2604.09547
作者: Shukang Yin,Sirui Zhao,Hanchao Wang,Baozhi Jia,Xianquan Wang,Chaoyou Fu,Enhong Chen
机构: University of Science and Technology of China (中国科学技术大学); Reconova AI Lab (Reconova人工智能实验室); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL

点击查看摘要

Abstract:Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88x inference speedup.
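
摘要指出传统 top-k 选择未考虑注意力分布的多峰与长尾特性。"多样性驱动的 token 选择"这一思路可以用 MMR 风格的贪心算法示意:在注意力得分与"和已选 token 的冗余度"之间折中(以下为概念演示,λ 取值、余弦相似度量等均为假设,并非 Tango 的原始算法):

```python
def cosine(u, v):
    """两个向量的余弦相似度(零向量时返回 0)。"""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def diverse_topk(tokens, attn, k, lam=0.5):
    """贪心的多样性感知选择:每步取
    lam * 注意力得分 - (1 - lam) * 与已选集合的最大相似度  最大的 token。"""
    selected = []
    candidates = list(range(len(tokens)))
    while candidates and len(selected) < k:
        def gain(i):
            redundancy = max((cosine(tokens[i], tokens[j]) for j in selected),
                             default=0.0)
            return lam * attn[i] - (1 - lam) * redundancy
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected
```

当两个高注意力 token 几乎重复时,普通 top-k 会同时保留二者,而该策略会转而保留一个方向不同的 token。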

[CV-1] EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

【速读】:该论文旨在解决当前视觉语言模型(VLM)在具身智能任务中因训练数据缺乏精确的人类动作标签、思维链(CoT)和空间标注而导致的噪声问题,这些问题在长时程空间指令遵循任务中被放大,进而引发对象幻觉、步骤跳过或对现实物理属性的忽视。解决方案的关键在于提出EgoTL——一个面向第一人称视角数据的“边说边做”采集管道,通过词级时间戳记录分步目标与口语化推理,结合度量尺度的空间估计器校准物理属性,利用记忆库回放提供场景上下文,并以片段级标签明确导航指令与精细操作动作,从而构建高质量、多维度的基准数据集,支撑长时程生成与推理能力的评估与优化。

链接: https://arxiv.org/abs/2604.09535
作者: Lulin Liu,Dayou Li,Yiqing Liang,Sicong Jiang,Hitesh Vijay,Hezhen Hu,Xuhai Xu,Zirui Liu,Srinivas Shakkottai,Manling Li,Zhiwen Fan
机构: UMN; TAMU; Brown University (布朗大学); McGill University (麦吉尔大学); UT Austin (德克萨斯大学奥斯汀分校); Columbia University (哥伦比亚大学); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.

[CV-2] Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在标签噪声环境下提示学习(Prompt Learning)鲁棒性不足的问题。现有提示学习方法容易受到噪声标签的干扰,导致提示词表示失真、训练不稳定以及对误标样本的过拟合。其解决方案的关键在于提出VisPrompt框架,通过引入跨模态注意力机制,将图像语义信息反向注入提示表示中,使提示token能够选择性聚合与当前样本相关的视觉特征,从而锚定提示学习于更稳定的实例级视觉证据上,降低噪声监督的影响;同时设计了一种轻量级条件调制机制,根据样本视觉线索的质量自适应调节视觉信息注入强度,平衡文本侧语义先验与图像侧实例证据之间的关系,有效抑制噪声扰动、减少提示更新不稳定性,并缓解误标样本的记忆效应。

链接: https://arxiv.org/abs/2604.09532
作者: Zibin Geng,Xuefeng Jiang,Jia Li,Zheng Li,Tian Wen,Lvhua Wu,Sheng Sun,Yuwei Wang,Min Liu
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); PCALab, VCIP, College of Computer Science, Nankai University (南开大学计算机学院PCALab, VCIP); Beijing (北京)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is less investigated. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of injecting visual information for all samples, despite differences in the quality of their visual cues, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small number of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at this https URL.
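
"以跨模态注意力把视觉特征注入提示表示、并用门控标量调节注入强度"这一机制,可用下面的玩具化单头注意力示意(纯 Python,省略了可学习投影与层归一化;仅为概念演示,并非 VisPrompt 的官方实现):

```python
import math

def softmax(xs):
    """数值稳定的 softmax。"""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def inject_visual(prompts, visual, gate):
    """每个提示 token 作为 query 对视觉特征做缩放点积注意力,
    聚合出的视觉上下文按样本级门控 gate 加权后叠加回提示表示。"""
    d = len(prompts[0])
    out = []
    for p in prompts:
        scores = softmax([sum(a * b for a, b in zip(p, v)) / math.sqrt(d)
                          for v in visual])
        context = [sum(w * v[i] for w, v in zip(scores, visual))
                   for i in range(d)]
        out.append([pi + gate * ci for pi, ci in zip(p, context)])
    return out
```

门控取 0 时提示表示保持不变(退化为纯文本提示),取值越大注入的实例级视觉证据越强,对应论文中按视觉线索质量自适应调节注入强度的思路。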

[CV-3] Envisioning the Future One Step at a Time CVPR2026

【速读】:该论文旨在解决复杂场景未来动态预测中长期多模态运动建模与高效探索之间的矛盾问题,即现有方法依赖密集视频或潜在空间预测,导致计算资源浪费在冗余外观信息上,难以实现大规模未来假设的低成本探索和长时程物理合理性保障。其解决方案的关键在于将开放集未来场景动态预测建模为稀疏点轨迹的逐步推理过程,采用自回归扩散模型通过短时局部可预测的转移步进推进轨迹,并显式建模不确定性随时间增长的过程;该以动态为中心的表示方式支持从单张图像快速生成数千种多样化且物理合理的未来轨迹,同时可通过初始运动约束进行引导,显著提升预测效率与可扩展性。

链接: https://arxiv.org/abs/2604.09527
作者: Stefan Andreas Baumann,Jannik Wiese,Tommaso Martorella,Mahdi M. Kalayeh,Björn Ommer
机构: CompVis @ LMU Munich (CompVis @ 慕尼黑大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Netflix (Netflix)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2026. For code and models, see this http URL

点击查看摘要

Abstract:Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: this http URL.
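
论文把未来预测建模为"短步局部可预测的转移逐步推进、不确定性随时间增长"的过程。这一思想可用一个噪声尺度随步数放大的二维随机游走示意"从单一起点采样多条未来轨迹"(纯属概念演示,与论文的自回归扩散模型无关;步长噪声、增长系数等参数均为假设):

```python
import random

def rollout_futures(start, steps, n_futures, base_sigma=0.05,
                    growth=1.1, seed=0):
    """对一个二维点采样多条可能的未来轨迹:
    每步叠加高斯噪声,噪声尺度随时间按 growth 指数放大,
    体现不确定性沿预测时程增长。"""
    rng = random.Random(seed)
    futures = []
    for _ in range(n_futures):
        x, y = start
        traj = [(x, y)]
        sigma = base_sigma
        for _ in range(steps):
            x += rng.gauss(0.0, sigma)
            y += rng.gauss(0.0, sigma)
            traj.append((x, y))
            sigma *= growth   # 不确定性随预测步数增长
        futures.append(traj)
    return futures
```

由于逐条轨迹只是轻量的逐步采样,可以低成本地并行展开成百上千条未来假设,这正是稀疏轨迹表示相对密集视频预测的效率来源。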

[CV-4] RIRF: Reasoning Image Restoration Framework

【速读】:该论文旨在解决通用图像恢复(Universal Image Restoration, UIR)中现有方法因缺乏对退化类型、严重程度及场景语义的显式诊断推理而导致恢复质量受限的问题。其解决方案的关键在于提出一种名为“Reason and Restore”(R&R)的新框架,该框架通过引入基于Qwen3-VL微调的显式推理模块(reasoner),实现对退化成分的结构化Chain-of-Thought(CoT)推理,从而生成可解释的细粒度诊断先验信息;同时,将推理模块输出的退化严重程度量化结果作为强化学习(Reinforcement Learning, RL)信号,用于引导和增强像素级恢复器的性能,实现了语义诊断与像素恢复的紧密耦合,显著提升了恢复质量并增强了过程的可解释性。

链接: https://arxiv.org/abs/2604.09511
作者: Wending Yan,Rongkai Zhang,Kaihua Tang,Yu Cheng,Qiankun Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Universal image restoration (UIR) aims to recover clean images from diverse and unknown degradations using a unified model. Existing UIR methods primarily focus on pixel reconstruction and often lack explicit diagnostic reasoning over degradation composition, severity, and scene semantics prior to restoration. We propose Reason and Restore (R&R), a novel framework that integrates structured Chain-of-Thought (CoT) reasoning into the image restoration pipeline. R&R introduces an explicit reasoner, implemented by fine-tuning Qwen3-VL, to diagnose degradation types, quantify degradation severity, infer key degradation-related factors, and describe relevant scene and object semantics. The resulting structured reasoning provides interpretable and fine-grained diagnostic priors for the restorer. To further improve restoration quality, the quantified degradation severity produced by the reasoner is leveraged as reinforcement learning (RL) signals to guide and strengthen the restorer. Unlike existing multimodal LLM-based agentic systems that decouple reasoning from low-level vision tasks, R&R tightly couples semantic diagnostic reasoning with pixel-level restoration in a unified framework. Extensive experiments across diverse UIR benchmarks demonstrate that R&R achieves state-of-the-art performance while offering unique interpretability into the restoration process.

[CV-5] VISOR: Agent ic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

【速读】:该论文旨在解决视觉检索增强生成(Visual Retrieval-Augmented Generation, VRAG)系统在处理多步推理任务时面临的两大瓶颈问题:一是视觉证据稀疏性(Visual Evidence Sparsity),即关键视觉证据分散于文档多页且常被孤立处理,阻碍跨页推理;同时细粒度图像内证据需要精准的视觉操作,错误操作会降低检索质量。二是长时程搜索漂移(Search Drift in Long Horizons),即随着检索页面增多,视觉token累积导致上下文稀释和认知过载,使代理偏离原始搜索目标。

解决方案的关键在于提出一个统一的单智能体框架VISOR(Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning),其核心创新包括:(1) 构建结构化的**证据空间(Evidence Space)**以支持渐进式跨页推理;(2) 引入**视觉动作评估与修正机制(Visual Action Evaluation and Correction mechanism)**来规范细粒度视觉操作;(3) 设计带滑动窗口和意图注入的**动态轨迹(Dynamic Trajectory)**以缓解长期搜索漂移,通过锚定证据空间并丢弃早期原始交互,防止上下文被视觉token淹没。此外,采用基于组相对策略优化(GRPO-based RL)的强化学习训练流程,结合状态掩码与信用分配机制,适配动态上下文重构需求,从而显著提升长时程视觉推理任务的性能与效率。

链接: https://arxiv.org/abs/2604.09508
作者: Yucheng Shen,Jiulong Wu,Jizhou Huang,Dawei Yin,Lingyong Yan,Min Cao
机构: Soochow University (苏州大学); Baidu Inc. (百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval… However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. They anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.
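
"滑动窗口 + 意图注入的动态轨迹"这一思路可用如下最小示意说明:蒸馏后的证据长期保留并锚定,原始页面交互按固定窗口先进先出淘汰,每轮重新注入搜索意图以抑制漂移(类名、字段与上下文格式均为示意性假设,并非 VISOR 的官方实现):

```python
from collections import deque

class DynamicTrajectory:
    """维护智能体上下文:证据空间永不淘汰,
    原始交互只保留最近 window 条,每次构建上下文时重新注入意图。"""

    def __init__(self, intent, window=3):
        self.intent = intent
        self.evidence = []                   # 蒸馏后的证据,长期保留
        self.recent = deque(maxlen=window)   # 原始交互,FIFO 淘汰

    def add_interaction(self, raw, distilled=None):
        """记录一次检索交互;若产出蒸馏证据则写入证据空间。"""
        self.recent.append(raw)
        if distilled:
            self.evidence.append(distilled)

    def build_context(self):
        """重构送入模型的上下文:意图 -> 证据 -> 近期交互。"""
        return ([f"INTENT: {self.intent}"]
                + [f"EVIDENCE: {e}" for e in self.evidence]
                + [f"RECENT: {r}" for r in self.recent])
```

这样即使检索了大量页面,上下文规模也被窗口限制,而关键证据与搜索目标不会随早期交互一起丢失。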

[CV-6] Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

【速读】:该论文旨在解决三维重建(3D Reconstruction)中因场景变化导致的不一致性问题,尤其是在新环境中使用预训练冻结的几何基础模型时难以适应的问题。解决方案的关键在于提出 Online3R 框架,通过引入一组可学习的轻量级视觉提示(visual prompts),使模型能够在不破坏基础模型几何预测能力的前提下,捕获新环境的知识;同时,为解决测试阶段缺乏真实标签且需高效更新的问题,设计了一种局部-全局自监督学习策略:局部一致性约束利用中间融合结果生成高质量伪标签信号,全局一致性约束则在稀疏关键帧上施加长距离轨迹一致性,从而实现高效且稳定的在线学习。

链接: https://arxiv.org/abs/2604.09480
作者: Shunkai Zhou,Zike Yan,Fei Xue,Dong Wu,Yuchen Deng,Hongbin Zha
机构: Peking University (北京大学); State Key Laboratory of General Artificial Intelligence; The Chinese University of Hong Kong (香港中文大学); NVIDIA; Southwest University (西南大学); Anqing Normal University (安庆师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks. Project page: this https URL

[CV-7] Incremental Semantics-Aided Meshing from LiDAR-Inertial Odometry and RGB Direct Label Transfer

【速读】:该论文旨在解决在大型复杂室内环境(如文化建筑)中,基于LiDAR与惯性传感器扫描数据进行高保真几何网格重建时所面临的挑战,包括点云稀疏性导致的孔洞、几何漂移引起的过平滑以及固定融合参数造成的结构边界处伪表面等问题。解决方案的关键在于提出了一种模块化、增量式的RGB+LiDAR融合流程:通过视觉基础模型对每个RGB帧进行语义标注,并将标签以帧为单位增量式地投影和融合至LiDAR-惯性里程计地图上;随后利用语义感知的截断有符号距离函数(Truncated Signed Distance Function, TSDF)融合策略生成最终网格。该方法在保留LiDAR几何精度的同时,借助丰富的视觉语义信息缓解因点云稀疏性和几何漂移带来的重建模糊问题,从而显著提升网格质量。

链接: https://arxiv.org/abs/2604.09478
作者: Muhammad Affan,Ville Lehtola,George Vosselman
机构: University of Twente (特温特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 5 figures, 2 tables. Accepted in ISPRS Archives 2026

点击查看摘要

Abstract:Geometric high-fidelity mesh reconstruction from LiDAR-inertial scans remains challenging in large, complex indoor environments – such as cultural buildings – where point cloud sparsity, geometric drift, and fixed fusion parameters produce holes, over-smoothing, and spurious surfaces at structural boundaries. We propose a modular, incremental RGB+LiDAR pipeline that generates incremental semantics-aided high-quality meshes from indoor scans through scan frame-based direct label transfer. A vision foundation model labels each incoming RGB frame; labels are incrementally projected and fused onto a LiDAR-inertial odometry map; and an incremental semantics-aware Truncated Signed Distance Function (TSDF) fusion step produces the final mesh via marching cubes. This frame-level fusion strategy preserves the geometric fidelity of LiDAR while leveraging rich visual semantics to resolve geometric ambiguities at reconstruction boundaries caused by LiDAR point-cloud sparsity and geometric drift. We demonstrate that semantic guidance improves geometric reconstruction quality; quantitative evaluation is therefore performed using geometric metrics on the Oxford Spires dataset, while results from the NTU VIRAL dataset are analyzed qualitatively. The proposed method outperforms state-of-the-art geometric baselines ImMesh and Voxblox, demonstrating the benefit of semantics-aided fusion for geometric mesh quality. The resulting semantically labelled meshes are of value when reconstructing Universal Scene Description (USD) assets, offering a path from indoor LiDAR scanning to XR and digital modeling.
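
语义感知 TSDF 融合的核心步骤是:对每个体素做截断有符号距离的加权滑动平均,同时累计语义标签票数、以多数票决定体素类别。下面是一个单体素版本的最小示意(截断距离与权重策略均为简化假设,并非论文的完整实现):

```python
class SemanticVoxel:
    """TSDF 加权滑动平均更新 + 逐体素语义标签直方图投票。"""

    def __init__(self):
        self.tsdf = 0.0     # 截断有符号距离的加权均值
        self.weight = 0.0   # 累计观测权重
        self.labels = {}    # 标签 -> 累计票数(按权重)

    def integrate(self, sdf, w, label, trunc=0.1):
        """融合一次新观测:先截断有符号距离,再做加权滑动平均。"""
        d = max(-trunc, min(trunc, sdf))
        self.tsdf = (self.tsdf * self.weight + d * w) / (self.weight + w)
        self.weight += w
        self.labels[label] = self.labels.get(label, 0) + w

    def label(self):
        """多数票(按权重)决定该体素的语义标签。"""
        return max(self.labels, key=self.labels.get)
```

实际系统在体素网格上对每帧深度/标签观测执行上述更新,最后经 marching cubes 抽取带语义标签的网格。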

[CV-8] Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement CVPR2025

【速读】:该论文旨在解决如何构建全沉浸式音视频交互体验的问题,特别是通过从真实世界捕获的视频直接生成支持大范围6自由度(6-DoF)交互的沉浸式体素视频(Immersive Volumetric Videos, IVV)。现有方法多依赖计算机生成内容,而对真实场景的高保真、多模态、动态重建仍缺乏系统性解决方案。其关键创新在于提出了一种基于空间导向采集理念的多视角、多模态数据集 ImViD,结合自研捕获装置实现运动中的同步多视图音视频采集;并开发了一个统一的动态光场重建框架,采用高斯基函数的时空表示结构,引入流引导稀疏初始化、联合相机时间校准及多目标时空监督机制,实现了复杂动态场景下高质量、时域稳定的音频-视觉一体化建模,首次实现了从多视角音视频数据中重建声场的方法,从而完整支撑IVV的生产流程。

链接: https://arxiv.org/abs/2604.09473
作者: Zhengxian Yang,Shengqi Wang,Shi Pan,Hongshuai Li,Haoxiang Wang,Lin Li,Guanjun Li,Zhengqi Wen,Borong Lin,Jianhua Tao,Tao Yu
机构: Tsinghua University (清华大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Migu Beijing Research Institute (咪咕北京研究院); School of Architecture, Tsinghua University (清华大学建筑学院); Department of Automation, Tsinghua University (清华大学自动化系); BNRist, Tsinghua University (清华大学北京信息科学与技术国家研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal extension of CVPR 2025. See also arXiv:2503.14359 . Project page and code: this https URL

点击查看摘要

Abstract:Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos, a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground–background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.

[CV-9] AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization

【速读】:该论文旨在解决在资源受限的边缘设备(如智能眼镜)上实现高精度且实时的视觉定位问题,同时降低计算开销以提升能效。其核心挑战在于如何在不牺牲定位精度的前提下,显著减少模型参数与推理复杂度。解决方案的关键是提出了一种不对称视觉定位(Asymmetric Visual Localization)框架,其中大型教师模型(Teacher)离线处理预构建的地图图像,轻量级学生模型(Student)在线处理查询图像;通过引入一种新颖的蒸馏框架 AsymLoc,结合几何驱动的匹配目标与检测器-描述符联合蒸馏目标,使学生模型能够直接进行快速、无参数的最近邻匹配,从而在仅使用约十分之一模型规模的情况下,达到教师模型95%的定位精度,显著优于现有方法,确立了新的效率-精度权衡基准。

链接: https://arxiv.org/abs/2604.09445
作者: Mohammad Omama,Gabriele Berton,Eric Foxlin,Yelin Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation are primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers. We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching. Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher's localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.
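
The "parameter-less nearest-neighbor matching" that distillation enables here can be sketched as mutual nearest-neighbor matching of L2-normalized descriptors. A toy illustration (not the paper's code): if the Student's descriptors stay close to the Teacher's, plain mutual NN recovers the correct correspondences without any learned matcher:

```python
import numpy as np

def mutual_nn_match(desc_a, desc_b):
    """Parameter-free mutual nearest-neighbor matching of L2-normalized descriptors."""
    sim = desc_a @ desc_b.T                # cosine similarity matrix
    nn_ab = sim.argmax(axis=1)             # best match in B for each A descriptor
    nn_ba = sim.argmax(axis=0)             # best match in A for each B descriptor
    # keep only pairs that agree in both directions
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

rng = np.random.default_rng(1)
teacher = rng.standard_normal((5, 8))
teacher /= np.linalg.norm(teacher, axis=1, keepdims=True)
student = teacher + 0.01 * rng.standard_normal((5, 8))   # well-aligned student: tiny perturbation
student /= np.linalg.norm(student, axis=1, keepdims=True)
matches = mutual_nn_match(student, teacher)
assert matches == [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```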

[CV-10] SCoRe: Clean Image Generation from Diffusion Models Trained on Noisy Images IJCNN2026

【速读】:该论文旨在解决扩散模型(Diffusion Models)在训练数据包含噪声时,生成图像中会复现高频训练伪影(high-frequency training artifacts)的问题,从而显著降低生成质量。解决方案的关键在于提出一种无需重新训练的生成阶段频域再生方法——SCoRe(Spectral Cutoff Regeneration),其核心机制是利用扩散模型的频谱偏差(spectral bias)特性:模型从低频信息中推断高频细节。SCoRe通过频域截断(frequency cutoff)抑制生成图像中的受损高频成分,并借助SDEdit进行高频区域的再生;尤为关键的是,作者基于径向平均功率谱密度(Radially Averaged Power Spectral Density, RAPSD)推导出截断频率与SDEdit初始时间步之间的理论映射关系,从而有效避免再生过程中过度引入噪声,实现更接近干净图像分布的高质量生成。

链接: https://arxiv.org/abs/2604.09436
作者: Yuta Matsuzaki,Seiichi Uchida,Shumpei Takezaki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IJCNN2026

点击查看摘要

Abstract:Diffusion models trained on noisy datasets often reproduce high-frequency training artifacts, significantly degrading generation quality. To address this, we propose SCoRe (Spectral Cutoff Regeneration), a training-free, generation-time spectral regeneration method for clean image generation from diffusion models trained on noisy images. Leveraging the spectral bias of diffusion models, which infer high-frequency details from low-frequency cues, SCoRe suppresses corrupted high-frequency components of a generated image via a frequency cutoff and regenerates them via SDEdit. Crucially, we derive a theoretical mapping between the cutoff frequency and the SDEdit initialization timestep based on Radially Averaged Power Spectral Density (RAPSD), which prevents excessive noise injection during regeneration. Experiments on synthetic (CIFAR-10) and real-world (SIDD) noisy datasets demonstrate that SCoRe substantially outperforms post-processing and noise-robust baselines, restoring samples closer to clean image distributions without any retraining or fine-tuning.
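
SCoRe's cutoff-to-timestep mapping is built on the Radially Averaged Power Spectral Density. As a rough illustration of the underlying quantity (a generic sketch, not the authors' code), the RAPSD of a noisy image shows the inflated high-frequency tail that the method suppresses:

```python
import numpy as np

def rapsd(img):
    """Radially Averaged Power Spectral Density of a square grayscale image."""
    psd = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.sqrt((y - h // 2) ** 2 + (x - w // 2) ** 2).astype(int)
    nbins = r.max() + 1
    # average power over each integer-radius frequency ring
    sums = np.bincount(r.ravel(), weights=psd.ravel(), minlength=nbins)
    counts = np.bincount(r.ravel(), minlength=nbins)
    return sums / np.maximum(counts, 1)

rng = np.random.default_rng(0)
clean = np.outer(np.sin(np.linspace(0, 4 * np.pi, 64)), np.ones(64))  # smooth low-frequency image
noisy = clean + 0.5 * rng.standard_normal((64, 64))                   # broadband noise added
# noise lifts the high-frequency tail of the spectrum -- the signature SCoRe's cutoff targets
assert rapsd(noisy)[20:].mean() > rapsd(clean)[20:].mean()
```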

[CV-11] Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

【速读】:该论文旨在解决计算机视觉与图形学中相机参数恢复与场景渲染长期被视作独立任务所带来的局限性,尤其在图像覆盖稀疏或相机位姿存在歧义时,单一任务难以依赖自身完成高质量重建。其核心解决方案是提出“Rays as Pixels”(Raxels),即通过将相机轨迹表示为密集的射线像素(raxels),并利用解耦自-交叉注意力机制(Decoupled Self-Cross Attention)联合去噪视频帧与相机轨迹,从而学习视频与相机路径的联合分布。该方法使单个模型能够统一处理三类任务:从视频中预测相机轨迹、从输入图像联合生成视频与相机轨迹、以及沿指定相机轨迹生成视频。关键创新在于通过闭环自一致性测试验证了模型正向与逆向预测的一致性,且轨迹预测所需去噪步骤远少于视频生成,展现出高效且鲁棒的多任务协同能力。

链接: https://arxiv.org/abs/2604.09429
作者: Wonbong Jang,Shikun Liu,Soubhik Sanyal,Juan Camilo Perez,Kam Woh Ng,Sanskar Agrawal,Juan-Manuel Perez-Rua,Yiannis Douratsos,Tao Xiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 6 figures, 4 tables. Project page: this https URL

点击查看摘要

Abstract:Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation; even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.
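
Representing a camera as dense per-pixel rays is a standard construction; a minimal sketch (toy intrinsics, not the paper's exact raxel parameterization) of building such a ray map from a pinhole camera:

```python
import numpy as np

def ray_map(K, R, t, h, w):
    """Per-pixel ray origins and directions ('raxels') for a pinhole camera.
    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation."""
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)
    dirs = (pix @ np.linalg.inv(K).T) @ R          # back-project, then rotate into the world frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = -R.T @ t                              # camera center in world coordinates
    return origin, dirs

K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
origin, dirs = ray_map(K, np.eye(3), np.zeros(3), 64, 64)
assert dirs.shape == (64, 64, 3)
assert np.allclose(dirs[32, 32], [0, 0, 1])        # the principal-point ray is the optical axis
```

Stacking such ray maps alongside RGB frames is what lets a single diffusion model denoise cameras and pixels in one representation.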

[CV-12] Do Vision Language Models Need to Process Image Tokens? CVPR2026

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)中图像标记(image tokens)在深层Transformer架构中的必要性与功能演化问题,即是否必须持续处理图像标记以维持性能,以及视觉表征是否在深度方向上发生有意义的演变。其解决方案的关键在于系统性地分析VLM中图像和文本表征在不同层间的动态变化:研究发现,视觉表征在早期层即快速收敛至一个低复杂度稳定状态(熵稳定、内在维度压缩、轨迹曲率趋近恒定),且各层间可互换,表明深层对视觉信息的进一步变换有限;相较之下,文本表征则持续重构。此外,深度截断实验揭示任务依赖性——单标记预测对视觉深度削减鲁棒,而多标记生成需持续访问视觉表示;确定性解码下,减少视觉深度更显著扰动中间推理路径而非最终输出,说明图像标记主要影响推理结构而非结论本身。这些结果挑战了当前多模态大语言模型架构中“更深视觉处理必然更好”的假设。

链接: https://arxiv.org/abs/2604.09425
作者: Sambit Ghosh,R. Venkatesh Babu,Chirag Agarwal
机构: IBM(国际商业机器公司); Indian Institute of Science (印度科学研究所); University of Virginia (弗吉尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted (Oral) at TRUE-V Workshop CVPR 2026

点击查看摘要

Abstract:Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational overhead), it remains fundamentally unclear whether sustained image-token processing is necessary for their performance or visual representations meaningfully evolve from early to later layers. In this work, we systematically investigate the functional role of image tokens in VLMs and show that visual representations rapidly converge to a bounded-complexity regime, i.e., their entropy stabilizes, intrinsic dimensionality compresses, and trajectory curvature approaches a near-constant profile. In contrast, textual representations continue to undergo substantial restructuring across depth. Once stabilized, visual representations become largely interchangeable between layers, indicating limited additional transformation in deeper stages. Further, depth-wise visual truncation reveals that the necessity of visual processing is task-dependent, where single-token predictions remain comparatively robust to truncated visual depth, but multi-token generation requires sustained access to visual representations. Under deterministic decoding, reducing visual depth perturbs intermediate reasoning trajectories more strongly than final outputs, suggesting that image tokens influence the structure of reasoning more than the ultimate conclusions. Collectively, these findings question the assumption that deeper visual processing is uniformly essential in VLMs, challenging the current paradigm of multimodal LLM architectures.

[CV-13] PhysInOne: Visual Physics Learning and Reasoning in One Suite CVPR2026

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 系统在训练过程中普遍面临的物理基础数据稀缺问题,尤其是缺乏大规模、多样化且具有精确物理标注的视频数据。为应对这一挑战,作者提出 PhysInOne——一个包含 200 万段视频、覆盖 153,810 个动态 3D 场景和 71 种基本物理现象(涵盖力学、光学、流体动力学与磁学)的大规模合成数据集。其关键创新在于:不仅规模远超以往同类工作(数量级提升),还提供了多模态、高精度的地面真实标注,包括 3D 几何结构、语义信息、动态运动轨迹、物理属性及文本描述,从而支持物理感知的视频生成、未来帧预测、物理属性估计与运动迁移等前沿任务。实验表明,基于 PhysInOne 微调的基础模型显著提升了物理合理性,同时揭示了现有模型在复杂物理动态建模与内在属性估计方面的局限性。

链接: https://arxiv.org/abs/2604.09415
作者: Siyuan Zhou,Hejun Wang,Hu Cheng,Jinxi Li,Dongsheng Wang,Junwei Jiang,Yixiao Jin,Jiayue Huang,Shiwei Mao,Shangjia Liu,Yafei Yang,Hongkang Song,Shenxing Wei,Zihui Zhang,Peng Huang,Shijie Liu,Zhengli Hao,Hao Li,Yitian Li,Wenqi Zhou,Zhihan Zhao,Zongqi He,Hongtao Wen,Shouwang Huang,Peng Yun,Bowen Cheng,Pok Kazaf Fu,Wai Kit Lai,Jiahao Chen,Kaiyuan Wang,Zhixuan Sun,Ziqi Li,Haochen Hu,Di Zhang,Chun Ho Yuen,Bing Wang,Zhihua Wang,Chuhang Zou,Bo Yang
机构: vLAR Group; The Hong Kong Polytechnic University; Syai Singapore; Meta
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: CVPR 2026. Siyuan, Hejun, Hu, Jinxi, Dongsheng, Junwei, Yixiao, Jiayue, and Shiwei are co-first authors. Project page: this https URL

点击查看摘要

Abstract:We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multi-object interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.

[CV-14] SynFlow: Scaling Up LiDAR Scene Flow Estimation with Synthetic Data

【速读】:该论文旨在解决3D动态感知中因真实场景中密集高质量运动标注稀缺而导致模型难以泛化的问题,特别是现有方法在无监督学习中仅依赖未标注真实数据时无法有效缩小性能差距,主要受限于噪声代理信号(noisy proxy signals)。其解决方案的关键在于提出一种范式转变:完全从可扩展的仿真中学习鲁棒的真实世界运动先验。核心创新是构建SynFlow数据生成流水线,通过以运动为导向的策略合成4,000段序列(约94万帧)的多样化运动模式,形成大规模合成LiDAR场景流数据集SynFlow-4k,相较现有真实基准提升34倍标注规模。实验证明该数据集提供了高度域不变的运动先验,在零样本场景下训练的模型能跨多个真实基准实现媲美甚至超越监督基线的表现,且仅用5%真实标签微调即可优于全量真实数据训练模型,显著提升了标签效率与泛化能力。

链接: https://arxiv.org/abs/2604.09411
作者: Qingwen Zhang,Xiaomeng Zhu,Chenhan Jiang,Patric Jensfelt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable 3D dynamic perception requires models that can anticipate motion beyond predefined categories, yet progress is hindered by the scarcity of dense, high-quality motion annotations. While self-supervision on unlabeled real data offers a path forward, empirical evidence suggests that scaling unlabeled data fails to close the performance gap due to noisy proxy signals. In this paper, we propose a shift in paradigm: learning robust real-world motion priors entirely from scalable simulation. We introduce SynFlow, a data generation pipeline that generates a large-scale synthetic dataset specifically designed for LiDAR scene flow. Unlike prior works that prioritize sensor-specific realism, SynFlow employs a motion-oriented strategy to synthesize diverse kinematic patterns across 4,000 sequences ( \sim 940k frames), termed SynFlow-4k. This represents a 34x scale-up in annotated volume over existing real-world benchmarks. Our experiments demonstrate that SynFlow-4k provides a highly domain-invariant motion prior. In a zero-shot regime, models trained exclusively on our synthetic data generalize across multiple real-world benchmarks, rivaling in-domain supervised baselines on nuScenes and outperforming state-of-the-art methods on TruckScenes by 31.8%. Furthermore, SynFlow-4k serves as a label-efficient foundation: fine-tuning with only 5% of real-world labels surpasses models trained from scratch on the full available budget. We open-source the pipeline and dataset to facilitate research in generalizable 3D motion estimation. More detail can be found at this https URL.
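
Scene-flow benchmarks like those cited here are typically scored with the mean end-point error (EPE) between predicted and ground-truth per-point 3D flow. A minimal sketch of that metric (generic, not the paper's evaluation code):

```python
import numpy as np

def epe(pred_flow, gt_flow):
    """Mean end-point error between predicted and ground-truth 3-D scene flow, shape (N, 3)."""
    return float(np.linalg.norm(pred_flow - gt_flow, axis=1).mean())

gt = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])      # one moving point, one static point
pred = np.array([[0.9, 0.0, 0.0], [0.0, 0.2, 0.0]])    # slight under-shoot + spurious motion
assert np.isclose(epe(pred, gt), 0.15)                  # (0.1 + 0.2) / 2
```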

[CV-15] EGLOCE: Training-Free Energy-Guided Latent Optimization for Concept Erasure

【速读】:该论文旨在解决文本到图像扩散模型中特定概念(如敏感内容、受版权保护的角色或风格)难以有效移除的问题,现有方法要么需要昂贵的再训练、修改模型参数导致无关概念保真度下降,要么依赖间接的推理时调整而削弱了概念擦除效果。其解决方案的关键在于提出一种无需训练的“能量引导潜在优化”方法(Energy-Guided Latent Optimization for Concept Erasure, EGLOCE),通过在推理阶段重新定向噪声潜在空间中的梯度路径来实现概念擦除:该方法采用双目标框架——排斥能量(repulsion energy)引导生成过程远离目标概念,保留能量(retention energy)确保与原始提示语义一致,从而在不修改模型权重的前提下显著提升擦除性能,并具备即插即用特性。

链接: https://arxiv.org/abs/2604.09405
作者: Junyeong Ahn,Seojin Yoon,Sungyong Baik
机构: KAIST AI; Hanyang University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As text-to-image diffusion models grow increasingly prevalent, the ability to remove specific concepts (mostly explicit content and many copyrighted characters or styles) has become essential for safety and compliance. Existing unlearning approaches often require costly re-training, modify parameters at the cost of degrading unrelated concept fidelity, or depend on indirect inference-time adjustments that compromise the effectiveness of concept erasure. Inspired by the success of energy-guided sampling for preservation of the condition of diffusion models, we introduce Energy-Guided Latent Optimization for Concept Erasure (EGLOCE), a training-free approach that removes unwanted concepts by re-directing noisy latents during inference. Our method employs a dual-objective framework: a repulsion energy that steers generation away from target concepts via gradient descent in latent space, and a retention energy that preserves semantic alignment to the original prompt. Compared with previous approaches that either require erroneously modified model weights or provide only weak inference-time guidance, EGLOCE operates entirely at inference and enhances erasure performance, enabling plug-and-play integration. Extensive experiments demonstrate that EGLOCE improves concept removal while maintaining image quality and prompt alignment across baselines, even with adversarial attacks. To the best of our knowledge, our work is the first to establish a new paradigm for safe and controllable image generation through dual energy-based guidance during sampling.
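
The dual-objective idea (repulsion from a concept, retention toward the prompt) can be illustrated on a toy quadratic energy over vectors; this is a hypothetical sketch of the optimization pattern, not EGLOCE's actual energies or latent space:

```python
import numpy as np

def eglo_step(z, concept, prompt, lr=0.05, lam=4.0):
    """One gradient step on a toy dual energy E = lam*||z - prompt||^2 - ||z - concept||^2:
    retention pulls z toward the prompt embedding, repulsion pushes it off the concept."""
    grad = 2 * lam * (z - prompt) - 2 * (z - concept)
    return z - lr * grad

concept = np.array([1.0, 0.0])   # hypothetical embedding of the erased concept
prompt = np.array([0.0, 1.0])    # hypothetical embedding of the original prompt
z0 = np.array([0.8, 0.3])        # noisy latent drifting toward the concept
z = z0.copy()
for _ in range(50):
    z = eglo_step(z, concept, prompt)
assert np.linalg.norm(z - concept) > np.linalg.norm(z0 - concept)  # pushed away from the concept
assert np.linalg.norm(z - prompt) < np.linalg.norm(z0 - prompt)    # pulled toward the prompt
```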

[CV-16] Efficient Unlearning through Maximizing Relearning Convergence Delay

【速读】:该论文旨在解决机器遗忘(Machine Unlearning)中现有方法无法充分评估模型对被遗忘数据真实理解程度的问题。当前方法仅关注模型预测空间的变化,忽略了权重空间的演变,导致难以准确衡量遗忘效果及潜在的数据恢复风险。其解决方案的关键在于提出一种新指标——重学习收敛延迟(relearning convergence delay),该指标同时捕捉权重空间与预测空间的变化,从而更全面地评估模型对遗忘数据的理解;在此基础上,作者进一步提出了影响消除遗忘框架(Influence Eliminating Unlearning, IEU),通过降低遗忘集性能、引入权重衰减和噪声注入等方式,在保持保留集准确率的同时有效削弱遗忘数据的影响,实验证明该方法在分类和生成式任务中均表现出优异的保留能力和抗重学习能力,并具备理论上的指数收敛性和上界保证。

链接: https://arxiv.org/abs/2604.09391
作者: Khoa Tran,Simon S. Woo
机构: Sungkyunkwan University (成均馆大学); Secure Machines Lab
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Machine unlearning poses challenges in removing mislabeled, contaminated, or problematic data from a pretrained model. Current unlearning approaches and evaluation metrics are solely focused on model predictions, which limits insight into the model's true underlying data characteristics. To address this issue, we introduce a new metric called relearning convergence delay, which captures both changes in weight space and prediction space, providing a more comprehensive assessment of the model's understanding of the forgotten dataset. This metric can be used to assess the risk of forgotten data being recovered from the unlearned model. Based on this, we propose the Influence Eliminating Unlearning framework, which removes the influence of the forgetting set by degrading its performance and incorporates weight decay and injecting noise into the model's weights, while maintaining accuracy on the retaining set. Extensive experiments show that our method outperforms existing approaches under both standard metrics and our proposed relearning convergence delay metric, approaching ideal unlearning performance. We provide theoretical guarantees, including exponential convergence and upper bounds, as well as empirical evidence of strong retention and resistance to relearning in both classification and generative unlearning tasks.
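
The intuition behind relearning convergence delay is that a thoroughly unlearned model should need more optimization to re-fit the forgotten data than one whose weights still sit near the memorizing solution. A toy sketch on a quadratic loss (hypothetical setup, not the paper's metric definition):

```python
import numpy as np

def steps_to_converge(w0, target, lr=0.2, tol=1e-3, max_steps=1000):
    """Gradient steps until ||w - target|| < tol on a quadratic loss, a toy proxy
    for re-fitting (relearning) the forgotten data."""
    w = np.array(w0, dtype=float)
    for step in range(max_steps):
        if np.linalg.norm(w - target) < tol:
            return step
        w -= lr * 2 * (w - target)
    return max_steps

target = np.array([1.0, 1.0])                      # weights that fit the forgotten set
shallow = steps_to_converge([0.9, 0.9], target)    # weakly unlearned: still near the solution
deep = steps_to_converge([-3.0, 4.0], target)      # strongly unlearned: far from the solution
assert deep > shallow   # longer relearning convergence delay = stronger unlearning
```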

[CV-17] Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing

【速读】:该论文旨在解决流模型(flow-based models)在指令引导图像编辑中因全局探索导致的噪声信用分配问题,即背景区域扰动会增加组内奖励方差,从而产生不稳定的GRPO(Generalized Reward Policy Optimization)优势估计,影响编辑区域指令遵循度与非目标区域保留效果。解决方案的关键在于提出RC-GRPO-Editing框架:通过区域解耦的初始噪声扰动实现探索局部化,降低背景引起的奖励方差;并引入注意力集中奖励机制,使跨注意力在推理过程中始终聚焦于预期编辑区域,减少非目标区域的意外变化,从而提升编辑精度和一致性。

链接: https://arxiv.org/abs/2604.09386
作者: Zhuohan Ouyang,Zhe Qian,Wenhuo Cui,Chaoqun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instruction-guided image editing requires balancing target modification with non-target preservation. Recently, flow-based models have emerged as a strong and increasingly adopted backbone for instruction-guided image editing, thanks to their high fidelity and efficient deterministic ODE sampling. Building on this foundation, GRPO-based reward-driven post-training has been explored to directly optimize editing-specific rewards, improving instruction following and editing consistency. However, existing methods often suffer from noisy credit assignment: global exploration also perturbs non-target regions, inflating within-group reward variance and yielding noisy GRPO advantages. To address this, we propose RC-GRPO-Editing, a region-constrained GRPO post-training framework for flow-based image editing under deterministic ODE sampling. It suppresses background-induced nuisance variance to enable cleaner localized credit assignment, improving editing region instruction adherence while preserving non-target content. Concretely, we localize exploration via region-decoupled initial noise perturbations to reduce background-induced reward variance and stabilize GRPO advantages, and introduce an attention concentration reward that aligns cross-attention with the intended editing region throughout the rollout, reducing unintended changes in non-target regions. Experiments on CompBench show consistent improvements in editing region instruction adherence and non-target preservation.
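
The "within-group reward variance" problem is easiest to see in the GRPO advantage itself, which standardizes each rollout's reward inside its group. A toy sketch (illustrative numbers, not the paper's rewards):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each rollout's reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

tight = [0.50, 0.52, 0.48, 0.50]   # region-constrained rollouts: background left untouched
noisy = [0.90, 0.10, 0.60, 0.40]   # global exploration: background perturbations inflate spread
assert np.std(noisy) > 5 * np.std(tight)              # nuisance variance dominates the group
assert np.isclose(grpo_advantages(tight).sum(), 0.0)  # advantages are zero-mean per group
```

With the inflated spread, an equally good edit can receive a very different advantage depending on unrelated background changes, which is the noisy credit assignment RC-GRPO-Editing suppresses.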

[CV-18] hrough Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的用户模拟器在推荐系统评估中因仅依赖文本或结构化元数据而非视觉界面而导致的仿真保真度不足的问题。现有模拟器无法捕捉真实用户浏览推荐内容时由视觉注意力驱动且高度个性化的决策过程,从而限制了其对用户行为的准确建模。解决方案的关键在于通过将视觉-语言模型(Vision-Language Model, VLM)的内部视觉注意力与个体用户的注视模式对齐,提升模拟器的感知一致性。具体而言,作者提出Fixation-Aligned Tuning for user Emulation (FixATE),首先利用可解释性操作符探测VLM的槽位级视觉注意力分布,并将其与人类注视热图进行比较;随后学习个性化软提示(soft prompts),引导模型注意力聚焦于每位用户的特定注视偏好。实验表明,该方法在多种可解释性探针和不同架构的VLM上均显著提升了注意力对齐程度与点击预测准确性,验证了“让模型像用户一样看”是提高推荐系统仿真可信度的有效路径。

链接: https://arxiv.org/abs/2604.09368
作者: Lingfeng Huang,Huizhong Guo,Tianjun Wei,Yingpeng Du,Zhu Sun
机构: Singapore University of Technology and Design(新加坡科技设计大学); Zhejiang University(浙江大学); Nanyang Technological University(南洋理工大学)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse, a critical gap since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model's (VLM's) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM's internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model's attention toward each user's characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model "see like the user" is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.
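
Attention alignment between a slot-level model distribution and a user's fixation distribution can be scored with a divergence such as KL. A hypothetical sketch (toy distributions, and the alignment score is an assumption, not necessarily the paper's exact measure):

```python
import numpy as np

def fixation_kl(model_attn, fixation, eps=1e-12):
    """KL(fixation || model attention) over carousel slots; lower means better alignment."""
    p = fixation / fixation.sum()
    q = model_attn / model_attn.sum()
    return float((p * np.log((p + eps) / (q + eps))).sum())

user_gaze = np.array([0.60, 0.25, 0.10, 0.05])    # user fixates the first carousel slots
uniform_vlm = np.array([0.25, 0.25, 0.25, 0.25])  # un-tuned model attends uniformly
tuned_vlm = np.array([0.55, 0.30, 0.10, 0.05])    # after fixation-aligned tuning (hypothetical)
assert fixation_kl(tuned_vlm, user_gaze) < fixation_kl(uniform_vlm, user_gaze)
```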

[CV-19] EpiAgent : An Agent -Centric System for Ancient Inscription Restoration CVPR2026

【速读】:该论文旨在解决古代铭文(epigraphs)在长期自然与人为因素作用下出现的视觉与文本信息交织性退化问题,这一问题严重阻碍了数字文化遗产保护中的修复精度与泛化能力。现有基于人工智能的方法多依赖固定流程,难以应对复杂多样的实际退化场景。其解决方案的关键在于提出EpiAgent——一种以代理(agent)为中心的系统架构,将铭文修复建模为分层规划问题,并采用“观察-构想-执行-重评”(Observe-Conceive-Execute-Reevaluate)范式,由大语言模型(LLM)驱动的中央规划器协调多模态分析、历史经验知识、专用修复工具及迭代自我优化机制,从而实现灵活且自适应的修复流程,显著优于传统单次通过方法,在真实退化铭文上展现出更优的修复质量与更强的泛化性能。

链接: https://arxiv.org/abs/2604.09367
作者: Shipeng Zhu,Ang Chen,Na Nie,Pengfei Fang,Min-Ling Zhang,Hui Xue
机构: Southeast University (东南大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at this https URL.

[CV-20] Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors

【速读】:该论文旨在解决动态4D场景重建中的几何模糊问题,即在动态序列中由于运动导致的结构不确定性难以准确建模。其核心解决方案在于通过多阶段不确定性建模实现动态与静态成分的解耦:首先引入熵引导的子空间投影机制,利用信息论权重自适应聚合多头注意力分布,从而从语义噪声中分离出动态运动特征;其次采用局部一致性驱动的几何净化策略,基于半径邻域约束强化空间连续性以去除结构异常点;最后构建不确定性感知的跨视角一致性优化,将多视角投影精化建模为异方差最大似然估计问题,并以深度置信度作为概率权重进行优化。该框架在无需任务特定微调或逐场景优化的前提下,实现了高效前向推理并显著提升重建精度与分割性能。

链接: https://arxiv.org/abs/2604.09366
作者: Ying Zang,Yidong Han,Chaotao Ding,Yuanqi Hu,Deyi Ji,Qi Zhu,Xuanfu Li,Jin Ma,Lingyun Sun,Tianrun Chen,Lanyun Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing dynamic 4D scenes is an important yet challenging task. While 3D foundation models like VGGT excel in static settings, they often struggle with dynamic sequences where motion causes significant geometric ambiguity. To address this, we present a framework designed to disentangle dynamic and static components by modeling uncertainty across different stages of the reconstruction process. Our approach introduces three synergistic mechanisms: (1) Entropy-Guided Subspace Projection, which leverages information-theoretic weighting to adaptively aggregate multi-head attention distributions, effectively isolating dynamic motion cues from semantic noise; (2) Local-Consistency Driven Geometry Purification, which enforces spatial continuity via radius-based neighborhood constraints to eliminate structural outliers; and (3) Uncertainty-Aware Cross-View Consistency, which formulates multi-view projection refinement as a heteroscedastic maximum likelihood estimation problem, utilizing depth confidence as a probabilistic weight. Experiments on dynamic benchmarks show that our approach outperforms current state-of-the-art methods, reducing Mean Accuracy error by 13.43% and improving segmentation F-measure by 10.49%. Our framework maintains the efficiency of feed-forward inference and requires no task-specific fine-tuning or per-scene optimization.
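
Entropy-guided aggregation of multi-head attention can be sketched in a few lines: heads with peaked (low-entropy) distributions, which are more likely to track coherent motion, receive higher weight than diffuse heads. A toy illustration assuming exponential inverse-entropy weights, not the paper's exact weighting:

```python
import numpy as np

def entropy_weighted_attention(head_maps, eps=1e-12):
    """Aggregate per-head attention maps, down-weighting high-entropy (diffuse) heads.
    head_maps: (H, N) array, each row a normalized attention distribution."""
    p = head_maps / head_maps.sum(axis=1, keepdims=True)
    ent = -(p * np.log(p + eps)).sum(axis=1)   # per-head entropy
    w = np.exp(-ent)                            # low entropy -> high weight
    w /= w.sum()
    return (w[:, None] * p).sum(axis=0)

peaked = np.array([0.90, 0.05, 0.05])   # confident head (e.g. locked onto a moving object)
flat = np.array([1 / 3, 1 / 3, 1 / 3])  # diffuse head (semantic noise)
agg = entropy_weighted_attention(np.stack([peaked, flat]))
assert agg.argmax() == 0 and agg[0] > 0.5   # aggregate dominated by the peaked head
```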

[CV-21] LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation CVPR

【速读】:该论文旨在解决单目深度估计(Monocular Depth Estimation, MDE)在月球探测场景下的域偏移(domain gap)问题,即现有基于地球环境训练的MDE模型难以适应月面极端光照、无纹理地形及缺乏大气散射等独特条件。解决方案的关键在于构建一个名为LuMon的综合性基准测试框架,其核心包括两个创新性数据集:一是来自真实嫦娥三号任务的高精度立体视觉地面真值深度数据,二是CERI暗模拟数据集;通过该框架对先进MDE架构进行零样本评估,并系统分析其在撞击坑、岩石、极端阴影和不同深度范围等关键任务挑战下的性能表现。此外,研究还建立了从仿真到真实的域自适应基线,揭示了当前方法在跨域迁移中的显著局限性,为未来外星感知与域适应研究提供了标准化评估基础。

链接: https://arxiv.org/abs/2604.09352
作者: Aytaç Sekmen,Fatih Emre Gunes,Furkan Horoz,Hüseyin Umut Işık,Mehmet Alp Ozaydin,Onur Altay Topaloglu,Şahin Umutcan Üstündaş,Yurdasen Alp Yeni,Halil Ersin Soken,Erol Sahin,Ramazan Gokberk Cinbis,Sinan Kalkan
机构: Middle East Technical University (中东技术大学); ROMER
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper will be published in CVPRW2026

点击查看摘要

Abstract:Monocular Depth Estimation (MDE) is crucial for autonomous lunar rover navigation using electro-optical cameras. However, deploying terrestrial MDE networks to the Moon brings a severe domain gap due to harsh shadows, textureless regolith, and zero atmospheric scattering. Existing evaluations rely on analogs that fail to replicate these conditions and lack actual metric ground truth. To address this, we present LuMon, a comprehensive benchmarking framework to evaluate MDE methods for lunar exploration. We introduce novel datasets featuring high-quality stereo ground truth depth from the real Chang'e-3 mission and the CHERI dark analog dataset. Utilizing this framework, we conduct a systematic zero-shot evaluation of state-of-the-art architectures across synthetic, analog, and real datasets. We rigorously assess performance against mission-critical challenges like craters, rocks, extreme shading, and varying depth ranges. Furthermore, we establish a sim-to-real domain adaptation baseline by fine-tuning a foundation model on synthetic data. While this adaptation yields drastic in-domain performance gains, it exhibits minimal generalization to authentic lunar imagery, highlighting a persistent cross-domain transfer gap. Our extensive analysis reveals the inherent limitations of current networks and sets a standard foundation to guide future advancements in extraterrestrial perception and domain adaptation.
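
MDE benchmarks of this kind are conventionally scored with absolute relative error and the delta < 1.25 accuracy. A minimal sketch of those two standard metrics (generic, not LuMon's evaluation code):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard MDE metrics: absolute relative error and delta < 1.25 accuracy."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = float(np.mean(ratio < 1.25))
    return abs_rel, delta1

gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 1.9, 4.0, 16.0])   # last pixel is off by 2x (e.g. inside a deep shadow)
abs_rel, delta1 = depth_metrics(pred, gt)
assert delta1 == 0.75                                  # 3 of 4 pixels within the 1.25 ratio bound
assert abs(abs_rel - (0.1 + 0.05 + 0.0 + 1.0) / 4) < 1e-9
```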

[CV-22] VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

【速读】:该论文旨在解决当前机器人世界模型(World Model, WM)在政策学习中难以生成与动作轨迹对齐的视频数据的问题,尤其是在使用合成数据进行预训练时,现有方法要么缺乏强视频-动作一致性(如World-Action模型),要么依赖两阶段流程导致效率低下和误差累积。其解决方案的关键在于提出一种基于流匹配(flow-matching)的统一双流框架VAG(Video-Action Generation),该框架在视觉和语言条件驱动下联合生成视频与动作序列,通过同步两个分支的去噪过程,并引入自适应3D池化机制将紧凑的全局视频上下文传递至动作分支,从而显著提升跨模态一致性,实现高质量且可执行的动作轨迹重建,为机器人提供有效的合成预训练数据。

链接: https://arxiv.org/abs/2604.09330
作者: Xiaolei Lang,Yang Wang,Yukun Zhou,Chaojun Ni,Kerui Li,Jiagang Zhu,Tianze Liu,Jiajun Lv,Xingxing Zuo,Yun Ye,Guan Huang,Xiaofeng Wang,Zheng Zhu
机构: GigaAI; Zhejiang University (浙江大学); Peking University (北京大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Robotics Department, Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学机器人系)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.
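
Flow matching with a linear interpolation path, the family of objectives this framework builds on, has a simple closed form: the training target velocity for x_t = (1 - t) x0 + t x1 is x1 - x0. A toy sketch (generic flow matching, not VAG's dual-stream model):

```python
import numpy as np

def fm_pair(x0, x1, t):
    """Linear-path flow matching: sample point x_t and its target velocity v* = x1 - x0."""
    return (1 - t) * x0 + t * x1, x1 - x0

x0, x1 = np.array([0.0, 0.0]), np.array([2.0, -1.0])   # noise sample and data sample
x = x0.copy()
for k in range(10):                                    # Euler-integrate the target velocity field
    _, v = fm_pair(x0, x1, k / 10)
    x = x + 0.1 * v
assert np.allclose(x, x1)   # following v* transports the noise sample onto the data sample
```

In VAG, the video and action branches denoise under two such flows that are kept synchronized step-by-step, which is what enforces cross-modal consistency.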

[CV-23] From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection

【速读】:该论文旨在解决当前基于姿态的视频异常检测(Pose-based Video Anomaly Detection, VAD)研究中普遍存在的评估范式偏差问题。现有方法多采用帧级(frame-level)指标进行性能评估,但这种评估方式忽略了真实场景中异常事件具有连续时间片段特征(即存在明确起始点和持续时间的事件),导致模型在实际部署时难以生成可靠、可操作的事件级警报。解决方案的关键在于提出一种以事件为中心(event-centric)的新视角:首先系统性分析主流VAD基准数据集(如SHT、CHAD、NWPUC、HuVAD)的事件结构;其次设计两种用于时间事件定位的策略——一种是基于层次化高斯平滑与自适应二值化的评分精炼流程,另一种是端到端的双分支模型直接输出事件级检测结果;最后建立首个基于时间动作定位指标(如tIoU匹配和多阈值F1)的事件级评估标准,从而揭示了当前最先进模型在事件级定位上的显著性能下降(平均F1仅为0.11),凸显了从帧级向事件级评估转变的必要性。
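
文中事件级评估的核心是基于 tIoU 的事件匹配与多阈值 F1。下面按摘要描述写一个贪心一对一匹配的极简示意(非论文官方实现):

```python
def t_iou(a, b):
    """两个 (start, end) 时间区间的 temporal IoU。"""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def event_f1(pred, gt, thr=0.2):
    """在给定 tIoU 阈值下,对预测事件与真值事件做贪心一对一匹配,返回 P/R/F1。"""
    matched, tp = set(), 0
    for p in pred:
        best, best_j = 0.0, -1
        for j, g in enumerate(gt):
            if j in matched:
                continue
            iou = t_iou(p, g)
            if iou > best:
                best, best_j = iou, j
        if best >= thr:
            matched.add(best_j)
            tp += 1
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gt) if gt else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gt = [(10, 20), (50, 70)]          # 两个真值异常事件
pred = [(12, 22), (100, 110)]      # 第二个预测完全错位
prec, rec, f1 = event_f1(pred, gt, thr=0.2)
```

多阈值 F1 只需在一组 thr(如 0.2–0.7)上重复上述评估再取平均。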

链接: https://arxiv.org/abs/2604.09327
作者: Narges Rashvand,Shanle Yao,Armin Danesh Pazho,Babak Rahimi Ardabili,Hamed Tabkhi
机构: University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event, a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction, and as a result, they systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on the NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU=0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at this https URL.

[CV-24] Multimodal Anomaly Detection for Human-Robot Interaction

【速读】:该论文旨在解决人机协作(Human-Robot Interaction, HRI)中异常事件的及时检测问题,以保障系统安全与可靠性。其核心挑战在于如何有效识别协作任务中由外部环境变化或机器人内部故障引起的偏离正常行为的异常情况。解决方案的关键在于提出一种名为MADRI的框架,该框架首先将视频流转换为语义丰富的特征向量(feature vectors),在此基础上进行基于重构的异常检测;同时,通过融合机器人内部传感器数据和场景图(Scene Graph)信息,实现对视觉环境异常和机器人自身状态异常的联合建模,从而提升多模态特征重构在人机协作场景下的异常检测鲁棒性。
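
基于重构的异常检测思路可以用一个极简示意说明:用正常样本拟合一个低维子空间(这里以 PCA 代替论文中学习到的重构模型,纯属笔者假设),重构误差即异常分数:

```python
import numpy as np

rng = np.random.default_rng(0)
# “正常”特征向量:由 2 维隐变量线性生成,落在 16 维空间的一个低维子空间里
normal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 16))

# 用 PCA(SVD)充当学习到的重构模型
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
basis = vt[:2]  # 前 2 个主方向

def recon_error(x):
    """异常分数 = 投影到正常子空间并重构后的平方误差。"""
    z = (x - mean) @ basis.T
    x_hat = mean + z @ basis
    return float(((x - x_hat) ** 2).sum())

ok = recon_error(normal[0])          # 正常样本:误差接近 0
bad = recon_error(normal[0] + 5.0)   # 整体偏移后脱离子空间:误差显著
```

论文中的特征向量来自视频语义编码,并与机器人传感器读数和场景图拼接;这里只演示“重构误差作为异常分数”这一核心机制。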

链接: https://arxiv.org/abs/2604.09326
作者: Guilherme Ribeiro,Iordanis Antypas,Leonardo Bizzaro,João Bimbo,Nuno Cruz Garcia
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ensuring safety and reliability in human-robot interaction (HRI) requires the timely detection of unexpected events that could lead to system failures or unsafe behaviours. Anomaly detection thus plays a critical role in enabling robots to recognize and respond to deviations from normal operation during collaborative tasks. While reconstruction models have been actively explored in HRI, approaches that operate directly on feature vectors remain largely unexplored. In this work, we propose MADRI, a framework that first transforms video streams into semantically meaningful feature vectors before performing reconstruction-based anomaly detection. Additionally, we augment these visual feature vectors with the robot’s internal sensors’ readings and a Scene Graph, enabling the model to capture both external anomalies in the visual environment and internal failures within the robot itself. To evaluate our approach, we collected a custom dataset consisting of a simple pick-and-place robotic task under normal and anomalous conditions. Experimental results demonstrate that reconstruction on vision-based feature vectors alone is effective for detecting anomalies, while incorporating other modalities further improves detection performance, highlighting the benefits of multimodal feature reconstruction for robust anomaly detection in human-robot collaboration.

[CV-25] Structure-Aware Fine-Grained Gaussian Splatting for Expressive Avatar Reconstruction

【速读】:该论文旨在解决从单目视频中重建具有逼真纹理和拓扑感知能力的全身人体虚拟形象(human avatar)这一挑战,现有方法虽能有效捕捉身体运动,但在手部动作和面部表情等细节建模上表现不足。解决方案的关键在于提出结构感知的细粒度高斯点绘制方法(Structure-aware Fine-grained Gaussian Splatting, SFGS),其创新性地结合仅空间的三平面(spatial-only triplane)与时间感知的六平面(time-aware hexplane)以捕捉连续帧间的动态特征,并设计结构感知高斯模块来实现姿态依赖细节的空间一致性建模,从而提升姿态和纹理表达能力;此外,引入基于细粒度手部重建的残差优化模块进一步改善手部形变建模,整体采用单阶段训练即可生成高质量、具自然运动且细节丰富的3D人体虚拟形象。
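
文中“空间三平面(triplane)”表示可以这样理解:把 3D 查询点分别投影到 XY/XZ/YZ 三个特征平面取特征再求和。下面是一个最近邻查表的极简示意(实际实现通常用双线性插值;形状与参数均为笔者假设):

```python
import numpy as np

def triplane_query(planes, p):
    """查询 spatial-only triplane:3D 点投影到三个平面取特征并求和。
    planes 形状为 (3, R, R, C),点坐标假设已归一化到 [0, 1)。"""
    x, y, z = p
    R = planes.shape[1]  # 平面分辨率

    def grid(u, v):
        # 最近邻索引(示意;实际多用双线性插值)
        return min(int(u * R), R - 1), min(int(v * R), R - 1)

    f = np.zeros(planes.shape[-1])
    for k, (u, v) in enumerate([(x, y), (x, z), (y, z)]):
        i, j = grid(u, v)
        f += planes[k, i, j]
    return f

planes = np.ones((3, 4, 4, 2))          # 3 个平面、4x4 分辨率、2 通道
feat = triplane_query(planes, (0.3, 0.7, 0.5))
```

论文在此之外还引入时间感知的 hexplane,等价于把查询坐标扩展到 (x, y, z, t) 的六个平面组合。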

链接: https://arxiv.org/abs/2604.09324
作者: Yuze Su,Hongsong Wang,Jie Gui,Liang Wang
机构: Southeast University (东南大学); Purple Mountain Laboratories (紫金山实验室); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The code is on Github: this https URL

点击查看摘要

Abstract:Reconstructing photorealistic and topology-aware human avatars from monocular videos remains a significant challenge in the fields of computer vision and graphics. While existing 3D human avatar modeling approaches can effectively capture body motion, they often fail to accurately model fine details such as hand movements and facial expressions. To address this, we propose Structure-aware Fine-grained Gaussian Splatting (SFGS), a novel method for reconstructing expressive and coherent full-body 3D human avatars from a monocular video sequence. SFGS uses both a spatial-only triplane and a time-aware hexplane to capture dynamic features across consecutive frames. A structure-aware Gaussian module is designed to capture pose-dependent details in a spatially coherent manner and improve pose and texture expression. To better model hand deformations, we also propose a residual refinement module based on fine-grained hand reconstruction. Our method requires only a single-stage training and outperforms state-of-the-art baselines in both quantitative and qualitative evaluations, generating high-fidelity avatars with natural motion and fine details. The code is on Github: this https URL

[CV-26] VAGNet: Vision-based accident anticipation with global features

【速读】:该论文旨在解决交通事故预测中的实时性与计算效率问题,即如何在不依赖复杂对象级特征提取的前提下,实现基于行车记录仪视频的高效事故预判。现有方法通常需对每个检测到的物体进行特征提取,导致计算开销大、难以满足实时性要求。其解决方案的关键在于提出VAGNet——一种融合Transformer与图结构模块的深度神经网络,利用视觉基础模型VideoMAE-V2直接从全局交通场景中提取特征,从而避免显式的目标检测和逐对象分析,显著提升预测精度与响应时间的同时降低计算复杂度。
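
事故预判(anticipation)任务常用随时间指数加权的损失,鼓励模型越临近事故越早给出高置信度。下面是一个玩具版示意(非 VAGNet 的原始损失,权重形式为笔者假设):

```python
import math

def anticipation_loss(probs, accident_frame):
    """指数加权的预判损失示意:越接近事故帧,低置信预测的惩罚越大。"""
    loss = 0.0
    for t, p in enumerate(probs):
        w = math.exp(-max(accident_frame - t, 0))  # 距事故越近权重越大
        loss += -w * math.log(max(p, 1e-8))
    return loss / len(probs)

early = [0.1, 0.2, 0.6, 0.9]  # 置信度在事故前提早上升
late = [0.1, 0.1, 0.1, 0.9]   # 直到最后一刻才升高
l_early = anticipation_loss(early, accident_frame=3)
l_late = anticipation_loss(late, accident_frame=3)
```

提早预警的序列损失更小,这正是平均预警时间(mean time-to-accident)指标想奖励的行为。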

链接: https://arxiv.org/abs/2604.09305
作者: Vipooshan Vipulananthan,Charith D. Chitraranjan
机构: University of Moratuwa (莫鲁塔瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traffic accidents are a leading cause of fatalities and injuries across the globe. Therefore, the ability to anticipate hazardous situations in advance is essential. Automated accident anticipation enables timely intervention through driver alerts and collision avoidance maneuvers, forming a key component of advanced driver assistance systems. In autonomous driving, such predictive capabilities support proactive safety behaviors, such as initiating defensive driving and human takeover when required. Using dashcam video as input offers a cost-effective solution, but it is challenging due to the complexity of real-world driving scenes. Accident anticipation systems need to operate in real-time. However, current methods involve extracting features from each detected object, which is computationally intensive. We propose VAGNet, a deep neural network that learns to predict accidents from dash-cam video using global features of traffic scenes without requiring explicit object-level features. The network consists of transformer and graph modules, and we use the vision foundation model VideoMAE-V2 for global feature extraction. Experiments on four benchmark datasets (DAD, DoTA, DADA, and Nexar) show that our method anticipates accidents with higher average precision and mean time-to-accident while being computationally more efficient compared to existing methods.

[CV-27] GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic

【速读】:该论文旨在解决从物理基础渲染(Physically-Based Rendering, PBR)到摄影真实感渲染(Photorealistic Rendering, PRR)之间的“P2P间隙”问题,即如何在保持物理真实性的同时实现感知上的摄影级真实感。当前路径面临两难:显式模拟PRR受限于现实中难以获取的高保真数字模型,而隐式生成模型则牺牲了控制性和几何一致性。解决方案的关键在于提出首个多模态生成渲染模型GeRM,其核心是将PBR到PRR的过渡建模为分布迁移(distribution transfer),并学习一个分布迁移向量场(Distribution Transfer Vector Field, DTV Field)来引导这一过程;通过构建专家指导的成对数据集P2P-50K和设计多条件ControlNet架构,GeRM能够融合G-buffers、文本提示及增强区域线索,实现可控的渐进式图像生成,从而在严格物理保真与感知真实感之间提供平滑过渡。
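
“分布迁移向量场(DTV Field)”的直观含义是:每个 PBR 样本对应一个指向 PRR 分布的迁移向量,沿向量插值即可在物理保真与摄影真实感之间连续滑动。下面是一个纯示意的向量化版本(论文中 v 由网络学习并作用于图像潜变量,这里手工给定):

```python
def transfer_step(x_pbr, v, t):
    """沿迁移向量 v 把 PBR 潜变量推向 PRR 分布;t ∈ [0, 1] 控制
    结果落在“严格物理保真—感知摄影真实感”连续谱上的位置。"""
    return [a + t * b for a, b in zip(x_pbr, v)]

x = [0.0, 1.0, 2.0]        # 某 PBR 样本的玩具潜变量
v = [1.0, -1.0, 0.5]       # 该样本学到的迁移向量(此处为假设值)
mid = transfer_step(x, v, 0.5)   # PBR 与 PRR 之间的中间态
full = transfer_step(x, v, 1.0)  # 完全迁移到 PRR 端
```

这解释了摘要中“用户可在两端之间流畅导航”的机制:t 即导航旋钮。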

链接: https://arxiv.org/abs/2604.09304
作者: Jiayuan Lu,Rengan Xie,Xuancheng Jin,Zhizhen Wu,Qi Ye,Tian Xie,Hujun Bao,Rui Wang,Yuchi Huo
机构: Zhejiang University (浙江大学); Zhejiang Lab (浙江实验室); State Key Lab of CADCG (CADCG国家重点实验室); State Key Laboratory of Industrial Control Technology (工业控制技术国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:For decades, Physically-Based Rendering (PBR) has been the foundation of synthesizing photorealistic images, and is therefore sometimes roughly referred to as Photorealistic Rendering (PRR). While PBR is indeed a mathematical simulation of light transport that guarantees physical reality, photorealism additionally relies on realistic digital models of the geometry and appearance of the real world, leaving a barely explored gap from PBR to PRR (P2P). Consequently, the path toward photorealism faces a critical dilemma: the explicit simulation of PRR is encumbered by the unavailability of realistic digital models of real-world scenes, while implicit generation models sacrifice controllability and geometric consistency. Based on this insight, this paper presents the problem, data, and approach for mitigating the P2P gap, followed by the first multi-modal generative rendering model, dubbed GeRM, to unify PBR and PRR. GeRM integrates physical attributes like G-buffers with text prompts, and progressive incremental injection to generate controllable photorealistic images, allowing users to fluidly navigate the continuum between strict physical fidelity and perceptual photorealism. Technically, we model the transition between PBR and PRR images as a distribution transfer and aim to learn a distribution transfer vector field (DTV Field) to guide this process. To define the learning objective, we first leverage a multi-agent VLM framework to construct an expert-guided pairwise P2P transfer dataset, named P2P-50K, where each paired sample in the dataset corresponds to a transfer vector in the DTV Field. Subsequently, we propose a multi-condition ControlNet to learn the DTV Field, which synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions.

[CV-28] Characterizing Lidar Range-Measurement Ambiguity due to Multiple Returns ALT

【速读】:该论文旨在解决激光雷达(Lidar)在道路环境中因多表面散射导致的测量不确定性问题,特别是当同一射线路径上存在多个反射面时,传统算法假设唯一返回信号的合理性受到挑战。其解决方案的关键在于通过分析两个不同旋转式激光雷达在静止状态下采集的数据集,构建代表性累积分布函数(CDF),以量化多返回概率特性,并提出一种定性方法评估此类概率性多回波对基于激光雷达定位(lidar-based localization)性能的影响。
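
论文用经验累积分布函数(CDF)刻画同一射线路径上多回波的概率特性。一个最小的经验 CDF 计算示意如下(测距数据为笔者虚构的双表面场景):

```python
def empirical_cdf(samples):
    """对某条射线路径上观测到的回波距离构造经验 CDF:
    返回排序后的取值及 P(X <= x)。"""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

# 玩具射线:回波在两个散射表面(约 10 m 与约 15 m)之间随机切换
ranges = [10.0, 10.1, 10.0, 14.9, 15.0, 10.1, 15.1, 15.0]
xs, cdf = empirical_cdf(ranges)
frac_near = sum(1 for r in ranges if r < 12.0) / len(ranges)  # 近表面回波占比
```

在这种双峰 CDF 下,“单条射线唯一回波”的假设显然不成立,这正是论文要定量刻画的现象。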

链接: https://arxiv.org/abs/2604.09282
作者: Jason H. Rife,Yifan Li
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Proceedings of the 38th International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS+ 2025), Baltimore, Maryland, September 2025, pp. 1949-1963

点击查看摘要

Abstract:Reliable position and attitude sensing is critical for highly automated vehicles that operate on conventional roadways. Lidar sensors are increasingly incorporated into pose-estimation systems. Despite its great utility, lidar is a complex sensor, and its performance in roadway environments is not yet well understood. For instance, it is often assumed in lidar-localization algorithms that a lidar will always identify a unique surface along a given raypath. However, this assumption is not always true, as ample prior evidence exists to suggest that lidar units may generate measurements probabilistically when more than one scattering surface appears within the lidar’s conical beam. In this paper, we analyze lidar datasets to characterize cases with probabilistic returns along particular raypaths. Our contribution is to present representative cumulative distribution functions (CDFs) for raypaths observed by two different mechanically rotating lidar units with stationary bases. In subsequent discussion, we outline a qualitative methodology to assess the effect of probabilistic multi-return cases on lidar-based localization.

[CV-29] Beyond Segmentation: Structurally Informed Facade Parsing from Imperfect Images

【速读】:该论文旨在解决标准目标检测器在建筑立面解析中缺乏结构一致性的问题,即检测结果往往独立处理各个建筑元素,导致后续程序化重建所需的几何合理性不足。其解决方案的关键在于通过引入一种轻量级对齐损失(alignment loss)来增强YOLOv8的训练目标,该损失在不改变标准推理流程的前提下,引导边界框在训练过程中形成网格一致性的排列,从而有效注入几何先验信息,提升结构规则性并纠正因透视和遮挡引起的对齐错误,同时保持与检测精度之间的可控权衡。
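
对齐损失的核心想法可以用一个玩具版公式说明:同一行(或列)的检测框中心应共享坐标,惩罚行内中心坐标的方差。以下实现仅为笔者据摘要写的示意,行归属 rows 在实际训练中需由模型或先验给出:

```python
import numpy as np

def alignment_loss(boxes, rows):
    """玩具版网格对齐正则:同一立面行内的框应共享中心 y 坐标,
    惩罚行内 center-y 的方差(列方向可对 center-x 同理)。"""
    boxes = np.asarray(boxes, dtype=float)          # (x0, y0, x1, y1)
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    loss = 0.0
    for r in set(rows):
        ys = cy[[i for i, rid in enumerate(rows) if rid == r]]
        loss += float(((ys - ys.mean()) ** 2).mean())
    return loss

aligned = [(0, 10, 5, 20), (10, 10, 15, 20)]   # 同行两扇窗,中心 y 一致
skewed = [(0, 10, 5, 20), (10, 14, 15, 24)]    # 其中一扇因透视发生漂移
l_ok = alignment_loss(aligned, [0, 0])
l_bad = alignment_loss(skewed, [0, 0])
```

训练时把该项加到 YOLOv8 的标准目标上即可在不改动推理流程的前提下注入几何先验。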

链接: https://arxiv.org/abs/2604.09260
作者: Maciej Janicki,Aleksander Plocharski,Przemyslaw Musialski
机构: Warsaw University of Technology (华沙理工大学); Akces NCBR; New Jersey Institute of Technology (新泽西理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 4 pages, 4 figures, EUROGRAPHICS 2026 Short Paper

点击查看摘要

Abstract:Standard object detectors typically treat architectural elements independently, often resulting in facade parsings that lack the structural coherence required for downstream procedural reconstruction. We address this limitation by augmenting the YOLOv8 training objective with a custom lightweight alignment loss. This regularization encourages grid-consistent arrangements of bounding boxes during training, effectively injecting geometric priors without altering the standard inference pipeline. Experiments on the CMP dataset demonstrate that our method successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion while maintaining a controllable trade-off with standard detection accuracy.

[CV-30] Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

【速读】:该论文旨在解决生成式 AI(Generative AI)中视觉-语言模型(Vision-Language Models, VLMs)在面对多模态越狱攻击(multimodal jailbreak attacks)时的脆弱性问题,尤其关注现有攻击方法在异构封闭源VLM环境下的有效性不足。现有方法要么依赖易被检测的显式视觉提示攻击,要么采用梯度优化策略但主要在同质开源替代-目标设置下训练和评估,难以迁移至商业闭源VLM场景。为此,作者提出Mosaic框架,其核心创新在于通过多视角集成优化机制缓解“替代依赖”(surrogate dependency)现象:具体包含三个关键组件——文本侧变换模块扰动拒绝敏感词汇模式、多视图图像优化模块在多种裁剪视角下更新扰动以避免单一视觉过拟合、以及替代集成引导模块聚合多个替代模型的优化信号以降低特定替代模型偏差。实验证明,Mosaic在商业闭源VLM上实现了当前最优的攻击成功率与平均毒性水平。
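
“替代集成引导”在对抗优化里的常见做法是聚合多个替代模型/多视角的梯度估计,再取符号做一步 FGSM 式更新,避免过拟合单一替代。极简示意如下(梯度为手工构造,仅演示聚合逻辑):

```python
import numpy as np

def ensemble_direction(grads):
    """把多个替代模型/视角的梯度估计平均后取符号,得到单步更新方向。"""
    g = np.mean(np.stack(grads), axis=0)
    return np.sign(g)

# 三个替代/视角的梯度估计,在大部分维度上方向一致
g1 = np.array([0.5, -0.2, 0.1, -0.9])
g2 = np.array([0.4, -0.3, -0.2, -0.8])
g3 = np.array([0.6, -0.1, 0.2, -1.0])
step = ensemble_direction([g1, g2, g3])
```

多视图图像优化模块的做法类似:在多个随机裁剪视角下分别估计梯度再聚合,从而降低对单一视觉视角的依赖。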

链接: https://arxiv.org/abs/2604.09253
作者: Yuqin Lan,Gen Li,Yuanze Hu,Weihao Shen,Zhaoxin Fan,Faguo Wu,Xiao Zhang,Laurence T. Yang,Zhiming Zheng
机构: Beihang University(北京航空航天大学); Huazhong University of Science and Technology(华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14pages, 9 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.

[CV-31] 2D or 3D: Who Governs Salience in VLA Models? – Tri-Stage Token Pruning Framework with Modality Salience Awareness

【速读】:该论文旨在解决多视觉模态视觉-语言-动作(Multi-Visual-Modal Vision-Language-Action, MVLA)模型在引入3D输入模态后,因输入token数量激增而导致的推理加速需求问题。现有基于2D-only VLA模型设计的token pruning方法未能考虑2D与3D模态在任务中的显著性差异,导致剪枝效果不佳。解决方案的关键在于提出一种三阶段分析框架,系统捕捉2D/3D模态在MVLA模型应用流程中的显著性差异与动态变化,并据此设计相应的三阶段token剪枝策略,实现对2D和3D token的最优选择与高效剪枝,从而在仅增加5.8%计算开销的前提下,达到最高2.55倍的推理速度提升且保持最小精度损失。
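
token 剪枝最基本的形式是按显著性分数保留 top-k。下面的示意略去了论文中 2D/3D 模态差异的三阶段设计,只演示“按显著性选 token 并保持原顺序”这一步(显著性分数为手工构造):

```python
import numpy as np

def prune_tokens(tokens, salience, keep_ratio=0.5):
    """按显著性保留 top-k token;salience 可来自注意力权重等模态显著性度量。"""
    k = max(1, int(len(tokens) * keep_ratio))
    idx = np.argsort(salience)[::-1][:k]   # 显著性降序取前 k 个
    return tokens[np.sort(idx)]            # 恢复原始 token 顺序

tokens = np.arange(8).reshape(8, 1)        # 8 个玩具 token,每个 1 维
salience = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])
kept = prune_tokens(tokens, salience, keep_ratio=0.5)
```

论文的贡献在于:2D 与 3D token 的显著性分布不同且随阶段动态变化,因此 keep_ratio 与打分方式需要按模态、按阶段分别设计。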

链接: https://arxiv.org/abs/2604.09244
作者: Zihao Zheng,Sicheng Tian,Zhihao Mao,Lingyue Zhang,Chenyue Li,Ziyun Zhang,Hong Gao,Yuchen Huang,Yutong Xu,Guojie Luo,Xiang Chen
机构: Peking University (北京大学); ZTE Corporation (中兴通讯); Beijing Normal University (北京师范大学); China University of Geosciences (Wuhan) (中国地质大学(武汉)); School of Electronics Engineering and Computer Science, Peking University (北京大学电子工程与计算机科学学院)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have emerged as the mainstream of embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite achieving improved spatial perception, MVLA faces a greater acceleration demand due to the increased number of input tokens caused by modal expansion. Token pruning is an effective optimization method tailored to MVLA models. However, existing token pruning schemes are designed for 2D-only VLA models, ignoring 2D/3D modality salience differences. In this paper, we follow the application process of multi-modal data in MVLA models and develop a tri-stage analysis to capture the discrepancy and dynamics of 2D/3D modality salience. Based on these, we propose a corresponding tri-stage token pruning framework for MVLA models to achieve optimal 2D/3D token selection and efficient pruning. Experiments show that our framework achieves up to a 2.55x inference speedup with minimal accuracy loss, while only costing 5.8% overhead. Our Code is coming soon.

[CV-32] Neural Distribution Prior for LiDAR Out-of-Distribution Detection CVPR2026

【速读】:该论文旨在解决LiDAR感知模型在开放世界场景下对分布外(Out-of-Distribution, OOD)物体识别能力不足的问题。当前模型多基于封闭集假设,在面对未见过的物体时性能显著下降,且现有OOD评分函数因忽略LiDAR数据中固有的类别不平衡问题而表现受限。解决方案的关键在于提出神经分布先验(Neural Distribution Prior, NDP)框架:该框架通过建模网络预测的分布结构,并基于学习到的分布先验自适应地重加权OOD分数;同时引入基于Perlin噪声的OOD合成策略,从输入扫描中生成多样化辅助OOD样本,从而无需依赖外部数据即可实现鲁棒的OOD训练。此方法有效缓解了类别依赖的置信度偏差,显著提升了点级OOD检测性能。
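
“按类别先验重加权 OOD 分数”的直觉是:稀有类上的低置信不应被直接当成 OOD。下面是一个玩具版示意(打分与加权形式均为笔者假设,非论文中基于注意力的 NDP 模块):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reweighted_ood_score(logits, class_prior):
    """玩具 OOD 分数:1 - 最大 softmax 概率,再乘以预测类的先验频率,
    使稀有类(先验小)的不确定性不被误判为 OOD。"""
    probs = softmax(logits)
    k = probs.index(max(probs))
    base = 1.0 - probs[k]            # 越大越像 OOD
    return base * class_prior[k]     # 按类别先验重加权

prior = [0.9, 0.1]                   # 类 0 常见,类 1 稀有
s_common = reweighted_ood_score([2.0, 0.0], prior)  # 置信地预测常见类
s_rare = reweighted_ood_score([0.0, 2.0], prior)    # 同等置信地预测稀有类
```

重加权后,稀有类样本的 OOD 分数被压低,缓解了 LiDAR 数据中类别不平衡导致的置信偏差。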

链接: https://arxiv.org/abs/2604.09232
作者: Zizhao Li,Zhengkang Xiang,Jiayang Ao,Feng Liu,Joseph West,Kourosh Khoshelham
机构: The University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026

点击查看摘要

Abstract:LiDAR-based perception is critical for autonomous driving due to its robustness to poor lighting and visibility conditions. Yet, current models operate under the closed-set assumption and often fail to recognize unexpected out-of-distribution (OOD) objects in the open world. Existing OOD scoring functions exhibit limited performance because they ignore the pronounced class imbalance inherent in LiDAR OOD detection and assume a uniform class distribution. To address this limitation, we propose the Neural Distribution Prior (NDP), a framework that models the distributional structure of network predictions and adaptively reweights OOD scores based on alignment with a learned distribution prior. NDP dynamically captures the logit distribution patterns of training data and corrects class-dependent confidence bias through an attention-based module. We further introduce a Perlin noise-based OOD synthesis strategy that generates diverse auxiliary OOD samples from input scans, enabling robust OOD training without external datasets. Extensive experiments on the SemanticKITTI and STU benchmarks demonstrate that NDP substantially improves OOD detection performance, achieving a point-level AP of 61.31% on the STU test set, which is more than 10x higher than the previous best result. Our framework is compatible with various existing OOD scoring formulations, providing an effective solution for open-world LiDAR perception.

[CV-33] Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

【速读】:该论文旨在解决现有3D纹理生成方法中存在的三大问题:纹理覆盖不完整、跨视角一致性差以及几何与纹理之间的错位。其解决方案的关键在于提出Hitem3D 2.0,一个基于多视角引导的原生3D纹理生成框架,通过融合2D多视角生成先验与原生3D纹理表示来提升纹理质量。该框架包含两个核心组件:一个多视角合成模块(基于预训练图像编辑骨干网络并集成可插拔模块以强化几何对齐、跨视角一致性和光照均匀性)和一个原生3D纹理生成模型,后者在给定多视角图像和3D几何的前提下,将多视角纹理投影到3D表面,并合理补全未见区域的纹理。通过将多视角一致性约束与原生3D纹理建模相结合,Hitem3D 2.0显著提升了纹理完整性、跨视角一致性及几何对齐度。
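
多视角纹理投影的一个基础步骤是为每个面片挑选“看得最正”的源视角,无可见视角的面片留给生成模型补全。极简示意如下(笔者假设的简化版,未考虑遮挡与多视角混合):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def best_view(face_normal, view_dirs):
    """为一个面片选择与其法线最“反平行”的视角(即正对该面);
    所有视角都看不到的面片返回 None,交由生成模型补全纹理。"""
    scores = [-dot(face_normal, v) for v in view_dirs]
    best = max(range(len(view_dirs)), key=lambda i: scores[i])
    return best if scores[best] > 0 else None

views = [(0, 0, -1), (1, 0, 0), (-1, 0, 0)]   # 前、左、右三个相机的视线方向
front_face = best_view((0, 0, 1), views)      # 正对相机 0 的面
hidden = best_view((0, 1, 0), views)          # 顶面,任何相机都看不到
```

Hitem3D 2.0 的原生 3D 纹理模型正是在这种“可见区域投影 + 不可见区域生成补全”的框架上保证跨视角一致性。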

链接: https://arxiv.org/abs/2604.09231
作者: Huiang He,Shengchu Zhao,Jianwen Huang,Jie Li,Jiaqi Wu,Hu Zhang,Pei Tang,Heliang Zheng,Yukun Li,Rongfei Jia
机构: Math Magic; South China University of Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages

点击查看摘要

Abstract:Although recent advances have improved the quality of 3D texture generation, existing methods still struggle with incomplete texture coverage, cross-view inconsistency, and misalignment between geometry and texture. To address these limitations, we propose Hitem3D 2.0, a multi-view guided native 3D texture generation framework that enhances texture quality through the integration of 2D multi-view generation priors and native 3D texture representations. Hitem3D 2.0 comprises two key components: a multi-view synthesis framework and a native 3D texture generation model. The multi-view generation is built upon a pre-trained image editing backbone and incorporates plug-and-play modules that explicitly promote geometric alignment, cross-view consistency, and illumination uniformity, thereby enabling the synthesis of high-fidelity multi-view images. Conditioned on the generated views and 3D geometry, the native 3D texture generation model projects multi-view textures onto 3D surfaces while plausibly completing textures in unseen regions. Through the integration of multi-view consistency constraints with native 3D texture modeling, Hitem3D 2.0 significantly improves texture completeness, cross-view coherence, and geometric alignment. Experimental results demonstrate that Hitem3D 2.0 outperforms existing methods in terms of texture detail, fidelity, consistency, coherence, and alignment.

[CV-34] nyNeRV: Compact Neural Video Representations via Capacity Scaling Distillation and Low-Precision Inference

【速读】:该论文旨在解决现有神经视频表示(Neural Representations for Videos, NeRV)方法在资源受限环境下的部署难题,特别是针对极小模型配置下重建质量下降、计算复杂度高及推理效率低的问题。其关键解决方案在于提出两种轻量级架构——NeRV-T 和 NeRV-T+,通过系统性地压缩模型容量并引入频率感知焦点监督的知识蒸馏策略,在不增加推理成本的前提下显著提升低容量网络的重建保真度;同时结合后训练量化与量化感知训练技术,评估并增强小型模型在低精度推理下的鲁棒性,从而实现参数量、计算开销和内存占用的大幅降低,同时保持良好的质量-效率权衡。
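
训练后量化(PTQ)的最小形式是对权重做对称均匀量化。下面的示意说明 8-bit 量化的往返误差上界约为半个量化步长(scale/2):

```python
import numpy as np

def quantize(w, bits=8):
    """对权重张量做对称训练后量化:返回反量化后的权重与量化步长 scale。"""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q.astype(np.float32) * scale, scale

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=1000).astype(np.float32)  # 玩具权重
w_hat, scale = quantize(w, bits=8)
err = float(np.abs(w - w_hat).max())   # 最大往返误差
```

量化感知训练(QAT)则是在训练阶段就模拟这一往返过程,让小模型学会对该误差保持鲁棒。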

链接: https://arxiv.org/abs/2604.09220
作者: Muhammad Hannan Akhtar,Ihab Amer,Tamer Shanableh
机构: American University of Sharjah (美国沙迦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to “Computers and Electrical Engineering”, Elsevier

点击查看摘要

Abstract:Implicit neural video representations encode entire video sequences within the parameters of a neural network and enable constant-time frame reconstruction. Recent work on Neural Representations for Videos (NeRV) has demonstrated competitive reconstruction performance while avoiding the sequential decoding process of conventional video codecs. However, most existing studies focus on moderate or high capacity models, leaving the behavior of extremely compact configurations required for constrained environments insufficiently explored. This paper presents a systematic study of tiny NeRV architectures designed for efficient deployment. Two lightweight configurations, NeRV-T and NeRV-T+, are introduced and evaluated across multiple video datasets in order to analyze how aggressive capacity reduction affects reconstruction quality, computational complexity, and decoding throughput. Beyond architectural scaling, the work investigates strategies for improving the performance of compact models without increasing inference cost. Knowledge distillation with frequency-aware focal supervision is explored to enhance reconstruction fidelity in low-capacity networks. In addition, the impact of low-precision inference is examined through both post-training quantization and quantization-aware training to study the robustness of tiny models under reduced numerical precision. Experimental results demonstrate that carefully designed tiny NeRV variants can achieve favorable quality-efficiency trade-offs while substantially reducing parameter count, computational cost, and memory requirements. These findings provide insight into the practical limits of compact neural video representations and offer guidance for deploying NeRV-style models in resource-constrained and real-time environments. The official implementation is available at https://github.com/HannanAkhtar/TinyNeRV-Implementation.

[CV-35] SHIFT: Steering Hidden Intermediates in Flow Transformers

【速读】:该论文旨在解决扩散模型(Diffusion Models)中特定概念(如物体、场景或风格)的可控移除与操控问题,尤其在基于DiT(Vision Transformer-based Diffusion Models)架构下实现高保真图像生成时,如何精准干预中间激活状态以避免冗余或不希望出现的视觉内容。解决方案的关键在于提出SHIFT框架,通过在推理阶段对选定层和时间步的中间激活进行动态调整,学习并应用“引导向量”(steering vectors),从而有效抑制目标概念的同时保持提示词其余内容及整体图像质量;该机制还可用于将生成结果导向特定领域或增强/修改指定对象,且无需重新训练模型,具备高效性与灵活性。
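
“在选定层与时间步上施加引导向量”本身非常轻量,可以用几行代码示意(激活张量形状为笔者假设;实际中 steering 向量是学到的,并作用于 DiT 的隐藏状态):

```python
import numpy as np

def steer(activations, steering_vec, alpha=1.0, layers=(2,), steps=(0,)):
    """把引导向量以强度 alpha 加到 (steps, layers, dim) 激活栈的
    选定层与时间步上,其余位置保持不变。"""
    out = activations.copy()
    for t in steps:
        for l in layers:
            out[t, l] += alpha * steering_vec
    return out

acts = np.zeros((4, 6, 8))   # 4 个去噪步、6 层、隐藏维 8 的玩具激活
vec = np.ones(8)             # 某个概念对应的(假设)引导向量
steered = steer(acts, vec, alpha=0.5, layers=(2, 3), steps=(0, 1))
```

概念移除时向量取反方向施加;由于只是推理期加法,无需任何重训练,这是该类方法轻量的根源。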

链接: https://arxiv.org/abs/2604.09213
作者: Nina Konovalova,Andrey Kuznetsov,Aibek Alanov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have become leading approaches for high-fidelity image generation. Recent DiT-based diffusion models, in particular, achieve strong prompt adherence while producing high-quality samples. We propose SHIFT, a simple but effective and lightweight framework for concept removal in DiT diffusion models via targeted manipulation of intermediate activations at inference time, inspired by activation steering in large language models. SHIFT learns steering vectors that are dynamically applied to selected layers and timesteps to suppress unwanted visual concepts while preserving the prompt's remaining content and overall image quality. Beyond suppression, the same mechanism can shift generations into a desired style domain or bias samples toward adding or changing target objects. We demonstrate that SHIFT provides effective and flexible control over DiT generation across diverse prompts and targets without time-consuming retraining.

[CV-36] Adding Another Dimension to Image-based Animal Detection

【速读】:该论文旨在解决单目RGB图像中动物3D检测缺乏标注数据的问题,即现有方法依赖于3D输入流(如深度信息或多视角图像)来构建标签,而实际场景中难以获取此类数据。其解决方案的关键在于提出一个端到端的管道(pipeline),利用Skinned Multi Animal Linear (SMAL)模型估计动物的3D边界框,并通过专用相机位姿优化算法将这些3D边界框投影至2D图像空间,生成鲁棒的标签;同时引入立方体面可见性指标以评估动物各侧面在图像中的捕获情况,从而为未来单目3D动物检测算法提供可量化、可复现的基准与训练标签。
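
把 3D 包围盒投影回 2D 图像生成标签,核心就是针孔相机投影加透视除法。极简示意如下(内参与立方体均为笔者虚构,且未包含论文中的相机位姿优化):

```python
import numpy as np

def project_box(corners_3d, K):
    """把相机坐标系下的 3D 立方体角点 (N, 3) 用针孔内参 K 投影到图像,
    返回 (N, 2) 像素坐标及其外接 2D 框。"""
    pts = corners_3d @ K.T              # (N, 3) 齐次像素坐标
    uv = pts[:, :2] / pts[:, 2:3]       # 透视除法
    x0, y0 = uv.min(axis=0)
    x1, y1 = uv.max(axis=0)
    return uv, (x0, y0, x1, y1)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
# 位于相机正前方 5 m 处的单位立方体
c = np.array([[x, y, z] for x in (-0.5, 0.5)
                        for y in (-0.5, 0.5)
                        for z in (4.5, 5.5)])
uv, bbox = project_box(c, K)
```

论文在此基础上还计算立方体各面的可见性指标,用以标注动物的哪些侧面被图像捕获。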

链接: https://arxiv.org/abs/2604.09210
作者: Vandita Shukla,Fabio Remondino,Benjamin Risse
机构: Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会); University of Muenster(明斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CV4Animals Workshop 2025

点击查看摘要

Abstract:Monocular imaging of animals inherently reduces 3D structures to 2D projections. Detection algorithms lead to 2D bounding boxes that lack information about animal’s orientation relative to the camera. To build 3D detection methods for RGB animal images, there is a lack of labeled datasets; such labeling processes require 3D input streams along with RGB data. We present a pipeline that utilises Skinned Multi Animal Linear models to estimate 3D bounding boxes and to project them as robust labels into 2D image space using a dedicated camera pose refinement algorithm. To assess which sides of the animal are captured, cuboid face visibility metrics are computed. These 3D bounding boxes and metrics form a crucial step toward developing and benchmarking future monocular 3D animal detection algorithms. We evaluate our method on the Animal3D dataset, demonstrating accurate performance across species and settings.

[CV-37] Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception CVPR2026

【速读】:该论文旨在解决长距离车辆协同三维感知(Cooperative 3D Perception)中两个关键瓶颈问题:一是密集鸟瞰图(Bird’s-Eye-View, BEV)表示带来的二次计算复杂度,二是特征关联机制在观测误差和对齐误差显著时的脆弱性。解决方案的核心在于提出一种完全稀疏的框架 Long-SCOPE,其关键创新包括:1)几何引导查询生成模块(Geometry-guided Query Generation),用于精准检测远距离小目标;2)可学习上下文感知关联模块(Context-Aware Association),能够在严重位置噪声下仍实现鲁棒的协作查询匹配。该方法在 V2X-Seq 和 Griffin 数据集上验证了在 100–150 米长距场景下的最先进性能,同时保持较低的计算与通信开销。
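
“上下文感知关联”的关键是让特征相似度主导匹配、降低位置项的权重,从而在显著定位噪声下仍能正确配对。以下是笔者写的贪心匹配示意(代价函数形式为假设,论文中的关联模块是可学习的):

```python
import numpy as np

def associate(ego_q, coop_q, pos_w=0.3, thr=1.0):
    """跨智能体查询贪心关联:代价 = 特征距离 + 低权重的位置距离,
    使大的定位误差不至于破坏匹配。查询格式为 (位置, 特征)。"""
    pairs, used = [], set()
    for i, (pe, fe) in enumerate(ego_q):
        best, bj = float("inf"), -1
        for j, (pc, fc) in enumerate(coop_q):
            if j in used:
                continue
            cost = (np.linalg.norm(np.array(fe) - np.array(fc))
                    + pos_w * np.linalg.norm(np.array(pe) - np.array(pc)))
            if cost < best:
                best, bj = cost, j
        if bj >= 0 and best < thr:
            used.add(bj)
            pairs.append((i, bj))
    return pairs

ego = [((0.0, 0.0), (1.0, 0.0)), ((10.0, 0.0), (0.0, 1.0))]
coop = [((1.5, 0.0), (1.0, 0.1)), ((11.0, 0.5), (0.1, 1.0))]  # 位置含噪
matches = associate(ego, coop)
```

尽管协作端位置偏移达 1–1.5 m,特征项仍把两对查询正确关联起来。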

链接: https://arxiv.org/abs/2604.09206
作者: Jiahao Wang,Zikun Xu,Yuner Zhang,Zhongwei Jiang,Chenyang Lu,Shuocheng Yang,Yuxuan Wang,Jiaru Zhong,Chuang Zhang,Shaobing Xu,Jianqiang Wang
机构: Tsinghua University (清华大学); University of Pennsylvania (宾夕法尼亚大学); Nanyang Technological University (南洋理工大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Cooperative 3D perception via Vehicle-to-Everything communication is a promising paradigm for enhancing autonomous driving, offering extended sensing horizons and occlusion resolution. However, the practical deployment of existing methods is hindered at long distances by two critical bottlenecks: the quadratic computational scaling of dense BEV representations and the fragility of feature association mechanisms under significant observation and alignment errors. To overcome these limitations, we introduce Long-SCOPE, a fully sparse framework designed for robust long-distance cooperative 3D perception. Our method features two novel components: a Geometry-guided Query Generation module to accurately detect small, distant objects, and a learnable Context-Aware Association module that robustly matches cooperative queries despite severe positional noise. Experiments on the V2X-Seq and Griffin datasets validate that Long-SCOPE achieves state-of-the-art performance, particularly in challenging 100-150 m long-range settings, while maintaining highly competitive computation and communication costs.

[CV-38] CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

【速读】:该论文旨在解决现有视频生成方法在相机控制方面的局限性,即要么依赖文本提示进行粗粒度的相机控制,要么需要人工指定相机轨迹参数,难以实现自动化且物理上不合理的运动。其解决方案的关键在于提出一种名为CT-1(Camera Transformer 1)的视觉-语言-相机模型,该模型通过融合视觉-语言模块与扩散Transformer架构,并引入基于小波变换的频域正则化损失函数(Wavelet-based Regularization Loss),精准估计复杂相机轨迹分布,从而将空间推理知识迁移至视频生成过程中,实现与用户意图对齐的空间感知相机控制。
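
“小波域正则化损失”可以用单层 Haar 变换来示意:对预测与真值的相机轨迹信号分别做 Haar 分解,再比较系数。以下为笔者的玩具实现(论文所用的具体小波基与损失形式未知):

```python
import numpy as np

def haar_1d(x):
    """偶数长度一维信号的单层 Haar 变换:返回近似系数与细节系数。"""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # 低频(近似)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # 高频(细节)
    return a, d

def wavelet_loss(pred, target):
    """玩具频域正则:预测与真值轨迹的 Haar 系数之间的 L1 距离。"""
    pa, pd = haar_1d(pred)
    ta, td = haar_1d(target)
    return float(np.abs(pa - ta).mean() + np.abs(pd - td).mean())

t = np.linspace(0, 1, 8)
traj = np.sin(2 * np.pi * t)                 # 真值相机参数轨迹
loss_same = wavelet_loss(traj, traj)
loss_diff = wavelet_loss(traj + 0.1, traj)   # 整体平移的预测轨迹
```

常数偏移只进入低频(近似)系数而不影响高频细节,体现了频域损失能分别约束轨迹的整体位置与局部抖动。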

链接: https://arxiv.org/abs/2604.09201
作者: Haoyu Zhao,Zihao Zhang,Jiaxi Gu,Haoran Chen,Qingping Zheng,Pin Tang,Yeyin Jin,Yuang Zhang,Junqi Cheng,Zenghui Lu,Peng Shu,Zuxuan Wu,Yu-Gang Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.

[CV-39] Globally Optimal Pose from Orthographic Silhouettes

【速读】: This paper tackles globally optimal pose estimation for objects of known 3D shape from their unoccluded silhouettes, without relying on point correspondences and for arbitrary shapes regardless of convexity or genus. The key is to exploit an effective but under-explored property, the continuity of silhouette area over the rotation space, and to build a precomputed silhouette-signature response surface that guides the search: querying this surface sharply narrows the candidate range in rotation space, enabling efficient resolution-guided search. In addition, the aspect ratio of a 2D ellipse fitted to the projected silhouette serves as an auxiliary global shape signature to further accelerate the pose search. This is the first method to efficiently solve globally optimal pose estimation for arbitrary shapes from silhouettes alone.

链接: https://arxiv.org/abs/2604.09199
作者: Agniva Sengupta,Dilara Kuş,Jianning Li,Stefan Zachow
机构: Freie Universität Berlin (柏林自由大学); Zuse Institute Berlin (祖塞研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:We solve the problem of determining the pose of known shapes in R^3 from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t. trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by correspondences, for any shape, irrespective of its convexity and genus. We validate our method on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches. Code, data, and supplementary in: this https URL. Journal reference: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, Denver, Colorado.
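The signature-lookup idea can be illustrated with a one-parameter toy: precompute the silhouette area over a coarse rotation grid, keep only rotations whose precomputed signature matches the observed area, then refine locally. The "shape" (a thin plate whose projected area follows |cos|), the tolerances, and the grids below are all illustrative assumptions, not the paper's response-surface construction.

```python
import numpy as np

def silhouette_area(theta, a=2.0, b=1.0):
    """Orthographic silhouette area of an a x b thin plate rotated by theta
    about an in-plane axis: the face projects with an |cos| factor
    (a toy stand-in for querying a precomputed signature of a mesh)."""
    return a * b * abs(np.cos(theta))

def pose_candidates(observed_area, grid, tol=0.05):
    """Branching step: keep only grid rotations whose signature matches the
    observed area, then refine each survivor on a finer local grid."""
    coarse = grid[np.abs(silhouette_area(grid) - observed_area) < tol]
    refined = []
    for t in coarse:
        fine = t + np.linspace(-0.05, 0.05, 21)
        refined.append(fine[np.argmin(np.abs(silhouette_area(fine) - observed_area))])
    return np.array(refined)

grid = np.linspace(0.0, np.pi, 315)        # coarse rotation grid
cands = pose_candidates(silhouette_area(0.7), grid)
```

Note the surviving candidates cluster around both 0.7 and its |cos|-symmetric twin near pi - 0.7: area alone is ambiguous, which is exactly why the paper adds an auxiliary signature (the fitted-ellipse aspect ratio) to prune further.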

[CV-40] Vision Transformers for Preoperative CT-Based Prediction of Histopathologic Chemotherapy Response Score in High-Grade Serous Ovarian Carcinoma

【速读】: This paper addresses the problem of predicting the histopathologic Chemotherapy Response Score (CRS) preoperatively and non-invasively in patients with high-grade serous ovarian carcinoma (HGSOC) treated with neoadjuvant chemotherapy (NACT). Because CRS is currently available only after surgery, multidisciplinary teams (MDTs) cannot use it to inform preoperative treatment decisions. The key to the solution is a 2.5D multimodal deep learning framework built on a pretrained Vision Transformer encoder that fuses CT imaging features from lesion-dense omental slices with clinical variables to predict CRS before surgery, giving MDTs an early, non-invasive estimate of treatment response.

链接: https://arxiv.org/abs/2604.09197
作者: Francesca Fati,Felipe Coutinho,Marika Reinius,Marina Rosanu,Gabriel Funingana,Luigi De Vitis,Gabriella Schivardi,Hannah Clayton,Alice Traversa,Zeyu Gao,Guilherme Penteado,Shangqi Gao,Francesco Pastori,Ramona Woitek,Maria Cristina Ghioni,Giovanni Damiano Aletti,Mercedes Jimenez-Linan,Sarah Burge,Nicoletta Colombo,Evis Sala,Maria Francesca Spadea,Timothy L. Kline,James D. Brenton,Jaime Cardoso,Francesco Multinu,Elena De Momi,Mireia Crispin-Ortuzar,Ines P. Machado
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

Abstract:Purpose. High-grade serous ovarian carcinoma (HGSOC) is characterized by pronounced biological and spatial heterogeneity and is frequently diagnosed at an advanced stage. Neoadjuvant chemotherapy (NACT) followed by delayed primary surgery is commonly employed in patients unsuitable for primary cytoreduction. The Chemotherapy Response Score (CRS) is a validated histopathological biomarker of response to NACT, but it is only available postoperatively. In this study, we investigate whether pre-treatment computed tomography (CT) imaging and clinical data can be used to predict CRS as an investigational decision-support adjunct to inform multidisciplinary team (MDT) discussions regarding expected treatment response. Methods. We proposed a 2.5D multimodal deep learning framework that processes lesion-dense omental slices using a pre-trained Vision Transformer encoder and integrates the resulting visual representations with clinical variables through an intermediate fusion module to predict CRS. Results. Our multimodal model, integrating imaging and clinical data, achieved a ROC-AUC of 0.95 alongside 95% accuracy and 80% precision on the internal test cohort (IEO, n=41 patients). On the external test set (OV04, n=70 patients), it achieved a ROC-AUC of 0.68, alongside 67% accuracy and 75% precision. Conclusion. These preliminary results demonstrate the feasibility of transformer-based deep learning for preoperative prediction of CRS in HGSOC using routine clinical data and CT imaging. As an investigational, pre-treatment decision-support tool, this approach may assist MDT discussions by providing early, non-invasive estimates of treatment response.
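The intermediate fusion module is not specified in detail in the abstract. A minimal numpy sketch of the general pattern, project each modality, concatenate, and classify, might look as follows; the dimensions, projections, and every name here are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    h = np.maximum(x @ w1 + b1, 0.0)      # ReLU hidden layer
    return h @ w2 + b2

def fuse_and_predict(img_emb, clinical, params):
    """Intermediate fusion: project each modality into a shared width,
    concatenate, and classify into CRS logits."""
    zi = img_emb @ params["Wi"]           # visual projection (e.g. ViT CLS embedding)
    zc = clinical @ params["Wc"]          # clinical-variable projection
    z = np.concatenate([zi, zc], axis=-1)
    return mlp(z, *params["head"])

D_IMG, D_CLIN, D_PROJ, N_CLS = 768, 8, 32, 2
params = {
    "Wi": rng.normal(size=(D_IMG, D_PROJ)) * 0.02,
    "Wc": rng.normal(size=(D_CLIN, D_PROJ)) * 0.02,
    "head": (rng.normal(size=(2 * D_PROJ, 16)) * 0.1, np.zeros(16),
             rng.normal(size=(16, N_CLS)) * 0.1, np.zeros(N_CLS)),
}
logits = fuse_and_predict(rng.normal(size=(4, D_IMG)), rng.normal(size=(4, D_CLIN)), params)
print(logits.shape)  # (4, 2)
```

Fusing at the feature level (rather than averaging separate per-modality predictions) lets the head learn interactions between imaging and clinical variables, which is the usual motivation for "intermediate" fusion.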

[CV-41] MixFlow: Mixed Source Distributions Improve Rectified Flows

【速读】: This paper targets the slow sampling of diffusion models and their variants (such as rectified flows) in image generation, caused by the highly curved generative paths they learn; an important source of this curvature is the independence between the source distribution (a standard Gaussian) and the data distribution. The authors make two complementary contributions: first, κ-FC, a general source-conditioning method that conditions the source distribution on an arbitrary signal κ to align it better with the data; second, MixFlow, a training strategy that trains the flow model on linear mixtures of a fixed unconditional distribution and a κ-FC distribution, reducing generative path curvature and improving sampling efficiency. The key is that the mixture improves source-data alignment, cutting the required sampling steps and accelerating training convergence, with average FID improvements of 12% over standard rectified flow and 7% over prior baselines under a fixed sampling budget.

链接: https://arxiv.org/abs/2604.09181
作者: Nazir Nayal,Christopher Wewer,Jan Eric Lenssen
机构: Max Planck Institute for Informatics (马克斯普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

Abstract:Diffusion models and their variations, such as rectified flows, generate diverse and high-quality images, but they are still hindered by slow iterative sampling caused by the highly curved generative paths they learn. An important cause of high curvature, as shown by previous work, is independence between the source distribution (standard Gaussian) and the data distribution. In this work, we tackle this limitation by two complementary contributions. First, we attempt to break away from the standard Gaussian assumption by introducing κ-FC, a general formulation that conditions the source distribution on an arbitrary signal κ that aligns it better with the data distribution. Then, we present MixFlow, a simple but effective training strategy that reduces the generative path curvatures and considerably improves sampling efficiency. MixFlow trains a flow model on linear mixtures of a fixed unconditional distribution and a κ-FC-based distribution. This simple mixture improves the alignment between the source and data, provides better generation quality with fewer required sampling steps, and accelerates the training convergence considerably. On average, our training procedure improves the generation quality by 12% in FID compared to standard rectified flow and 7% compared to previous baselines under a fixed sampling budget. Code available at: this https URL
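The mixture construction can be sketched in a few lines. Below is a toy numpy version that draws the source x0 from a linear mixture of an unconditional Gaussian and a κ-conditioned source, then forms the standard rectified-flow interpolation and velocity target. The concrete form of the κ-FC source here (a Gaussian centered on the conditioning signal) and the mixing weight are illustrative assumptions, not the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixflow_source(kappa_sample, alpha, shape):
    """Draw x0 from a linear mixture of an unconditional Gaussian and a
    kappa-conditioned source (here: a narrow Gaussian around kappa)."""
    uncond = rng.normal(size=shape)                 # standard Gaussian source
    cond = kappa_sample + 0.1 * rng.normal(size=shape)
    return (1.0 - alpha) * uncond + alpha * cond

def rectified_flow_pair(x0, x1, t):
    """Linear interpolation path and its constant target velocity,
    the regression target for the flow network."""
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

x1 = rng.normal(size=(4, 8))                        # a "data" batch
x0 = sample_mixflow_source(0.5 * x1, alpha=0.7, shape=x1.shape)
xt, v = rectified_flow_pair(x0, x1, t=0.25)
```

Because the conditional component places x0 closer to x1, the straight-line target x1 - x0 is shorter on average, which is the intuition for the reduced path curvature and faster sampling claimed in the abstract.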

[CV-42] UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation CVPR2026

【速读】: This paper addresses two challenges in semi-supervised semantic segmentation for computational pathology: the scarcity of pixel-level annotations and unreliable pseudo-label supervision. The key is UniSemAlign, a dual-modal semantic alignment framework that introduces complementary prototype-level and text-level alignment branches in a shared embedding space, injecting explicit class-level structure into pixel-wise learning. This structured guidance reduces class ambiguity and stabilizes iterative pseudo-label refinement; fusing the aligned representations with visual predictions yields more reliable supervision for unlabeled histopathology images.

链接: https://arxiv.org/abs/2604.09169
作者: Le-Van Thai,Tien Dat Nguyen,Hoai Nhan Pham,Lan Anh Dinh Thi,Duy-Dong Nguyen,Ngoc Lam Quang Bui
机构: AI VIETNAM Lab (AI VIETNAM 实验室); Hanoi University of Science and Technology (河内科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 Workshop. 11 pages, 5 figures, 4 tables

Abstract:Semi-supervised semantic segmentation in computational pathology remains challenging due to scarce pixel-level annotations and unreliable pseudo-label supervision. We propose UniSemAlign, a dual-modal semantic alignment framework that enhances visual segmentation by injecting explicit class-level structure into pixel-wise learning. Built upon a pathology-pretrained Transformer encoder, UniSemAlign introduces complementary prototype-level and text-level alignment branches in a shared embedding space, providing structured guidance that reduces class ambiguity and stabilizes pseudo-label refinement. The aligned representations are fused with visual predictions to generate more reliable supervision for unlabeled histopathology images. The framework is trained end-to-end with supervised segmentation, cross-view consistency, and cross-modal alignment objectives. Extensive experiments on the GlaS and CRAG datasets demonstrate that UniSemAlign substantially outperforms recent semi-supervised baselines under limited supervision, achieving Dice improvements of up to 2.6% on GlaS and 8.6% on CRAG with only 10% labeled data, and strong improvements at 20% supervision. Code is available at: this https URL

[CV-43] ELT: Elastic Looped Transformers for Visual Generation

【速读】: This paper addresses the trade-off between parameter efficiency and generation quality in visual generative models, i.e., how to drastically reduce parameter counts while maintaining high synthesis quality. Conventional generative models rely on deep stacks of unique Transformer layers, causing parameter redundancy and heavy compute. The key is Elastic Looped Transformers (ELT), which iterate weight-shared Transformer blocks to sharply cut parameters, combined with Intra-Loop Self Distillation (ILSD), which within a single training run distills student configurations at different depths (loop counts) from the teacher configuration, ensuring consistent, high-quality outputs at any inference step count. The approach enables "any-time" inference with a dynamic trade-off between computational cost and generation quality at the same parameter count.

链接: https://arxiv.org/abs/2604.09168
作者: Sahil Goyal,Swayam Agrawal,Gautham Govind Anil,Prateek Jain,Sujoy Paul,Aditya Kusupati
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With 4× reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of 2.0 on class-conditional ImageNet 256×256 and FVD of 72.8 on class-conditional UCF-101.
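The loop-and-distill scheme can be made concrete with a tiny numpy model: one weight-shared residual block is applied repeatedly, and every intermediate ("student") output is pulled toward the deepest ("teacher") output. The block form, loop count, and loss weighting below are assumptions for illustration, not ELT's actual training objective.

```python
import numpy as np

def block(x, W):
    """One weight-shared residual block (the same W is reused at every loop)."""
    return x + np.tanh(x @ W)

def run_loops(x, W, n_loops):
    """Apply the shared block n_loops times, keeping every intermediate output."""
    outs = []
    for _ in range(n_loops):
        x = block(x, W)
        outs.append(x)
    return outs

def ilsd_loss(outs):
    """Intra-Loop Self Distillation sketch: pull every intermediate (student)
    output toward the final (teacher) output of the deepest loop."""
    teacher = outs[-1]
    return sum(np.mean((o - teacher) ** 2) for o in outs[:-1]) / max(len(outs) - 1, 1)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) * 0.05
x = rng.normal(size=(2, 16))
outs = run_loops(x, W, n_loops=4)
loss = ilsd_loss(outs)
```

Once intermediate outputs are trained to agree with the deepest one, the loop count becomes an inference-time knob: stopping early trades quality for compute without changing parameters, which is the "elastic" property the abstract describes.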

[CV-44] Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection ICME2026

【速读】: This paper addresses spatio-temporal feature redundancy and degraded global-dependency modeling on long video sequences, which limit the scalability of video understanding models. The key is the Efficient Spatial-Temporal Focal (ESTF) Adapter, which integrates the authors' Temporal Boundary-aware State Space Model (TB-SSM) for stronger temporal feature modeling with efficient spatial feature processing, markedly improving action localization accuracy and robustness while remaining computationally efficient.

链接: https://arxiv.org/abs/2604.09164
作者: Yicheng Qiu,Keiji Yanai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICME2026

Abstract:Temporal human action detection aims to identify and localize action segments within untrimmed videos, serving as a pivotal task in video understanding. Despite the progress achieved by prior architectures like CNN and Transformer models, these continue to struggle with feature redundancy and degraded global dependency modeling capabilities when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative with linear long-term modeling and robust global temporal reasoning capabilities. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module integrates the advantages of our proposed Temporal Boundary-aware SSM(TB-SSM) for temporal feature modeling with efficient processing of spatial features. We perform comprehensive and quantitative analyses across multiple benchmarks, comparing our proposed method against previous SSM-based and other structural methods. Extensive experiments demonstrate that our improved strategy significantly enhances both localization performance and robustness, validating the effectiveness of our proposed method.

[CV-45] Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery

【速读】: This paper addresses accurate multi-class semantic segmentation of surgical instruments in robotic-assisted surgery, a prerequisite for context-aware computer-assisted interventions such as tool tracking, workflow analysis, and autonomous decision-making. The key is a systematic comparison of five deep learning architectures, including UNet, DeepLabV3, Attention UNet, and SegFormer, on the SAR-RARP50 dataset, trained with a compound loss combining Cross Entropy and Dice loss to mitigate class imbalance and capture fine boundaries. Experiments show that DeepLabV3, using atrous convolution and multi-scale context aggregation, performs close to the Transformer-based SegFormer, whose global context modeling further improves generalization across instrument appearances and surgical conditions, providing key trade-offs for model selection in surgical AI applications.

链接: https://arxiv.org/abs/2604.09151
作者: Sara Ameli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Pattern Formation and Solitons (nlin.PS)
备注:

Abstract:Accurate segmentation of surgical instruments in robotic-assisted surgery is critical for enabling context-aware computer-assisted interventions, such as tool tracking, workflow analysis, and autonomous decision-making. In this study, we benchmark five deep learning architectures, including UNet, DeepLabV3, Attention UNet, and SegFormer, on the SAR-RARP50 dataset for multi-class semantic segmentation of surgical instruments in real-world radical prostatectomy videos. The models are trained with a compound loss function combining Cross Entropy and Dice loss to address class imbalance and capture fine object boundaries. Our experiments reveal that while convolutional models such as UNet and Attention UNet provide strong baseline performance, DeepLabV3 achieves results comparable to SegFormer, demonstrating the effectiveness of atrous convolution and multi-scale context aggregation in capturing complex surgical scenes. Transformer-based architectures like SegFormer further enhance global contextual understanding, leading to improved generalization across varying instrument appearances and surgical conditions. This work provides a comprehensive comparison and practical insights for selecting segmentation models in surgical AI applications, highlighting the trade-offs between convolutional and transformer-based approaches.
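The compound objective is standard, so a self-contained numpy version of Cross-Entropy plus soft Dice over flattened pixels is easy to sketch; the loss weights and epsilon below are conventional choices, not values reported by the paper.

```python
import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compound_loss(logits, labels, n_cls, w_ce=1.0, w_dice=1.0, eps=1e-6):
    """Cross-Entropy + soft Dice loss over flattened pixels.
    logits: (N, C); labels: (N,) integer class ids."""
    probs = softmax(logits)
    onehot = np.eye(n_cls)[labels]
    # per-pixel cross entropy (handles class-wise confidence)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))
    # soft Dice (handles class imbalance and region overlap)
    inter = (probs * onehot).sum(axis=0)
    dice = 1.0 - np.mean((2 * inter + eps) /
                         (probs.sum(axis=0) + onehot.sum(axis=0) + eps))
    return w_ce * ce + w_dice * dice

labels = np.array([0, 1, 1])
perfect_logits = 20.0 * np.eye(2)[labels]   # confidently correct prediction
loss_perfect = compound_loss(perfect_logits, labels, n_cls=2)
```

Cross entropy alone can be dominated by the background class in surgical scenes; the Dice term normalizes by per-class mass, which is why the combination is a common remedy for class imbalance and thin instrument boundaries.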

[CV-46] Deep Light Pollution Removal in Night Cityscape Photographs

【速读】: This paper aims to remove the severe degradation that light pollution from artificial urban lighting causes in nighttime photography: eliminating the skyglow produced by ground lighting, as well as halos and glow artifacts around light sources, so as to restore the authentic appearance of the pristine night sky. The key is a physically based degradation model that adds two critical elements to prior nighttime dehazing formulations: (i) the anisotropic spread of directional light sources, and (ii) skyglow caused by invisible surface lights behind skylines. A training strategy that couples a large generative model with synthetic-real data pairing compensates for the scarcity of paired real data and improves generalization.

链接: https://arxiv.org/abs/2604.09145
作者: Hao Wang,Xiaolin Wu,Xi Zhang,Baoqing Sun
机构: Shandong University (山东大学); Southwest Jiaotong University (西南交通大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, supplementary material included

Abstract:Nighttime photography is severely degraded by light pollution induced by pervasive artificial lighting in urban environments. After long-range scattering and spatial diffusion, unwanted artificial light overwhelms natural night luminance, generates skyglow that washes out the view of stars and celestial objects and produces halos and glow artifacts around light sources. Unlike nighttime dehazing, which aims to improve detail legibility through thick air, the objective of light pollution removal is to restore the pristine night appearance by neutralizing the radiative footprint of ground lighting. In this paper we introduce a physically-based degradation model that adds to the previous ones for nighttime dehazing two critical aspects; (i) anisotropic spread of directional light sources, and (ii) skyglow caused by invisible surface lights behind skylines. In addition, we construct a training strategy that leverages large generative model and synthetic-real coupling to compensate for the scarcity of paired real data and enhance generalization. Extensive experiments demonstrate that the proposed formulation and learning framework substantially reduce light pollution artifacts and better recover authentic night imagery than prior nighttime restoration methods.

[CV-47] Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

【速读】: This paper addresses the weak synthetic-to-real (Syn-to-Real) zero-shot generalization of stereo matching, which stems from cross-domain shifts and ill-posed ambiguities in image textures across occluded, textureless, repetitive, and non-Lambertian regions. The key is to introduce surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues, yielding the GREATEN framework: a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable image-context features and fuses normal-driven geometric features to form domain-invariant representations; a Specular-Transparent Augmentation (STA) strategy improves robustness to misleading visual cues in non-Lambertian regions; and sparse attention designs (Sparse Spatial Attention, Sparse Dual-Matching Attention, and Simple Volume Attention) preserve fine-grained global feature extraction while substantially reducing computational overhead, enabling efficient and accurate cross-domain stereo matching.

链接: https://arxiv.org/abs/2604.09142
作者: Jiahao Li,Xinhong Chen,Zhengmin Jiang,Cheng Huang,Yung-Hui Li,Jianping Wang
机构: City University of Hong Kong (香港城市大学); Southern Methodist University (南卫理公会大学); Hon Hai Research Institute (鸿海研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
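The abstract does not give the gate's exact form. A common gating pattern consistent with the description, a sigmoid gate computed from the concatenated features that interpolates per channel between image context and normal-driven geometry, can be sketched as follows; all parameter names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_img, f_geo, Wg, bg):
    """Per-channel gate decides how much image context to keep; the
    remainder comes from the normal-driven geometric features."""
    g = sigmoid(np.concatenate([f_img, f_geo], axis=-1) @ Wg + bg)
    return g * f_img + (1.0 - g) * f_geo

rng = np.random.default_rng(0)
C = 8
Wg = rng.normal(size=(2 * C, C)) * 0.1
bg = np.zeros(C)
f_img = rng.normal(size=(5, C))   # contextual (texture) features
f_geo = rng.normal(size=(5, C))   # normal-driven geometric features
fused = gated_fusion(f_img, f_geo, Wg, bg)
```

Because the gate lies in (0, 1), the fused value is a per-channel convex combination of the two inputs: in textureless or specular regions the network can learn to drive the gate toward the geometric branch, which matches the "suppress unreliable contextual cues" behavior described above.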

[CV-48] Strips as Tokens: Artist Mesh Generation with Native UV Segmentation

【速读】: This paper addresses the failure of token-ordering strategies in existing autoregressive Transformer mesh generation to meet professional artist standards: coordinate-based sorting yields inefficiently long sequences, while patch-based heuristics disrupt the continuous edge flow and structural regularity needed for high-quality modeling. The key is the Strips as Tokens (SATO) framework, inspired by triangle strips, which constructs the token sequence as a connected chain of faces that explicitly encodes UV boundaries, naturally preserving the organized edge flow and semantic layout of artist-created meshes. Moreover, its unified representation lets the same token sequence be decoded into either triangle or quadrilateral meshes, enabling joint training: large-scale triangle data provides fundamental structural priors, while high-quality quad data enhances the geometric regularity of outputs, yielding clear gains in geometric quality, structural coherence, and UV segmentation.

链接: https://arxiv.org/abs/2604.09132
作者: Rui Xu,Dafei Qin,Kaichun Qiao,Qiujie Dong,Huaijin Pi,Qixuan Zhang,Longwen Zhang,Lan Xu,Jingyi Yu,Wenping Wang,Taku Komura
机构: The University of Hong Kong (香港大学); Deemos Technology Co., Ltd. (德模科技有限公司); ShanghaiTech University (上海科技大学); Shandong University (山东大学); Texas AM University (德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Graphics (cs.GR)
备注:

Abstract:Recent advancements in autoregressive transformers have demonstrated remarkable potential for generating artist-quality meshes. However, the token ordering strategies employed by existing methods typically fail to meet professional artist standards, where coordinate-based sorting yields inefficiently long sequences, and patch-based heuristics disrupt the continuous edge flow and structural regularity essential for high-quality modeling. To address these limitations, we propose Strips as Tokens (SATO), a novel framework with a token ordering strategy inspired by triangle strips. By constructing the sequence as a connected chain of faces that explicitly encodes UV boundaries, our method naturally preserves the organized edge flow and semantic layout characteristic of artist-created meshes. A key advantage of this formulation is its unified representation, enabling the same token sequence to be decoded into either a triangle or quadrilateral mesh. This flexibility facilitates joint training on both data types: large-scale triangle data provides fundamental structural priors, while high-quality quad data enhances the geometric regularity of the outputs. Extensive experiments demonstrate that SATO consistently outperforms prior methods in terms of geometric quality, structural coherence, and UV segmentation.
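The strip-style ordering can be illustrated with a toy greedy chain over faces that share an edge. This is a simplification for intuition only, not the SATO tokenizer (which additionally encodes UV boundaries and supports quad decoding).

```python
def build_strip(faces):
    """Greedy chain of faces in which consecutive faces share an edge.
    faces: list of vertex-index triples. Returns an ordered face sequence."""
    def edges(f):
        return {frozenset((f[i], f[(i + 1) % 3])) for i in range(3)}

    remaining = list(range(len(faces)))
    strip = [remaining.pop(0)]
    while remaining:
        last = edges(faces[strip[-1]])
        # prefer a face that continues the current strip
        nxt = next((i for i in remaining if edges(faces[i]) & last), None)
        if nxt is None:          # no connected face left: start a new segment
            nxt = remaining[0]
        remaining.remove(nxt)
        strip.append(nxt)
    return strip

# two triangles sharing edge (1,2), plus one isolated triangle
faces = [(0, 1, 2), (1, 2, 3), (4, 5, 6)]
print(build_strip(faces))  # [0, 1, 2]
```

Chaining faces by shared edges is what makes strip sequences short (each new face reuses vertices of the previous one) and is why such an ordering preserves edge flow, the property the abstract highlights over coordinate- or patch-based orderings.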

[CV-49] FaceLiVTv2: An Improved Hybrid Architecture for Efficient Mobile Face Recognition

【速读】: This paper addresses the challenge of deploying lightweight face recognition models on edge and mobile devices, where high accuracy must be achieved under strict latency, memory, and energy constraints; existing hybrid CNN-Transformer architectures improve global context modeling but struggle to balance recognition performance against computational efficiency. The key is the FaceLiVTv2 architecture, centered on the Lite MHLA module, which replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving per-head representational diversity. A unified RepMix block coordinates local and global feature interactions, and global depthwise convolution performs adaptive spatial aggregation in the embedding stage, cutting mobile inference latency by 22% relative to FaceLiVTv1 and achieving up to 30.8% speedups over GhostFaceNets while maintaining higher recognition accuracy across benchmarks.

链接: https://arxiv.org/abs/2604.09127
作者: Novendra Setyawan,Chi-Chia Sun,Mao-Hsiu Hsu,Wen-Kai Kuo,Jun-Wei Hsieh
机构: National Formosa University (国立高雄大学); National Taipei University (国立台北大学); National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Lightweight face recognition is increasingly important for deployment on edge and mobile devices, where strict constraints on latency, memory, and energy consumption must be met alongside reliable accuracy. Although recent hybrid CNN-Transformer architectures have advanced global context modeling, striking an effective balance between recognition performance and computational efficiency remains an open challenge. In this work, we present FaceLiVTv2, an improved version of our FaceLiVT hybrid architecture designed for efficient global–local feature interaction in mobile face recognition. At its core is Lite MHLA, a lightweight global token interaction module that replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving representational diversity across heads. We further integrate Lite MHLA into a unified RepMix block that coordinates local and global feature interactions and adopts global depthwise convolution for adaptive spatial aggregation in the embedding stage. Under our experimental setup, results on LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB show that FaceLiVTv2 consistently improves the accuracy-efficiency trade-off over existing lightweight methods. Notably, FaceLiVTv2 reduces mobile inference latency by 22% relative to FaceLiVTv1, achieves speedups of up to 30.8% over GhostFaceNets on mobile devices, and delivers 20-41% latency improvements over EdgeFace and KANFace across platforms while maintaining higher recognition accuracy. These results demonstrate that FaceLiVTv2 offers a practical and deployable solution for real-time face recognition. Code is available at this https URL.
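Multi-head linear attention with per-head rescales, as the description suggests, can be sketched in numpy. The elu+1 feature map and the head layout below are assumptions borrowed from standard linear-attention formulations, not the paper's definition of Lite MHLA.

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """O(N d^2) linear attention with feature map phi(x) = elu(x) + 1:
    out = phi(q) @ (phi(k)^T v) / (phi(q) @ sum_j phi(k_j))."""
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x)+1, strictly positive
    Q, K = phi(q), phi(k)
    kv = K.T @ v                       # (d, dv) global summary, no NxN matrix
    z = Q @ K.sum(axis=0)              # (N,) normalizer
    return (Q @ kv) / (z[:, None] + eps)

def lite_mhla(x, heads, Wq, Wk, Wv, scale):
    """Multi-head linear token interaction with a per-head affine rescale."""
    N, D = x.shape
    dh = D // heads
    out = np.empty_like(x)
    for h in range(heads):
        s = slice(h * dh, (h + 1) * dh)
        out[:, s] = scale[h] * linear_attention(x @ Wq[:, s], x @ Wk[:, s], x @ Wv[:, s])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
y = lite_mhla(x, heads=4, Wq=Wq, Wk=Wk, Wv=Wv, scale=np.ones(4))
```

The key efficiency property is that the K^T V summary is computed once per head, so cost grows linearly in the number of tokens N rather than quadratically, which is what makes this family of modules attractive on mobile hardware.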

[CV-50] Few-Shot Personalized Age Estimation

【速读】: This paper addresses the fact that conventional age estimation ignores individual differences: existing methods treat each face as an independent sample and learn a global appearance-to-age mapping, overlooking identity-dependent ageing rates driven by genetics, lifestyle, and health. The key is personalized age estimation, which exploits several age-labeled reference images of the same person as context to improve prediction accuracy. The paper introduces OpenPAE, the first open benchmark supporting the N-shot personalized setting, and designs a hierarchy of increasingly sophisticated baselines ranging from a linear offset to a conditional attentive neural process; experiments show that personalization consistently outperforms non-personalized estimation, that the gains are not mere domain adaptation, and that nonlinear modeling matters.

链接: https://arxiv.org/abs/2604.09125
作者: Jakub Paplhám,Vojtěch Franc,Artem Moroz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Existing age estimation methods treat each face as an independent sample, learning a global mapping from appearance to age. This ignores a well-documented phenomenon: individuals age at different rates due to genetics, lifestyle, and health, making the mapping from face to age identity-dependent. When reference images of the same person with known ages are available, we can exploit this context to personalize the estimate. The only existing benchmark for this task (NIST FRVT) is closed-source and limited to a single reference image. In this work, we introduce OpenPAE, the first open benchmark for N-shot personalized age estimation with strict evaluation protocols. We establish a hierarchy of increasingly sophisticated baselines: from arithmetic offset, through closed-form Bayesian linear regression, to a conditional attentive neural process. Our experiments show that personalization consistently improves performance, that the gains are not merely domain adaptation, and that nonlinear methods significantly outperform simpler alternatives. We release all models, code, protocols, and evaluation splits.
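The closed-form Bayesian baseline mentioned above can be illustrated in its simplest form: a per-identity constant offset with a Gaussian prior, estimated from the reference images. The prior and noise variances below are made-up values, and the single-offset model is a simplification of the paper's Bayesian linear regression baseline.

```python
import numpy as np

def personalized_offset(base_preds, true_ages, prior_var=4.0, noise_var=9.0):
    """Closed-form posterior mean of a per-identity age offset.
    Prior: offset ~ N(0, prior_var); residuals ~ N(offset, noise_var)."""
    resid = np.asarray(true_ages, float) - np.asarray(base_preds, float)
    n = len(resid)
    post_prec = 1.0 / prior_var + n / noise_var   # posterior precision
    return (resid.sum() / noise_var) / post_prec  # shrunk mean residual

def personalize(base_pred_query, base_preds_ref, true_ages_ref, **kw):
    return base_pred_query + personalized_offset(base_preds_ref, true_ages_ref, **kw)

# a person whose generic model underestimates age by ~3 years (2-shot)
print(round(personalize(40.0, [30.0, 35.0], [33.0, 38.0]), 2))  # 41.41
```

With few references the estimate is shrunk toward zero by the prior (here, an offset of about 1.4 instead of the raw 3-year residual); as N grows the posterior mean approaches the empirical offset, which is the behavior an N-shot benchmark is designed to probe.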

[CV-51] FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval

【速读】: This paper addresses the lack of fine-grained semantic reasoning in existing vision-language models (VLMs) for composed image retrieval (CIR): such models cannot accurately decide which attributes should be preserved and which should change, making retrieval uninterpretable and weak in fine-grained domains such as fashion. The key is the FIRE-CIR model, whose core innovation is question-driven visual reasoning: the model automatically generates attribute-focused visual questions from the modification text and verifies the corresponding visual evidence in both reference and candidate images, enabling interpretable attribute-level reasoning. A large-scale fashion visual question answering dataset is constructed to train this reasoning system, and at retrieval time the explicit reasoning re-ranks candidates, filtering out images inconsistent with the intended modifications and significantly improving retrieval accuracy and interpretability.

链接: https://arxiv.org/abs/2604.09114
作者: François Gardères,Camille-Sovanneary Gauthier,Jean Ponce,Shizhe Chen
机构: Louis Vuitton; Inria, École normale supérieure, CNRS, PSL Research University; Courant Institute of Mathematical Sciences and Center for Data Science, New York University
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

Abstract:Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion. In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images. To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications. Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.

[CV-52] Detecting Diffusion-generated Images via Dynamic Assembly Forests

【速读】: This paper addresses the security threat posed by diffusion-generated images, namely how to detect such images efficiently and accurately. Existing methods rely mainly on deep neural networks (CNNs and Transformers), which are parameter-heavy, computationally costly, and difficult to deploy in resource-constrained settings. The key is the proposed Dynamic Assembly Forest (DAF) model, which builds on the deep forest paradigm and improves its feature learning and scalable training, achieving performance comparable to or better than mainstream DNN-based detectors with far fewer parameters and much lower compute, and notably without requiring GPUs.

链接: https://arxiv.org/abs/2604.09106
作者: Mengxin Fu,Yuezun Li
机构: Ocean University of China (中国海洋大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

Abstract:Diffusion models are known for generating high-quality images, causing serious security concerns. To combat this, most efforts rely on deep neural networks (e.g., CNNs and Transformers), while largely overlooking the potential of traditional machine learning models. In this paper, we freshly investigate such alternatives and proposes a novel Dynamic Assembly Forest model (DAF) to detect diffusion-generated images. Built upon the deep forest paradigm, DAF addresses the inherent limitations in feature learning and scalable training, making it an effective diffusion-generated image detector. Compared to existing DNN-based methods, DAF has significantly fewer parameters, much lower computational cost, and can be deployed without GPUs, while achieving competitive performance under standard evaluation protocols. These results highlight the strong potential of the proposed method as a practical substitute for heavyweight DNN models in resource-constrained scenarios. Our code and models are available at this https URL.

[CV-53] CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion CVPR

【速读】: This paper addresses the backdoor risk of prompt-tuned vision-language models (VLMs) in the Machine Learning as a Service (MLaaS) setting: a malicious provider can implant a stealthy backdoor without modifying the encoders, forcing triggered inputs to be misclassified, and such backdoors evade existing encoder-level detection and data-sanitization methods. The key is CLIP-Inspector (CI), a model-level verification method: given white-box access to the delivered model and a small pool of unlabeled out-of-distribution (OOD) images, CI reconstructs a candidate trigger for each class and uses it to decide whether the model exhibits backdoor behavior; furthermore, fine-tuning on correctly labeled triggered samples built with the reconstructed trigger repairs the model and restores its behavior. Experiments across ten datasets show that CI achieves 94% detection accuracy (47/50 models) within a single epoch using only 1,000 OOD images, markedly outperforming baselines (AUROC improved to 0.973).

链接: https://arxiv.org/abs/2604.09101
作者: Akshit Jindal,Saket Anand,Chetan Arora,Vikram Goyal
机构: IIIT Delhi (印度国际技术研究所); IIT Delhi (印度理工学院德里分校)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 17 pages (8 main + 2 references + 7 supplementary), Accepted to CVPR Findings 2026

Abstract:Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk where a malicious provider can follow the prompt-tuning protocol yet implant a backdoor, forcing triggered inputs to be classified into an attacker-chosen class, even for out-of-distribution (OOD) data. Such backdoors leave encoders untouched, making them undetectable to existing methods that focus on encoder corruption. Other data-level methods that sanitize data before training or during inference, also fail to answer the critical question, “Is the delivered model backdoored or not?” To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine if the model exhibits backdoor behaviour or not. Additionally, we demonstrate that using CI’s reconstructed trigger for fine-tuning on correctly labeled triggered inputs enables us to re-align the model and reduce backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving a 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC score (0.973 vs 0.495/0.687), thus enabling the vetting and post-hoc repair of prompt-tuned CLIP models to ensure safe deployment.
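Trigger inversion can be illustrated on a toy linear classifier: optimize an additive perturbation so that perturbed inputs land in the target class, with a small L1 term encouraging a compact trigger. This is a Neural-Cleanse-style simplification with made-up hyperparameters, not CI's actual reconstruction procedure for prompt-tuned CLIP.

```python
import numpy as np

def invert_trigger(clf_w, clf_b, images, target_cls, steps=200, lr=0.5, lam=0.01):
    """Optimize an additive trigger delta so that clf(x + delta) predicts
    target_cls, using the analytic softmax-CE gradient of a linear classifier."""
    n, d = images.shape
    delta = np.zeros(d)
    for _ in range(steps):
        logits = (images + delta) @ clf_w + clf_b          # (n, C)
        p = np.exp(logits - logits.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)
        grad_logits = p.copy()
        grad_logits[:, target_cls] -= 1.0                  # dCE/dlogits
        grad = (grad_logits @ clf_w.T).mean(0) + lam * np.sign(delta)  # + L1 sparsity
        delta -= lr * grad
    return delta

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 3))
b = np.zeros(3)
X = rng.normal(size=(32, 20))
delta = invert_trigger(W, b, X, target_cls=2)
preds = ((X + delta) @ W + b).argmax(1)
print((preds == 2).mean())  # attack success rate on the held inputs
```

Detection methods of this family then compare the reconstructed triggers across classes: a class whose trigger is anomalously small or anomalously effective is flagged as the likely backdoor target.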

[CV-54] Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

[Quick Read] This paper aims to solve metric-scale amodal reconstruction and pose estimation of objects under severe hand occlusion. Vision-only 3D generative methods lack physical constraints and therefore tend to produce ambiguous or implausible structure in occluded regions. The key to the solution is introducing physical interaction signals: proprioception supplies the posed hand geometry, and multi-contact touch constrains the object surface to pass through the contact locations, substantially reducing uncertainty in occluded regions. Object structure is represented as a pose-aware, camera-aligned signed distance field (SDF), a Structure-VAE learns a compact latent space, and a conditional flow-matching diffusion model is trained in that space; during finetuning it is conditioned on visible RGB evidence, occlusion masks, the hand latent representation, and tactile information, with physics-based objectives and differentiable decoder guidance that suppress hand-object interpenetration and align the reconstructed surface with contact observations, ultimately yielding high-fidelity, physically consistent, metric-scale object reconstructions.

Link: https://arxiv.org/abs/2604.09100
Authors: Gabriele Mario Caddeo, Pasquale Marra, Lorenzo Natale
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 27 pages, 10 figures, under review

View abstract

Abstract:We propose a multimodal, physically grounded approach for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. Unlike prior occlusion-aware 3D generation methods that rely only on vision, we leverage physical interaction signals: proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface must lie, reducing ambiguity in occluded regions. We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand–object interpenetration and to align the reconstructed surface with contact observations. Because our method produces a metric, physically consistent structure estimate, it integrates naturally into existing two-stage reconstruction pipelines, where a downstream module refines geometry and predicts appearance. Experiments in simulation show that adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines; we further validate transfer by deploying the model on a real humanoid robot with an end-effector different from those used during training.
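The contact and penetration constraints described above can be illustrated with a toy analytic SDF. Here a sphere stands in for the learned object SDF; all names, points, and values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance of query points to a sphere (negative inside)."""
    return np.linalg.norm(points - center, axis=-1) - radius

def contact_loss(sdf_at_contacts):
    # Tactile contacts must lie on the surface: penalize |SDF| there.
    return np.mean(np.abs(sdf_at_contacts))

def penetration_loss(sdf_at_hand):
    # Hand points inside the object (SDF < 0) are penalized.
    return np.mean(np.maximum(0.0, -sdf_at_hand))

center, radius = np.zeros(3), 1.0
contacts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # on the surface
hand_pts = np.array([[0.5, 0.0, 0.0], [2.0, 0.0, 0.0]])   # first one penetrates

c_loss = contact_loss(sphere_sdf(contacts, center, radius))      # -> 0.0
p_loss = penetration_loss(sphere_sdf(hand_pts, center, radius))  # -> 0.25
```

In the paper these terms would act on a learned SDF through a differentiable decoder; the sketch only shows why both losses vanish exactly when contacts sit on the surface and the hand does not penetrate.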

[CV-55] Off-the-shelf Vision Models Benefit Image Manipulation Localization

[Quick Read] This paper tackles the long-standing separation between image manipulation localization (IML) and general vision tasks caused by their differing features, i.e., how to exploit the semantic priors in general-purpose vision models to improve IML. The key to the solution is a trainable adapter module (ReVi) inspired by robust principal component analysis, which disentangles semantically redundant information from pre-trained general-purpose vision models (e.g., image generation and segmentation networks) and selectively enhances manipulation-specific information. Only the adapter parameters are fine-tuned, with no redesign or full retraining of the frozen general-purpose models, yielding an efficient and scalable IML framework.

Link: https://arxiv.org/abs/2604.09096
Authors: Zhengxuan Zhang, Keji Song, Junmin Hu, Ao Luo, Yuezun Li
Affiliations: Ocean University of China; Southwest Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments:

View abstract

Abstract:Image manipulation localization (IML) and general vision tasks are typically treated as two separate research directions due to the fundamental differences between manipulation-specific and semantic features. In this paper, however, we bridge this gap by introducing a fresh perspective: these two directions are intrinsically connected, and general semantic priors can benefit IML. Building on this insight, we propose a novel trainable adapter (named ReVi) that repurposes existing off-the-shelf general-purpose vision models (e.g., image generation and segmentation networks) for IML. Inspired by robust principal component analysis, the adapter disentangles semantic redundancy from manipulation-specific information embedded in these models and selectively enhances the latter. Unlike existing IML methods that require extensive model redesign and full retraining, our method relies on the off-the-shelf vision models with frozen parameters and only fine-tunes the proposed adapter. The experimental results demonstrate the superiority of our method, showing the potential for scalable IML frameworks.
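The robust-PCA intuition behind the adapter, splitting features into a low-rank "semantic" part and a sparse "manipulation-specific" residual, can be sketched as one truncated-SVD plus soft-thresholding step. This is a simplification of RPCA on a toy matrix; the rank, threshold, and data are assumptions, not ReVi's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy feature matrix: a rank-1 "semantic" component plus a sparse
# "manipulation-specific" component.
u, v = rng.normal(size=(64, 1)), rng.normal(size=(1, 32))
semantic = u @ v
sparse = np.zeros((64, 32))
sparse[3, 5] = 10.0
sparse[10, 7] = -8.0
F = semantic + sparse

def split_low_rank_sparse(F, rank=1, tau=1.0):
    """One RPCA-flavored step: low-rank part via truncated SVD,
    sparse residual via soft-thresholding."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    R = F - L
    S = np.sign(R) * np.maximum(np.abs(R) - tau, 0.0)
    return L, S

L, S = split_low_rank_sparse(F)  # S should retain the two injected spikes
```

Full RPCA alternates such updates to convergence; the single step already shows how the sparse cues survive while the dominant semantic component is factored out.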

[CV-56] Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation CVPR2026

[Quick Read] This paper addresses the problem that memory-efficient transfer learning (METL) methods introduce extra memory and time overhead at inference, which contradicts the ultimate goal of efficient transfer learning. Existing METL methods rely on a lightweight learnable side network to cut trainable parameters and GPU memory during training, but the side network must still run at inference, hurting efficiency. The key to the solution is a new framework named Masked Dual Path Distillation (MDPD): during training, the frozen backbone and the learnable side network mutually distill knowledge to boost performance, and at inference the side network is removed entirely for lossless acceleration. The method further designs a feature-based knowledge distillation scheme for multi-layer encoder structures, markedly improving accuracy while achieving at least 25.2% inference speedup with parameter and memory consumption comparable to SOTA methods.

Link: https://arxiv.org/abs/2604.09088
Authors: Yutong Zhang, Jiaxin Chen, Honglin Chen, Kaiqi Zheng, Shengcai Liao, Hanwen Zhong, Weixin Li, Yunhong Wang
Affiliations: Beihang University; United Arab Emirates University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2026 Accepted

View abstract

Abstract:Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing the number of trainable parameters and high memory consumption during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances performance by mutually distilling the frozen backbones and learnable side networks during fine-tuning, and discards the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2% while keeping parameter and memory consumption comparable, but also remarkably improves accuracy compared to SOTA approaches. The source code is available at this https URL.
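A minimal sketch of the mutual (dual-path) distillation objective, assuming a standard temperature-scaled symmetric KL between the backbone and side-network logits. The paper's feature-based variant for multi-layer encoders is more elaborate; this is only the generic logit form:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def mutual_distillation_loss(logits_backbone, logits_side, T=2.0):
    """Symmetric KD: each path is distilled toward the other's softened
    distribution (a real implementation would stop-gradient the target),
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(logits_backbone, T)
    q = softmax(logits_side, T)
    return float(np.mean(kl(p, q) + kl(q, p)) * T * T)

a = np.array([[2.0, 0.5, -1.0]])
loss_same = mutual_distillation_loss(a, a.copy())              # identical paths -> 0
loss_diff = mutual_distillation_loss(a, np.array([[-1.0, 0.5, 2.0]]))
```

The loss vanishes exactly when both paths agree, which is what lets the side network be dropped at inference once the backbone path has absorbed its knowledge.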

[CV-57] Cross-Modal Knowledge Distillation from Spatial Transcriptomics to Histology CVPR2026

[Quick Read] This paper addresses the tension between spatial transcriptomics, which is costly and hard to acquire, and HE-stained histology, which is abundant but carries a less granular signal. The core challenge is how to use limited paired data to transfer the molecular-level tissue niche structure revealed by spatial transcriptomics into a model that operates on HE images alone, enabling accurate parsing of tissue spatial organization without any transcriptomic input. The key to the solution is a cross-modal distillation strategy: during training, paired spatial transcriptomics and HE data teach the model to extract HE-image features consistent with transcriptomics-defined niche structure; at inference, the model uses HE images alone to produce tissue partitions highly consistent with spatial transcriptomics, substantially outperforming unsupervised morphology-only baselines, with biological plausibility confirmed by cell-type analysis.

Link: https://arxiv.org/abs/2604.09076
Authors: Arbel Hizmi, Artemii Bakulin, Shai Bagon, Nir Yosef
Affiliations: Weizmann Institute of Science; Reichman University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the CVMI Workshop at CVPR 2026. Project page: this https URL

View abstract

Abstract:Spatial transcriptomics provides a molecularly rich description of tissue organization, enabling unsupervised discovery of tissue niches – spatially coherent regions of distinct cell-type composition and function that are relevant to both biological research and clinical interpretation. However, spatial transcriptomics remains costly and scarce, while HE histology is abundant but carries a less granular signal. We propose to leverage paired spatial transcriptomics and HE data to transfer transcriptomics-derived niche structure to a histology-only model via cross-modal distillation. Across multiple tissue types and disease contexts, the distilled model achieves substantially higher agreement with transcriptomics-derived niche structure than unsupervised morphology-based baselines trained on identical image features, and recovers biologically meaningful neighborhood composition as confirmed by cell-type analysis. The resulting framework leverages paired spatial transcriptomic and HE data during training, and can then be applied to held-out tissue regions using histology alone, without any transcriptomic input at inference time.

[CV-58] Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

[Quick Read] This paper addresses the oversmoothing of high-frequency motion details in zero-shot skeleton action recognition (ZSAR) caused by the spectral bias of diffusion models, which limits generalization to novel actions. The key to the solution is the proposed Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), whose core innovations are a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction, effectively recovering fine-grained motion features and achieving state-of-the-art performance on the NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets.

Link: https://arxiv.org/abs/2604.09063
Authors: Yuxi Zhou, Zhengbo Zhang, Jingyu Pan, Zhiyu Lin, Zhigang Tu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at this https URL. Project homepage: this https URL

[CV-59] Nested Radially Monotone Polar Occupancy Estimation: Clinically-Grounded Optic Disc and Cup Segmentation for Glaucoma Screening

[Quick Read] This paper addresses the fact that existing deep learning methods for optic disc (OD) and optic cup (OC) segmentation cannot guarantee clinical validity (such as star-convexity and the nested structure), which corrupts diagnostic metrics, especially under cross-dataset domain shift. The key to the solution is NPS-Net (Nested Polar Shape Network), the first framework to formulate OD/OC segmentation as a nested radially monotone polar occupancy representation; this output representation inherently guarantees clinical anatomical validity and high accuracy, enabling strong zero-shot generalization.

Link: https://arxiv.org/abs/2604.09062
Authors: Rimsa Goperma, Rojan Basnet, Liang Zhao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:Valid segmentation of the optic disc (OD) and optic cup (OC) from fundus photographs is essential for glaucoma screening. Unfortunately, existing deep learning methods do not guarantee clinical validity, including star-convexity and the nested structure of OD and OC, which corrupts diagnostic metrics, especially under cross-dataset domain shift. To address this issue, this paper proposes NPS-Net (Nested Polar Shape Network), the first framework that formulates OD/OC segmentation as nested radially monotone polar occupancy estimation. This output representation guarantees the aforementioned clinical validity and achieves high accuracy. Evaluated across seven public datasets, NPS-Net shows strong zero-shot generalization. On RIM-ONE, it maintains 100% anatomical validity and improves Cup Dice by 12.8% absolute over the best baseline, reducing vCDR MAE by over 56%. On PAPILA, it achieves Disc Dice of 0.9438 and Disc HD95 of 2.78 px, an 83% reduction over the best competing method.
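One simple way to guarantee the nested constraint by construction, which captures the general idea behind such output parameterizations rather than NPS-Net's exact formulation, is to predict the cup radius as a fraction of the disc radius at each polar angle:

```python
import numpy as np

def nested_polar_radii(raw_disc, raw_cup):
    """Map unconstrained network outputs to valid polar radii:
    disc radius is positive via exp, and the cup radius is
    disc * sigmoid(raw), so the cup is always strictly nested."""
    disc = np.exp(raw_disc)                  # > 0 everywhere
    ratio = 1.0 / (1.0 + np.exp(-raw_cup))   # in (0, 1)
    return disc, disc * ratio

# Toy unconstrained outputs at 4 polar angles (illustrative values).
raw_disc = np.array([0.1, 0.2, 0.0, -0.1])
raw_cup = np.array([1.0, -2.0, 0.0, 3.0])
disc_r, cup_r = nested_polar_radii(raw_disc, raw_cup)

# A vertical cup-to-disc ratio (vCDR) from two opposite samples; in
# practice the angles nearest 90/270 degrees would be used.
vcdr = (cup_r[0] + cup_r[2]) / (disc_r[0] + disc_r[2])
```

Because validity is built into the parameterization, no post-processing is needed to repair cup-outside-disc predictions, which is why diagnostic ratios like vCDR stay well-defined under domain shift.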

[CV-60] Learning Vision-Language-Action World Models for Autonomous Driving CVPR2026

[Quick Read] This paper addresses the lack of explicit temporal-dynamics modeling and global world consistency in current Vision-Language-Action (VLA) models for end-to-end autonomous driving, which limits their foresight and safety. The key to the solution is VLA-World, a VLA world model that unifies predictive imagination with reflective reasoning: it first uses an action-derived feasible trajectory to guide next-frame image generation, capturing rich spatial and temporal cues that describe how the surrounding environment evolves, and then reasons over the self-generated future frame to refine the predicted trajectory, achieving higher performance and better interpretability.

Link: https://arxiv.org/abs/2604.09059
Authors: Guoqing Wang, Pin Tang, Xiangxuan Ren, Guodongfang Zhao, Bailan Feng, Chao Ma
Affiliations: Shanghai Jiao Tong University; Huawei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR2026 findings

View abstract

Abstract:Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: this https URL

[CV-61] ora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

[Quick Read] This paper addresses inconsistent motion-sound relations in audio-video (AV) generation: existing methods often produce visually unstable object motion and sounds only loosely tied to salient motion or contact events, fundamentally because there is no shared, explicit motion-aware structure unifying video and audio generation. The key to the solution is the Tora3 framework, whose core innovation is using object trajectories as a shared kinematic prior, realized through three techniques: 1) a trajectory-aligned motion representation for video generation; 2) a kinematic-audio alignment module driven by trajectory-derived second-order kinematic states; and 3) a hybrid flow-matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. Together these substantially improve physical plausibility, motion-sound synchronization, and overall quality in AV generation.

Link: https://arxiv.org/abs/2604.09057
Authors: Junchao Liao, Zhenghao Zhang, Xiangyu Meng, Litao Li, Ziying Zhang, Siyu Zhu, Long Qin, Weizhi Wang
Affiliations: Alibaba Cloud Computing; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments:

View abstract

Abstract:Audio-video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion-sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation. We present Tora3, a trajectory-guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video-only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory-aligned motion representation for video, a kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and a hybrid flow matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. We further curate PAV, a large-scale AV dataset emphasizing motion-relevant patterns with automatically extracted motion annotations. Extensive experiments show that Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.
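The "trajectory-derived second-order kinematic states" that drive the kinematic-audio alignment can be approximated from a sampled trajectory with finite differences. This is a generic sketch of such states (velocity and acceleration); Tora3's actual representation is not specified at this level of detail:

```python
import numpy as np

def kinematic_states(traj, dt=1.0):
    """First- and second-order kinematic states (velocity, acceleration)
    from a sampled 2D trajectory via finite differences (central in the
    interior, one-sided at the boundaries, as np.gradient does)."""
    vel = np.gradient(traj, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    return vel, acc

# Toy constant-acceleration motion: x = t, y = 0.5 * t^2.
t = np.arange(5, dtype=float)
traj = np.stack([t, 0.5 * t**2], axis=1)
vel, acc = kinematic_states(traj)
```

Peaks in acceleration magnitude are natural candidates for the contact/impact moments that a sound event should align to.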

[CV-62] Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy

[Quick Read] This paper addresses fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy (RAPN): frame-level recognition of visually similar suture gestures with variable duration and severe class imbalance. The key to the solution is building and evaluating four temporal models on I3D features (MS-TCN++, AsFormer, TUT, and DiffAct), compared comprehensively across multiple metrics (balanced accuracy, edit score, segmental F1 at several overlap thresholds, frame-wise accuracy, and mean average precision), with additional cross-domain testing to validate generalization; DiffAct performs best on most key metrics, while MS-TCN++ leads in balanced accuracy.

Link: https://arxiv.org/abs/2604.09051
Authors: Jiaheng Dai, Huanrong Liu, Tailai Zhou, Tongyu Jia, Qin Liu, Yutong Ban, Zeju Li, Yu Gao, Xin Ma, Qingbiao Li
Affiliations: Fudan University; University of Macau; The Chinese PLA General Hospital; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

View abstract

Abstract:Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10, 25, and 50, frame-wise accuracy, and frame-wise mean average precision. In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Across the strongest reported values over those five runs on the primary dataset, DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame mAP, while MS-TCN++ attains the highest balanced accuracy.
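The edit score used in the benchmark is the standard segmental metric for action segmentation: frame labels are collapsed into their segment sequence and scored by normalized Levenshtein distance, so over-segmentation is penalized even when frame accuracy is high. A self-contained reference implementation:

```python
def segments(frames):
    """Collapse a frame-wise label sequence into its segment sequence."""
    segs = []
    for f in frames:
        if not segs or segs[-1] != f:
            segs.append(f)
    return segs

def levenshtein(a, b):
    """Edit distance between two sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def edit_score(pred_frames, gt_frames):
    p, g = segments(pred_frames), segments(gt_frames)
    return (1.0 - levenshtein(p, g) / max(len(p), len(g), 1)) * 100.0

gt   = list("AAAABBBCC")
pred = list("AAABBBBCC")   # same segment order, different boundaries -> 100
frag = list("AABABBBCC")   # over-segmented prediction -> penalized
```

Comparing `pred` and `frag` against `gt` shows why the metric matters: both differ from the ground truth by a few frames, but only the fragmented one loses edit score.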

[CV-63] Text-Conditioned Multi-Expert Regression Framework for Fully Automated Multi-Abutment Design

[Quick Read] This paper addresses the heavy manual effort, low efficiency, and poor scalability of current dental implant abutment design in multi-abutment scenarios; existing deep-network approaches remain semi-automated and require substantial clinician intervention. The key to the solution is TEMAD, a fully automated, text-conditioned multi-expert architecture for multi-abutment design that unifies implant site localization, implant system identification, and abutment parameter regression in a single pipeline: an Implant Site Identification Network (ISIN) automatically localizes implant sites; a Tooth-Conditioned Feature-wise Linear Modulation (TC-FiLM) module adaptively calibrates mesh features with tooth embeddings; and a System-Prompted Mixture-of-Experts (SPMoE) mechanism uses implant system prompts to dynamically select the best expert for regression, enabling accurate and efficient automated multi-abutment design.

Link: https://arxiv.org/abs/2604.09047
Authors: Mianjie Zheng, Xinquan Yang, Xuefen Liu, Xuguang Li, Kun Tang, He Meng, Linlin Shen
Affiliations: Shenzhen University; National Engineering Laboratory for Big Data System Computing Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:Dental implant abutments serve as the geometric and biomechanical interface between the implant fixture and the prosthetic crown, yet their design relies heavily on manual effort and is time-consuming. Although deep neural networks have been proposed to assist dentists in designing abutments, most existing approaches remain largely manual or semi-automated, requiring substantial clinician intervention and lacking scalability in multi-abutment scenarios. To address these limitations, we propose TEMAD, a fully automated, text-conditioned multi-expert architecture for multi-abutment design. This framework integrates implant site localization and implant system-compatible abutment parameter regression into a unified pipeline. Specifically, we introduce an Implant Site Identification Network (ISIN) to automatically localize implant sites and provide this information to the subsequent multi-abutment regression network. We further design a Tooth-Conditioned Feature-wise Linear Modulation (TC-FiLM) module, which adaptively calibrates mesh representations using tooth embeddings to enable position-specific feature modulation. Additionally, a System-Prompted Mixture-of-Experts (SPMoE) mechanism leverages implant system prompts to guide expert selection, ensuring system-aware regression. Extensive experiments on a large-scale abutment design dataset show that TEMAD achieves state-of-the-art performance compared to existing methods, particularly in multi-abutment settings, validating its effectiveness for fully automated dental implant planning.

[CV-64] Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting

[Quick Read] This paper addresses the fact that the 2D mask supervision from visual foundation models (VFMs) used in existing 3D scene understanding is not inherently object-centric, causing mask identity conflicts across views and scenes and requiring extra mask pre/post-processing or specialized training strategies to align object identities, which limits the generalizability of the 3D representation. The key to the solution is a dataset-level, object-centric supervision scheme combined with a pre-trained slot-attention-based Global Object Centric Learning (GOCL) module to build a scene-agnostic object codebook; the codebook provides consistent, identity-anchored representations that directly supervise the identity features of 3D Gaussians without extra mask processing or explicit multi-view alignment. This is the first work to bring unsupervised object-centric learning (OCL) into 3D Gaussian Splatting (3DGS), yielding more structured representations and better performance on downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.

Link: https://arxiv.org/abs/2604.09045
Authors: Tsuheng Hsu, Guiyu Liu, Juho Kannala, Janne Heikkilä
Affiliations: Aalto University; University of Oulu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:Recent works on 3D scene understanding leverage 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre/post-processing or specialized training and loss design to resolve mask identity conflicts across views. The learned identity of the 3D scene is scene-dependent, limiting generalizability across scenes. Therefore, we propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module’s unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better generalization for downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.

[CV-65] Towards Lifelong Aerial Autonomy: Geometric Memory Management for Continual Visual Place Recognition in Dynamic Environments

[Quick Read] This paper addresses catastrophic forgetting of visual place recognition (VPR) models caused by environmental change in long-term aerial autonomy. Under sequential missions, conventional continual learning (CL) methods struggle with the severe intra-class variation of geographic features, hurting generalization. The key to the solution is formulating aerial VPR as a mission-based domain-incremental learning (DIL) problem and proposing a heterogeneous memory framework: a Learn-and-Dispose mechanism decouples geographic knowledge into static satellite anchors (preserving global geometric priors) and a dynamic experience replay buffer (retaining domain-specific features), with a spatially-constrained allocation strategy that selects buffer samples by difficulty or feature-space diversity. Experiments show that diversity-driven buffer selection clearly outperforms the random baseline, improving knowledge retention by 7.8% and achieving a better plasticity-stability balance on shuffled mission sequences, demonstrating that maintaining structural feature coverage matters more than focusing on sample difficulty for mitigating catastrophic forgetting.

Link: https://arxiv.org/abs/2604.09038
Authors: Xingyu Shao, Zhiqiang Yan, Liangzheng Sun, Mengfan He, Chao Chen, Jinhui Zhang, Chunyu Li, Ziyang Meng
Affiliations: Tsinghua University; Beijing Institute of Technology; Beihang University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Robust geo-localization in changing environmental conditions is critical for long-term aerial autonomy. While visual place recognition (VPR) models perform well when airborne views match the training domain, adapting them to shifting distributions during sequential missions triggers catastrophic forgetting. Existing continual learning (CL) methods often fail here because geographic features exhibit severe intra-class variations. In this work, we formulate aerial VPR as a mission-based domain-incremental learning (DIL) problem and propose a novel heterogeneous memory framework. To respect strict onboard storage constraints, our “Learn-and-Dispose” pipeline decouples geographic knowledge into static satellite anchors (preserving global geometric priors) and a dynamic experience replay buffer (retaining domain-specific features). We introduce a spatially-constrained allocation strategy that optimizes buffer selection based on sample difficulty or feature space diversity. To facilitate systematic assessment, we provide three evaluation criteria and a comprehensive benchmark derived from 21 diverse mission sequences. Extensive experiments demonstrate that our architecture significantly boosts spatial generalization; our diversity-driven buffer selection outperforms the random baseline by 7.8% in knowledge retention. Unlike class-mean preservation methods that fail in unstructured environments, maximizing structural diversity achieves a superior plasticity-stability balance and ensures order-agnostic robustness across randomized sequences. These results prove that maintaining structural feature coverage is more critical than sample difficulty for resolving catastrophic forgetting in lifelong aerial autonomy.
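The diversity-driven buffer selection can be sketched as greedy farthest-point sampling in feature space, one common instantiation of feature-space diversity; the paper's spatially-constrained allocation adds constraints beyond this generic form:

```python
import numpy as np

def diverse_buffer(features, k):
    """Greedy farthest-point selection: at each step, add the sample
    farthest from the current buffer, maximizing feature-space coverage."""
    chosen = [0]
    d = np.linalg.norm(features - features[0], axis=1)
    while len(chosen) < k:
        i = int(np.argmax(d))
        chosen.append(i)
        d = np.minimum(d, np.linalg.norm(features - features[i], axis=1))
    return chosen

# Three tight clusters of 20 samples each; a diverse buffer of size 3
# should pick one representative from every cluster.
rng = np.random.default_rng(1)
clusters = [rng.normal(c, 0.05, size=(20, 2)) for c in ([0, 0], [10, 0], [0, 10])]
feats = np.concatenate(clusters)
picked = diverse_buffer(feats, 3)
labels = {i // 20 for i in picked}  # cluster index of each picked sample
```

Random selection of 3 samples would frequently miss a cluster entirely, which mirrors the paper's finding that coverage, not difficulty, drives retention.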

[CV-66] NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Multi-Exposure Image Fusion in Dynamic Scenes (Track 2) CVPR

[Quick Read] This paper addresses multi-exposure image fusion in dynamic scenes, i.e., high dynamic range (HDR) imaging under practical conditions such as scene motion, illumination variation, and handheld camera jitter. The key to the solution is a real-world benchmark of 100 training sequences (7 exposure levels) and 100 test sequences (5 exposure levels) for evaluating robustness and detail recovery under the misalignment and ghosting artefacts these conditions produce. Objective evaluation combines PSNR, SSIM, and LPIPS, complemented by a holistic review of perceptual quality, computational efficiency, and reproducibility, ultimately driving notable progress of generative AI methods in removing multi-exposure fusion artefacts and restoring fine structures.

Link: https://arxiv.org/abs/2604.09030
Authors: Lishen Qu, Yao Liu, Jie Liang, Hui Zeng, Wen Dai, Guanyi Qin, Ya-nan Guan, Shihao Zhou, Jufeng Yang, Lei Zhang, Radu Timofte, Xiyuan Yuan, Wanjie Sun, Shihang Li, Bo Zhang, Bin Chen, Jiannan Lin, Yuxu Chen, Qinquan Gao, Tong Tong, Song Gao, Jiacong Tang, Tao Hu, Xiaowen Ma, Qingsen Yan, Sunhan Xu, Juan Wang, Xinyu Sun, Lei Qi, He Xu, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPRW 2026

View abstract

Abstract:This paper presents NTIRE 2026, the 3rd Restore Any Image Model (RAIM) challenge on multi-exposure image fusion in dynamic scenes. We introduce a benchmark that targets a practical yet difficult HDR imaging setting, where exposure bracketing must be fused under scene motion, illumination variation, and handheld camera jitter. The challenge data contains 100 training sequences with 7 exposure levels and 100 test sequences with 5 exposure levels, reflecting real-world scenarios that frequently cause misalignment and ghosting artefacts. We evaluate submissions with a leaderboard score derived from PSNR, SSIM, and LPIPS, while also considering perceptual quality, efficiency, and reproducibility during the final review. This track attracted 114 participating teams and received 987 submissions. The winning methods significantly improved the ability to remove artifacts from multi-exposure fusion and recover fine details. The dataset and the code of each team can be found at the repository: this https URL.
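Of the three leaderboard metrics, PSNR has an exact closed form; a minimal implementation for images normalized to [0, 1] (the challenge's precise weighting of PSNR/SSIM/LPIPS into one leaderboard score is not reproduced here):

```python
import numpy as np

def psnr(ref, img, peak=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, peak]."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

ref = np.zeros((8, 8))
noisy = ref + 0.1        # uniform error of 0.1 -> MSE = 0.01 -> 20 dB
val = psnr(ref, noisy)
```

SSIM and LPIPS are structural and learned perceptual metrics respectively, which is why the leaderboard combines all three rather than relying on pixel-wise error alone.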

[CV-67] Skill-Conditioned Visual Geolocation for Vision-Language Models

[Quick Read] This paper addresses the lack of structured geographic reasoning and autonomous self-evolution in vision-language models (VLMs) for image geolocation. Existing methods rely mainly on implicit parametric memory, which is prone to outdated knowledge and hallucinated reasoning, and inference is a one-off process with no feedback loop for continual improvement based on reasoning outcomes. The key to the solution is GeoSkill, a training-free framework built on an evolvable Skill-Graph: human expert trajectories are first distilled into atomic natural-language skills; at execution time, an inference model reasons directly under the guidance of the current Skill-Graph; at evolution time, a larger model performs multiple reasoning rollouts over image-coordinate pairs from web-scale data and, combined with real-world verification, analyzes successful and failed trajectories to iteratively synthesize and prune skills. This expands the Skill-Graph and corrects geographic biases without any parameter updates, enabling continual growth of the system's cognition and of its real-world geographic knowledge.

Link: https://arxiv.org/abs/2604.09025
Authors: Chenjie Yang, Yutian Jiang, Chenyu Wu
Affiliations: Southwest Jiaotong University; The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a “one-off” process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system’s cognition of real-world geographic knowledge beyond isolated case studies.

[CV-68] Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection ACL2026

[Quick Read] This paper addresses the privacy and safety risk that open-weight multi-modal large language models (MLLMs) can be misused to extract sensitive personal information (such as identity and location) from Internet-scale image data. The key to the solution is ImageProtector, a proactive user-side protection method: it embeds a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack, so that, without affecting human perception, an MLLM analyzing a protected image is consistently induced to produce a refusal response (e.g., "I'm sorry, I can't help with that request"), effectively protecting private information.

Link: https://arxiv.org/abs/2604.09024
Authors: Zedian Shao, Hongbin Liu, Yuepeng Hu, Neil Zhenqiang Gong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Appeared in ACL 2026 main conference

View abstract

Abstract:Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as “I’m sorry, I can’t help with that request.” We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.
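The idea of a small, bounded perturbation that steers a model toward a chosen output can be illustrated with a one-step FGSM-style update on a toy logistic classifier. This is purely illustrative: ImageProtector targets MLLMs with a refusal-inducing objective, not this toy model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, w, b, target, eps):
    """One FGSM-style step on a logistic 'classifier': move x by at most
    eps per pixel in the direction that raises the target-class
    probability (gradient of BCE toward `target`), then clip to [0, 1]."""
    p = sigmoid(w @ x + b)
    grad_x = (p - target) * w          # d BCE(target) / dx
    return np.clip(x - eps * np.sign(grad_x), 0.0, 1.0)

rng = np.random.default_rng(0)
w = rng.normal(size=16)                # toy model weights
b = 0.0
x = rng.uniform(0.2, 0.8, size=16)     # toy "image" in [0, 1]
x_adv = fgsm_perturb(x, w, b, target=1.0, eps=0.05)
p_before = sigmoid(w @ x + b)
p_after = sigmoid(w @ x_adv + b)       # strictly higher than p_before
```

Real attacks (and defenses like the ones evaluated in the paper) iterate such steps under an imperceptibility budget; the single step already shows the bounded-perturbation, targeted-output mechanic.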

[CV-69] CAD 100K: A Comprehensive Multi-Task Dataset for Car Related Visual Anomaly Detection

[Quick Read] This paper addresses the lack of a unified benchmark for multi-task visual anomaly detection in automotive manufacturing quality assessment; existing methods are task-specific and struggle to cooperate across tasks. The key to the solution is the CAD Dataset, the first large-scale benchmark dedicated to multi-task learning (MTL) in car-related domains, covering 7 vehicle domains and 3 anomaly detection tasks with over 100K images, augmented with synthetic data to handle the few-shot anomaly-image challenge. By implementing a multi-task baseline and conducting systematic empirical studies, the work verifies that MTL promotes inter-task knowledge transfer and interaction while also exposing challenging conflicts between tasks, providing a standardized evaluation platform and technical foundation for future multi-task anomaly detection in automotive scenarios.

Link: https://arxiv.org/abs/2604.09023
Authors: Jiahua Pang, Ying Li, Dongpu Cao, Jingcai Luo, Yanuo Zheng, Bao Yunfan, Yujie Lei, Rui Yuan, Yuxi Tian, Guojin Yuan, Hongchang Chen, Zhi Zheng, Yongchun Liu
Affiliations: Beijing Institute of Technology; Tsinghua University; China Agricultural University; Beijing Jiaotong University; The Hong Kong Polytechnic University; Li Auto
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

View abstract

Abstract:Multi-task visual anomaly detection is critical for car-related manufacturing quality assessment. However, existing methods remain task-specific, hindered by the absence of a unified benchmark for multi-task evaluation. To fill this gap, we present the CAD Dataset, a large-scale and comprehensive benchmark designed for car-related multi-task visual anomaly detection. The dataset contains over 100K images spanning 7 vehicle domains and 3 tasks, providing models a comprehensive view for car-related anomaly detection. It is the first car-related anomaly dataset specialized for multi-task learning (MTL), while combining synthetic data augmentation for few-shot anomaly images. We implement a multi-task baseline and conduct extensive empirical studies. Results show MTL promotes task interaction and knowledge transfer, while also exposing challenging conflicts between tasks. The CAD dataset serves as a standardized platform to drive future advances in car-related multi-task visual anomaly detection.

[CV-70] BlendFusion – Scalable Synthetic Data Generation for Diffusion Model Training

【Quick Read】: This paper tackles the visual inconsistencies of synthetic images generated by diffusion models, and the autophagous feedback loop (Model Autophagy Disorder, MAD) that can degrade models trained on such data. The key solution is BlendFusion, a scalable synthetic data generation framework that produces high-quality image-caption pairs from 3D scenes via path tracing. Its core components include an object-centric camera placement strategy, robust filtering mechanisms, and automatic captioning, which together improve the quality and diversity of synthetic data; the approach is validated through the FineBLEND dataset.

Link: https://arxiv.org/abs/2604.09022
Authors: Thejas Venkatesh,Suguna Varshini Velury
Affiliations: Samaya AI, Inc.; Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid adoption of diffusion models, synthetic data generation has emerged as a promising approach for addressing the growing demand for large-scale image datasets. However, images generated purely by diffusion models often exhibit visual inconsistencies, and training models on such data can create an autophagous feedback loop that leads to model collapse, commonly referred to as Model Autophagy Disorder (MAD). To address these challenges, we propose BlendFusion, a scalable framework for synthetic data generation from 3D scenes using path tracing. Our pipeline incorporates an object-centric camera placement strategy, robust filtering mechanisms, and automatic captioning to produce high-quality image-caption pairs. Using this pipeline, we curate FineBLEND, an image-caption dataset constructed from a diverse set of 3D scenes. We empirically analyze the quality of FineBLEND and compare it to several widely used image-caption datasets. We also demonstrate the effectiveness of our object-centric camera placement strategy relative to object-agnostic sampling approaches. Our open-source framework is designed for high configurability, enabling the community to create their own datasets from 3D scenes.
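An object-centric camera placement strategy can be sketched as sampling viewpoints uniformly on a sphere around the object's centroid, with the view direction pointing back at it. A simplified stand-in for the idea (not BlendFusion's exact sampler; radius handling and seeding are assumptions):

```python
import math
import random

def sample_camera(center, radius, rng=None):
    """Draw one viewpoint uniformly on a sphere of the given radius around
    the object's centroid, looking back at the centroid."""
    rng = rng or random.Random(0)
    theta = rng.uniform(0.0, 2.0 * math.pi)   # azimuth
    phi = math.acos(rng.uniform(-1.0, 1.0))   # inclination; uniform over the sphere
    pos = (center[0] + radius * math.sin(phi) * math.cos(theta),
           center[1] + radius * math.sin(phi) * math.sin(theta),
           center[2] + radius * math.cos(phi))
    look = tuple((c - p) / radius for c, p in zip(center, pos))  # unit view direction
    return pos, look
```

Sampling `cos(phi)` uniformly (rather than `phi` itself) is what makes the viewpoints uniform over the sphere instead of clustering at the poles.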

[CV-71] Domain-generalizable Face Anti-Spoofing with Patch-based Multi-tasking and Artifact Pattern Conversion ATC

【Quick Read】: This paper addresses the limited generalization of Face Anti-Spoofing (FAS) algorithms to unseen visual domains and novel spoofing methods. The core solution is a Pattern Conversion Generative Adversarial Network (PCGAN) that effectively disentangles spoof artifacts from facial features in the latent vectors, enabling the generation of images with diverse artifact patterns and improving adaptability to unseen attack scenarios. Patch-based and multi-task learning are further introduced to strengthen detection of partial attacks and to mitigate overfitting to facial features.

Link: https://arxiv.org/abs/2604.09018
Authors: Seungjin Jung,Yonghyun Jeong,Minha Kim,Jimin Min,Youngjoon Yoo,Jongwon Choi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The published version is available at DOI: this https URL

Abstract:Face Anti-Spoofing (FAS) algorithms, designed to secure face recognition systems against spoofing, struggle with limited dataset diversity, impairing their ability to handle unseen visual domains and spoofing methods. We introduce the Pattern Conversion Generative Adversarial Network (PCGAN) to enhance domain generalization in FAS. PCGAN effectively disentangles latent vectors for spoof artifacts and facial features, allowing it to generate images with diverse artifacts. We further incorporate patch-based and multi-task learning to tackle partial attacks and overfitting issues to facial features. Our extensive experiments validate PCGAN’s effectiveness in domain generalization and detecting partial attacks, giving a substantial improvement in facial recognition security.

[CV-72] Robust by Design: A Continuous Monitoring and Data Integration Framework for Medical AI

【Quick Read】: This paper addresses the performance degradation of adaptive medical AI models caused by data drift in dynamic clinical environments. The key solution is an autonomous continuous monitoring and data integration framework that decides when to incrementally retrain on new data via multi-metric feature analysis (Euclidean, cosine, and Mahalanobis distances) and Monte Carlo dropout-based uncertainty gating. Only new images that are statistically similar to the training distribution and have low predictive entropy are integrated, and incremental retraining proceeds under strict performance safeguards (e.g., no more than a 5% drop in AUC or accuracy), thereby handling data shift while avoiding catastrophic forgetting and enabling continual learning with stable performance for medical imaging AI.

Link: https://arxiv.org/abs/2604.09009
Authors: Mohammad Daouk,Jan Ulrich Becker,Neeraja Kambham,Anthony Chang,Chandra Mohan,Hien Van Nguyen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE ISBI 2026. Chandra Mohan and Hien Van Nguyen jointly supervised this work

Abstract:Adaptive medical AI models often face performance drops in dynamic clinical environments due to data drift. We propose an autonomous continuous monitoring and data integration framework that maintains robust performance over time. Focusing on glomerular pathology image classification (proliferative vs. non-proliferative lupus nephritis), our three-stage method uses multi-metric feature analysis and Monte Carlo dropout-based uncertainty gating to decide when to retrain on new data. Only images statistically similar to the training distribution (via Euclidean, cosine, Mahalanobis metrics) and with low predictive entropy are integrated. The model is then incrementally retrained with these images under strict performance safeguards (no metric degradation beyond 5%). In experiments with a ResNet18 ensemble on a multi-center dataset, the framework prevents performance degradation: new images were added without significant change in AUC (~0.92) or accuracy (~89%). This approach addresses data shift and avoids catastrophic forgetting, enabling sustained learning in medical imaging AI.
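The integration gate described above can be pictured as a conjunction of distance checks and an uncertainty check. A toy sketch with two of the three distance metrics (the Mahalanobis term needs a covariance estimate and is omitted here; all threshold values are invented placeholders, not the paper's):

```python
import math

def should_integrate(feat, train_mean, probs, d_max=1.0, cos_min=0.9, h_max=0.5):
    """Accept a new image only if its feature is close to the training
    distribution (Euclidean + cosine) and predictive entropy is low."""
    eucl = math.sqrt(sum((f - m) ** 2 for f, m in zip(feat, train_mean)))
    dot = sum(f * m for f, m in zip(feat, train_mean))
    cos = dot / (math.sqrt(sum(f * f for f in feat)) *
                 math.sqrt(sum(m * m for m in train_mean)))
    entropy = -sum(p * math.log(p) for p in probs if p > 0)  # predictive entropy
    return eucl <= d_max and cos >= cos_min and entropy <= h_max
```

In the paper's setting `probs` would be the Monte Carlo dropout-averaged class probabilities, so low entropy doubles as a confidence filter.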

[CV-73] StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding ACL

【Quick Read】: This paper targets the high memory overhead of vision agent memory in streaming video understanding, where storing large numbers of memory nodes drives up storage and computation costs. The key solution is StreamMeCo, an efficient streaming agent memory compression framework: based on the connectivity of the memory graph, it applies edge-free min-max sampling to isolated nodes and edge-aware weight pruning to connected nodes, evicting redundant memory nodes while maintaining accuracy; a time-decay memory retrieval mechanism further alleviates the performance degradation caused by compression. Experiments show that under 70% memory graph compression, StreamMeCo achieves a 1.87× speedup in memory retrieval with an average accuracy gain of 1.0%.

Link: https://arxiv.org/abs/2604.09000
Authors: Junxi Wang,Te Sun,Jiayi Zhu,Junxian Li,Haowen Xu,Zichen Wen,Xuming Hu,Zhiyu Li,Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; Fudan University; Shanghai AI Laboratory; Hong Kong University of Science and Technology; MemTensor (Shanghai) Technology Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2026 ACL Findings

Abstract:Vision agent memory has shown remarkable effectiveness in streaming video understanding. However, storing such memory for videos incurs substantial memory overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introduces edge-free minmax sampling for the isolated nodes and an edge-aware weight pruning for connected nodes, evicting the redundant memory nodes while maintaining the accuracy. In addition, we introduce a time-decay memory retrieval mechanism to further eliminate the performance degradation caused by memory compression. Extensive experiments on three challenging benchmark datasets (M3-Bench-robot, M3-Bench-web and Video-MME-Long) demonstrate that under 70% memory graph compression, StreamMeCo achieves a 1.87× speedup in memory retrieval while delivering an average accuracy improvement of 1.0%. Our code is available at this https URL.
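Time-decay retrieval can be sketched as ranking memory nodes by query similarity discounted exponentially by node age, so stale nodes only surface when they are strongly relevant. A hypothetical scoring form (the decay rate and ranking rule are assumptions, not StreamMeCo's exact mechanism):

```python
import math

def retrieve(nodes, t_now, lam=0.1, top_k=2):
    """Rank memory nodes by similarity * exp(-lam * age) and return the
    top-k node ids. nodes: list of (node_id, timestamp, similarity)."""
    scored = sorted(
        ((sim * math.exp(-lam * (t_now - t)), nid) for nid, t, sim in nodes),
        reverse=True,
    )
    return [nid for _, nid in scored[:top_k]]
```

With this scoring, a moderately similar recent node can outrank a highly similar but old one, which is the behavior that compensates for aggressive node eviction.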

[CV-74] Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

【Quick Read】: This paper addresses the difficulty of simultaneously achieving memory-enabled long-term temporal consistency and high-resolution real-time generation in interactive video generation, which limits real-world usability. The solution has three parts: first, an industrial-scale infinite data engine that combines Unreal Engine synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplets; second, a long-horizon consistency training framework that models prediction residuals and re-injects imperfect generated frames during training so the base model learns self-correction, combined with camera-aware memory retrieval and injection to ensure spatiotemporal consistency; third, a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), together with model quantization and VAE decoder pruning, for efficient real-time inference. The method reaches up to 40 FPS real-time generation at 720p resolution and maintains stable memory consistency over minute-long sequences, offering a practical path toward deployable world models.

Link: https://arxiv.org/abs/2604.08995
Authors: Zile Wang,Zexiang Liu,Jaixing Li,Kaichen Huang,Baixin Xu,Fei Kang,Mengyin An,Peiyu Wang,Biao Jiang,Yichen Wei,Yidan Xietian,Jiangbo Pei,Liang Hu,Boyi Jiang,Hua Xue,Zidong Wang,Haofeng Sun,Wei Li,Wanli Ouyang,Xianglong He,Yang Liu,Yangguang Li,Yahui Zhou
Affiliations: Skywork AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.

[CV-75] PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

【Quick Read】: This paper addresses small object-centric spatial understanding in indoor videos, a key challenge for multimodal large language models (MLLMs) in object search and assistive applications. Existing benchmarks cover video spatial intelligence, embodied reasoning, and diagnostic perception, but none directly evaluate whether a model can precisely localize a target object and express its position with enough precision for downstream use. The key contribution is PinpointQA, the first benchmark for this task, comprising 1,024 scenes and 10,094 QA pairs organized into four progressively harder tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). QA pairs are generated automatically from intermediate spatial representations and refined via quality control; experiments show the dataset both serves as a diagnostic benchmark exposing capability gaps and, through supervised fine-tuning, substantially improves model performance on the harder tasks, especially SSP.

Link: https://arxiv.org/abs/2604.08991
Authors: Zhiyu Zhou,Peilin Liu,Ruoxuan Zhang,Luyang Zhang,Cheng Zhang,Hongxia Xie,Wen-Huang Cheng
Affiliations: Jilin University; National Taiwan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at this https URL.

[CV-76] ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

【Quick Read】: This paper addresses the passive-perception limitation of existing MLLM-based facial expression recognition (FER) methods, which rely on externally prepared facial inputs and perform only single-pass reasoning without the ability to actively acquire visual evidence. The key solution is ActFER, which reformulates FER as active visual evidence acquisition followed by multimodal reasoning: it dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and performs visual chain-of-thought reasoning over facial Action Units (AUs). To realize this behavior, the authors develop Utility-Calibrated GRPO (UC-GRPO), a reinforcement learning algorithm tailored to agentic FER, whose core innovations include AU-grounded multi-level verifiable rewards to densify supervision, query-conditional contrastive utility estimation for sample-aware dynamic credit assignment, and emotion-aware EMA calibration to reduce noisy utility estimates while capturing emotion-wise inspection tendencies, so the model learns both when local inspection is beneficial and how to reason over the acquired evidence.

Link: https://arxiv.org/abs/2604.08990
Authors: Shifeng Liu,Zhengye Zhang,Sirui Zhao,Xinglong Mao,Zhehan Kan,Zhixiang Wei,Shiwei Wu,Chaoyou Fu,Tong Xu,Enhong Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have created new opportunities for facial expression recognition (FER), moving it beyond pure label prediction toward reasoning-based affect understanding. However, existing MLLM-based FER methods still follow a passive paradigm: they rely on externally prepared facial inputs and perform single-pass reasoning over fixed visual evidence, without the capability for active facial perception. To address this limitation, we propose ActFER, an agentic framework that reformulates FER as active visual evidence acquisition followed by multimodal reasoning. Specifically, ActFER dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and reasons over facial Action Units (AUs) and emotions through a visual Chain-of-Thought. To realize such behavior, we further develop Utility-Calibrated GRPO (UC-GRPO), a reinforcement learning algorithm tailored to agentic FER. UC-GRPO uses AU-grounded multi-level verifiable rewards to densify supervision, query-conditional contrastive utility estimation to enable sample-aware dynamic credit assignment for local inspection, and emotion-aware EMA calibration to reduce noisy utility estimates while capturing emotion-wise inspection tendencies. This algorithm enables ActFER to learn both when local inspection is beneficial and how to reason over the acquired evidence. Comprehensive experiments show that ActFER trained with UC-GRPO consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy.
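The emotion-aware EMA calibration mentioned above can be pictured as per-emotion exponential smoothing of noisy utility estimates. A toy stand-in for that bookkeeping (the class name, smoothing factor, and initialize-with-first-observation rule are assumptions, not UC-GRPO's exact formulation):

```python
class EmotionUtilityEMA:
    """Per-emotion exponential moving average of utility estimates."""

    def __init__(self, beta=0.9):
        self.beta = beta      # smoothing factor; higher = slower updates
        self.utility = {}     # emotion -> calibrated utility

    def update(self, emotion, value):
        prev = self.utility.get(emotion, value)  # first observation seeds the average
        self.utility[emotion] = self.beta * prev + (1.0 - self.beta) * value
        return self.utility[emotion]
```

Keeping a separate running average per emotion is what lets the policy learn, for example, that zooming in tends to pay off for subtle expressions but not for obvious ones.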

[CV-77] How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms CVPR2026

【Quick Read】: This paper investigates how the output paradigm of video temporal grounding (VTG) systems trades off localization accuracy against system efficiency, a question existing work obscures by coupling output design with different backbones, datasets, and training protocols, which matters especially for resource-constrained edge deployment. The key contribution is a controlled empirical study comparing three dominant VTG output paradigms (text numeral generation, temporal token generation, and continuous temporal decoding) under identical compact VLMs, consistent datasets, and LoRA fine-tuning protocols, thereby isolating the effect of the output formulation on grounding accuracy and on system-level metrics such as inference latency, training throughput, and parameter overhead. Results show the continuous distribution paradigm achieves the most favorable efficiency-accuracy balance on the Pareto frontier, providing empirical guidance for designing efficient, deployment-ready VTG systems.

Link: https://arxiv.org/abs/2604.08966
Authors: Shengji Jin,Yuanhao Zou,Victor Zhu,Zhengping Ji,Chen Chen
Affiliations: The University of Central Florida; Axon
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 Workshop Paper

Abstract:While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. This makes it challenging to isolate the specific impact of the output design. Additionally, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency requires systematic investigation. In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate these paradigms across identical compact VLMs (SmolVLM2, FastVLM, and Molmo2) using consistent datasets and LoRA fine-tuning protocols. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization accuracy and system efficiency, including inference latency, training throughput, and parameter overhead. Our results demonstrate that the choice of output formulation significantly affects both grounding accuracy and computational cost, independent of model scale. Specifically, the continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead. These findings provide objective empirical guidelines for designing efficient, deployment-ready VTG systems.
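The continuous decoding paradigm can be illustrated as taking the expectation of a softmax distribution over normalized time bins, rather than emitting the timestamp as text tokens. A minimal sketch (the bin count and expectation-style decode are illustrative, not the exact heads compared in the study):

```python
import math

def decode_time(logits, duration):
    """Decode a timestamp as the softmax-weighted expectation over bin
    centers on normalized time [0, 1], scaled back to the clip duration."""
    m = max(logits)
    exps = [l_i - m for l_i in logits]
    exps = [math.exp(e) for e in exps]                # numerically stable softmax
    z = sum(exps)
    centers = [(i + 0.5) / len(logits) for i in range(len(logits))]
    return duration * sum((e / z) * c for e, c in zip(exps, centers))
```

A single forward pass yields the timestamp directly, which is why this family avoids the token-by-token decoding latency of text-numeral outputs.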

[CV-78] Dynamic Class-Aware Active Learning for Unbiased Satellite Image Segmentation

【Quick Read】: This paper addresses the limitations of semantic segmentation for satellite imagery caused by high annotation costs and class imbalance. Conventional active learning (AL) methods select samples via global uncertainty or diversity measures and cannot dynamically target underperforming or rare classes as training progresses, aggravating class bias. The key solution is a novel adaptive acquisition function, Dynamic Class-Aware Uncertainty based Active Learning (DCAU-AL), which tracks per-class segmentation performance gaps in real time and dynamically adjusts sampling weights to keep focusing on poorly performing or under-sampled classes, effectively mitigating class imbalance and substantially improving annotation efficiency while maintaining high accuracy.

Link: https://arxiv.org/abs/2604.08965
Authors: Gadi Hemanth Kumar,Athira Nambiar,Pankaj Bodani
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic segmentation of satellite imagery plays a vital role in land cover mapping and environmental monitoring. However, annotating large-scale, high-resolution satellite datasets is costly and time consuming, especially when covering vast geographic regions. Instead of randomly labeling data or exhaustively annotating entire datasets, Active Learning (AL) offers an efficient alternative by intelligently selecting the most informative samples for annotation with the help of Human-in-the-loop (HITL), thereby reducing labeling costs while maintaining high model performance. AL is particularly beneficial for large-scale or resource-constrained satellite applications, as it enables high segmentation accuracy with significantly fewer labeled samples. Despite these advantages, standard AL strategies typically rely on global uncertainty or diversity measures and lack the adaptability to target underperforming or rare classes as training progresses, leading to bias in the system. To overcome these limitations, we propose a novel adaptive acquisition function, Dynamic Class-Aware Uncertainty based Active learning (DCAU-AL) that prioritizes sample selection based on real-time class-wise performance gaps, thereby overcoming the class-imbalance issue. The proposed DCAU-AL mechanism continuously tracks per-class segmentation performance and dynamically adjusts the sampling weights to focus on poorly performing or underrepresented classes throughout the active learning process. Extensive experiments on the OpenEarth land cover dataset show that DCAU-AL significantly outperforms existing AL methods, especially under severe class imbalance, delivering superior per-class IoU and improved annotation efficiency.
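One simple way to turn class-wise performance gaps into sampling weights is to weight each class by its gap (1 - IoU) and renormalize, so poorly performing classes draw more of the annotation budget. An illustrative form of the idea (DCAU-AL's exact weighting is not specified in the abstract):

```python
def class_weights(per_class_iou, eps=1e-6):
    """Map per-class IoU to normalized sampling weights: the larger the
    performance gap (1 - IoU), the larger the weight."""
    gaps = [max(0.0, 1.0 - iou) + eps for iou in per_class_iou]
    total = sum(gaps)
    return [g / total for g in gaps]
```

Recomputing these weights after each AL round is what makes the acquisition dynamic: once a weak class catches up, its weight shrinks automatically.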

[CV-79] Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift CVPR2026

【Quick Read】: This paper examines the distribution mismatch faced when adapting vision-language models (VLMs) to remote sensing imagery: both the visual and linguistic distributions of satellite data lie far outside natural image pretraining corpora, so conventional prompting fails. The core finding is that, despite carefully engineered language prompts (domain terminology, appearance descriptors, contextual cues) intended to steer frozen model representations toward the target task (cloud segmentation), every prompt variant underperforms the zero-shot baseline (mIoU dropping from 0.255 to as low as 0.07), showing that language guidance cannot bridge the gap between natural image representations and satellite spectral imagery. The key solution is supervised fine-tuning: just 0.1% labeled data (about 8 images) surpasses zero-shot performance overall, and 5-10% data recovers about 85% of the maximum achievable mIoU; moreover, full fine-tuning outperforms low-rank adaptation (LoRA), with the largest gaps on spectrally ambiguous classes, and a "supervision dip" at very low data fractions can be masked by aggregate mIoU. The paper thus argues that labeled data is not an expensive alternative to prompting but the necessary path for adapting VLMs to specialized remote sensing scenarios.

Link: https://arxiv.org/abs/2604.08956
Authors: Harshith Kethavath,Weiming Hu
Affiliations: University of Georgia, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures, to be published in EarthVision @ CVPR 2026

Abstract:Adapting vision-language models to remote sensing imagery presents a fundamental challenge: both the visual and linguistic distributions of satellite data lie far outside natural image pretraining corpora. Despite this, prompting remains the dominant deployment paradigm, driven by the assumption that domain-specific language can guide frozen model representations toward specialized tasks. We test this assumption directly on a domain where the mismatch is prominent: cloud segmentation for satellite imagery. Using CLIPSeg on the CloudSEN12+ cloud segmentation benchmark, we evaluate 60 prompt variants spanning simple labels, domain terminology, appearance descriptors, and contextual cues, finding that every variant underperforms the zero-shot baseline (0.255 mIoU), with engineered prompts scoring as low as 0.07 mIoU. No amount of linguistic refinement bridges the gap between CLIP’s natural image representations and satellite spectral imagery. In contrast, supervised fine-tuning with just 0.1% labeled data (~8 images) surpasses zero-shot performance overall, and 5-10% data recovers ~85% of maximum achievable mIoU. Full fine-tuning consistently outperforms low-rank adaptation by 0.03-0.09 mIoU, with the largest gaps for spectrally ambiguous classes, and at 0.5 to 1% labeled data, fine-tuning temporarily degrades performance on these classes before recovering, a supervision dip that aggregate mIoU can mask. For practitioners adapting vision-language models to specialized imagery, our results deliver a clear message: labeled data is not the expensive alternative to prompting; it is the worthwhile path.

[CV-80] TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches

【Quick Read】: This paper addresses the difficulty of reconstructing an object's global 3D geometry from sparse tactile measurements alone when vision is unreliable, such as under occlusion or poor lighting. The key solution is TouchAnything, a framework that leverages a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior, transferring the geometric knowledge encoded in the diffusion model to the tactile domain: reconstruction is formulated as an optimization problem that enforces tactile consistency while guiding solutions toward shapes consistent with the diffusion prior, enabling accurate 3D reconstruction of unseen objects from only a few touches.

Link: https://arxiv.org/abs/2604.08945
Authors: Langzhe Gu,Hung-Jui Huang,Mohamad Qadri,Michael Kaess,Wenzhen Yuan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project Page: this https URL

Abstract:Accurate object geometry estimation is essential for many downstream tasks, including robotic manipulation and physical interaction. Although vision is the dominant modality for shape perception, it becomes unreliable under occlusions or challenging lighting conditions. In such scenarios, tactile sensing provides direct geometric information through physical contact. However, reconstructing global 3D geometry from sparse local touches alone is fundamentally underconstrained. We present TouchAnything, a framework that leverages a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior for 3D reconstruction from sparse tactile measurements. Unlike prior work that trains category-specific reconstruction networks or learns diffusion models directly from tactile data, we transfer the geometric knowledge encoded in pretrained visual diffusion models to the tactile domain. Given sparse contact constraints and a coarse class-level description of the object, we formulate reconstruction as an optimization problem that enforces tactile consistency while guiding solutions toward shapes consistent with the diffusion prior. Our method reconstructs accurate geometries from only a few touches, outperforms existing baselines, and enables open-world 3D reconstruction of previously unseen object instances. Our project page is this https URL .

[CV-81] MASS: Mesh-inellipse Aligned Deformable Surfel Splatting for Hand Reconstruction and Rendering from Egocentric Monocular Video

【Quick Read】: This paper tackles high-fidelity 3D hand reconstruction from egocentric monocular video, where existing methods fall short in capturing high-resolution geometry, hand-object interactions, and complex objects attached to hands, while also being too computationally expensive for real-time use. The key solution is the Mesh-inellipse Aligned deformable Surfel Splatting (MASS) framework, which introduces a mesh-aligned Steiner inellipse with fractal densification to efficiently convert coarse parametric hand meshes into high-resolution 2D Gaussian surfel representations, and a Gaussian Surfel Deformation mechanism that predicts residual updates to surfel attributes with an opacity mask, efficiently modeling hand deformation and personalized features without adaptive density control, thereby markedly improving reconstruction quality and efficiency.

Link: https://arxiv.org/abs/2604.08943
Authors: Haoyu Zhu,Yi Zhang,Lei Yao,Lap-pui Chau,Yi Wang
Affiliations: The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This paper has been accepted to CVM 2026 Journal Track and is under consideration for publication in IEEE TVCG

Abstract:Reconstructing high-fidelity 3D hands from egocentric monocular videos remains a challenge due to the limitations in capturing high-resolution geometry, hand-object interactions, and complex objects on hands. Additionally, existing methods often incur high computational costs, making them impractical for real-time applications. In this work, we propose Mesh-inellipse Aligned deformable Surfel Splatting (MASS) to address these challenges by leveraging a deformable 2D Gaussian Surfel representation. We introduce the mesh-aligned Steiner Inellipse and fractal densification for mesh-to-surfel conversion that initiates high-resolution 2D Gaussian surfels from coarse parametric hand meshes, providing surface representation with photorealistic rendering potential. Second, we propose Gaussian Surfel Deformation, which enables efficient modeling of hand deformations and personalized features by predicting residual updates to surfel attributes and introducing an opacity mask to refine geometry and texture without adaptive density control. In addition, we propose a two-stage training strategy and a novel binding loss to improve the optimization robustness and reconstruction quality. Extensive experiments on the ARCTIC dataset, the Hand Appearance dataset, and the Interhand2.6M dataset demonstrate that our model achieves superior reconstruction performance compared to state-of-the-art methods.
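The Steiner inellipse that MASS aligns surfels to has a clean closed form when a triangle's vertices are treated as complex numbers: its center is the centroid, and by Marden's theorem its foci are the roots of the derivative of (z - z1)(z - z2)(z - z3). A standalone sketch of that computation (independent of the paper's code):

```python
import cmath

def steiner_inellipse(z1, z2, z3):
    """Return (center, focus1, focus2) of the Steiner inellipse of the
    triangle with vertices z1, z2, z3 given as complex numbers."""
    s = z1 + z2 + z3
    p = z1 * z2 + z2 * z3 + z3 * z1
    center = s / 3                      # centroid of the triangle
    # roots of p'(z) = 3z^2 - 2sz + p, i.e. (s +/- sqrt(s^2 - 3p)) / 3
    disc = cmath.sqrt(s * s - 3 * p)
    return center, (s + disc) / 3, (s - disc) / 3
```

For an equilateral triangle the foci coincide with the centroid (the inellipse degenerates to the incircle), which gives a handy sanity check.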

[CV-82] M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model

【Quick Read】: This paper addresses the information ambiguity in existing medical foundation models (MFMs), where multimodal medical image representations are blended in a single embedding space, degrading modality specificity and diversity. The key solution is M-IDoL, a self-supervised MFM that introduces information decomposition for multimodal representation learning via two objectives: (i) maximizing inter-modality entropy by dispersing representations of different modalities into separable Mixture-of-Experts (MoE) subspaces to enhance cross-modality specificity, and (ii) minimizing intra-modality uncertainty via fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. Pretrained on 1.15 million medical images, M-IDoL outperforms 20 existing foundation models across 21 downstream clinical tasks and exhibits clearer cross-modality feature clustering together with finer-grained intra-modality feature discrimination.

Link: https://arxiv.org/abs/2604.08936
Authors: Yihang Liu,Ying Wen,Jiaxiong Yang,Longzhen Yang,Lianghua He,Heng Tao Shen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Medical foundation models (MFMs) aim to learn universal representations from multimodal medical images that can generalize effectively to diverse downstream clinical tasks. However, most existing MFMs suffer from information ambiguity that blend multimodal representations in a single embedding space, leading to the degradation of modality specificity and diversity. In this paper, we propose M-IDoL, a self-supervised MFM that introduces Information Decomposition for multimodal representation Learning via two objectives: i) maximize inter-modality entropy by dispersing multimodal representation into separable Mixture-of-Experts (MoE) subspaces to achieve representation specificity across modalities; and ii) minimize intra-modality uncertainty by performing fine-grained semantic discrimination within each MoE subspace to enrich representation diversity per modality. By pre-training on 1.15 million medical images, M-IDoL i) delivers superior generalization across 21 downstream clinical tasks, outperforming 20 foundation models on five imaging modalities (e.g., X-ray, fundus, OCT, dermoscopy and pathology), and ii) learns modality-specific and diverse representations, showing clearer separation of feature cluster across modalities and finer-grained feature discrimination within each modality.
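The first objective, maximizing inter-modality entropy, can be pictured as measuring how evenly modalities spread across the MoE subspaces; a uniform spread maximizes the entropy. A toy metric for the quantity being maximized (an illustration, not the paper's training loss):

```python
import math

def modality_entropy(assignment_counts):
    """Shannon entropy of the distribution of modality assignments over
    Mixture-of-Experts subspaces; higher means a more even spread."""
    total = sum(assignment_counts)
    probs = [c / total for c in assignment_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)
```

Pushing this quantity up discourages all modalities from collapsing into the same expert subspace, which is the failure mode the paper calls information ambiguity.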

[CV-83] Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion CVPR2026

【Quick Read】: This paper addresses the inability of existing infrared-visible image fusion methods to adapt simultaneously to multiple downstream tasks, i.e., their lack of dynamic responsiveness to different tasks' semantic requirements. The key solution is a Closed-Loop Dynamic Network (CLDyN) that introduces a closed-loop optimization mechanism, building a semantic transmission chain from downstream tasks back to the fusion network for explicit feedback. Its core component, the Requirement-driven Semantic Compensation (RSC) module, customizes the network architecture to task requirements via a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block, enabling task-specific semantic compensation without retraining and significantly improving multi-task adaptability and fusion performance.

Link: https://arxiv.org/abs/2604.08924
Authors: Zengyi Yang,Yu Liu,Juan Cheng,Zhiqin Zhu,Yafei Zhang,Huafeng Li
Affiliations: Hefei University of Technology; Chongqing University of Post and Telecommunications; Kunming University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by CVPR 2026

Abstract:Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability. The code is available at this https URL.

[CV-84] Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios CVPR2026

【Quick Read】: This paper targets the performance bottleneck that complex real-world degradations (such as noise, blur, and low resolution) impose on image fusion, together with the limited interpretability of end-to-end neural approaches and the difficulty of applying diffusion models directly to fusion, which lacks natural fused data and must model complementary information from multiple sources. The key solution is an efficient degradation-aware diffusion framework that performs implicit denoising by directly regressing the fused image rather than explicitly predicting noise, enabling flexible adaptation to diverse degradation scenarios within a limited number of steps; a joint observation model correction mechanism simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy.

Link: https://arxiv.org/abs/2604.08922
Authors: Yu Shi,Yu Liu,Zhong-Cheng Wu,Juan Cheng,Huafeng Li,Xun Chen
Affiliations: Hefei University of Technology; Kunming University of Science and Technology; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract:Complex degradations like noise, blur, and low resolution are typical challenges in real-world image fusion tasks, limiting the performance and practicality of existing methods. End-to-end neural-network-based approaches are generally simple to design and highly efficient in inference, but their black-box nature leads to limited interpretability. Diffusion-based methods alleviate this to some extent by providing powerful generative priors and a more structured inference process. However, they are trained to learn a single-domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation-aware diffusion framework for image fusion under arbitrary degradation scenarios. Specifically, instead of explicitly predicting noise as in conventional diffusion models, our method performs implicit denoising by directly regressing the fused image, enabling flexible adaptation to diverse fusion tasks under complex degradations with limited steps. Moreover, we design a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy. Experiments on diverse fusion tasks and degradation configurations demonstrate the superiority of the proposed method under complex degradation scenarios.

[CV-85] TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction

【速读】:该论文旨在解决人机交互(Human-Robot Interaction, HRI)场景中机器人对用户关键身体部位进行精确、可度量的三维空间定位问题,传统方法通常关注全身重建质量相对于根关节的准确性,而忽视了任务相关肢体部位在第一人称视角下的精确定位需求。解决方案的关键在于提出TAIHRI——首个专为近距离HRI感知设计的视觉语言模型(Vision-Language Model, VLM),其通过将3D关键点量化至有限交互空间,并利用基于下一词预测(next token prediction)的2D关键点推理机制,实现对任务相关关键点的高精度3D坐标定位,同时具备自然语言控制和全局人体网格恢复等下游任务的无缝适应能力。

链接: https://arxiv.org/abs/2604.08921
作者: Ao Li,Yonggen Ling,Yiyang Lin,Yuji Wang,Yong Deng,Yansong Tang
机构: Tencent(腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate 3D human keypoints localization is a critical technology enabling robots to achieve natural and safe physical interaction with users. Conventional 3D human keypoints estimation methods primarily focus on the whole-body reconstruction quality relative to the root joint. However, in practical human-robot interaction (HRI) scenarios, robots are more concerned with the precise metric-scale spatial localization of task-relevant body parts under the egocentric camera 3D coordinate. We propose TAIHRI, the first Vision-Language Model (VLM) tailored for close-range HRI perception, capable of understanding users’ motion commands and directing the robot’s attention to the most task-relevant keypoints. By quantizing 3D keypoints into a finite interaction space, TAIHRI precisely localizes the 3D spatial coordinates of critical body parts by 2D keypoint reasoning via next-token prediction, and seamlessly adapts to downstream tasks such as natural language control or global space human mesh recovery. Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts. We believe TAIHRI opens new research avenues in the field of embodied human-robot interaction. Code is available at: this https URL.
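The quantization idea can be sketched as mapping continuous 3D coordinates inside a bounded interaction volume onto a finite set of per-axis tokens that a language model can predict. The workspace bounds (`LO`, `HI`) and bin count are assumed values for illustration; the paper's actual tokenization is not reproduced here.

```python
import numpy as np

# Bounded interaction volume in metres and per-axis vocabulary size
# (assumed values, not from the paper)
LO, HI, BINS = -1.0, 1.0, 256

def quantize_kpt(xyz):
    # map each 3D coordinate into one of BINS discrete tokens per axis
    xyz = np.clip(np.asarray(xyz, float), LO, HI)
    return np.round((xyz - LO) / (HI - LO) * (BINS - 1)).astype(int)

def dequantize_kpt(tok):
    # recover metric coordinates; error is bounded by one bin width
    return tok / (BINS - 1) * (HI - LO) + LO
```

Because the token space is finite, next-token prediction over these indices suffices to emit metric-scale 3D locations, with quantization error bounded by the bin width.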

[CV-86] MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation

【速读】:该论文旨在解决现有零样本3D实例分割方法中因依赖独立处理每一帧且仅使用二维度量(如SAM预测得分)而导致的多视角一致性差、3D先验信息利用不足,进而引发3D分割结果碎片化的问题。其解决方案的关键在于提出一种从粗到精的框架MV3DIS,通过引入3D引导的掩码匹配策略,以粗粒度3D片段作为跨视图的统一参考来对齐2D掩码,并借助3D覆盖分布强化多视角掩码一致性;同时设计深度一致性加权机制,量化投影可靠性以抑制物体间遮挡带来的歧义,从而提升3D到2D对应关系的鲁棒性,最终实现更精确的3D实例分割。

链接: https://arxiv.org/abs/2604.08916
作者: Yibo Zhao,Yigong Zhang,Jin Xie
机构: Nankai University (南开大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conventional 3D instance segmentation methods rely on labor-intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi-view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero-shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi-view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse-to-fine framework for zero-shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D-guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi-view mask consistency via 3D coverage distributions. Guided by these view-consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter-object occlusions, thereby improving the robustness of 3D-to-2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods.
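A rough sketch, under assumptions, of how coarse 3D segments can serve as a common reference across views: project segment ids into each view and assign every 2D mask to the segment it covers most, so that per-view mask ids become comparable. The array layouts and the simple coverage vote are simplifications of the paper's matching strategy.

```python
import numpy as np

def match_masks_to_segments(mask_labels, proj_segment_ids):
    """Assign each 2D mask to the coarse 3D segment it covers most.

    mask_labels      : (H, W) int array, 2D mask id per pixel (-1 = none)
    proj_segment_ids : (H, W) int array, projected coarse 3D segment id per pixel
    Returns {mask_id: segment_id}; running this per view lets masks of the
    same object share one 3D segment id across views.
    """
    assignment = {}
    for m in np.unique(mask_labels):
        if m < 0:
            continue
        segs = proj_segment_ids[mask_labels == m]
        ids, counts = np.unique(segs, return_counts=True)
        assignment[int(m)] = int(ids[np.argmax(counts)])  # coverage vote
    return assignment
```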

[CV-87] Large-Scale Universal Defect Generation: Foundation Models and Datasets

【速读】:该论文旨在解决现有缺陷/异常生成方法依赖少量样本学习(few-shot learning)所导致的过拟合问题,尤其是在缺陷尺度和形态存在显著差异时,模型泛化能力弱、真实感差以及类别一致性不足的问题。解决方案的关键在于提出两个核心贡献:一是构建包含30万组正常-异常图像-掩码-描述四元组的大规模跨域数据集UDG,为模型训练提供充足且多样化的数据支撑;二是设计通用缺陷生成基础模型UniDG,通过Defect-Context Editing机制实现自适应缺陷裁剪与结构化双图输入格式,并利用MM-DiT多模态注意力融合参考条件与目标指令,结合两阶段训练策略(Diversity-SFT与Consistency-RFT),在提升多样性的同时显著增强生成结果的真实感与参考一致性,从而在MVTec-AD和VisA等基准上实现优于现有少样本异常生成与图像编辑基线的合成质量及下游单类/多类异常检测与定位性能。

链接: https://arxiv.org/abs/2604.08915
作者: Yuanting Fan,Jun Liu,Bin-Bin Gao,Xiaochen Chen,Yuhuan Lin,Zhewei Dai,Jiawei Zhan,Chengjie Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 25 pages, 13 figures, preprint

点击查看摘要

Abstract:Existing defect/anomaly generation methods often rely on few-shot learning, which overfits to specific defect categories due to the lack of large-scale paired defect editing data. This issue is aggravated by substantial variations in defect scale and morphology, resulting in limited generalization, degraded realism, and category consistency. We address these challenges by introducing UDG, a large-scale dataset of 300K normal-abnormal-mask-caption quadruplets spanning diverse domains, and by presenting UniDG, a universal defect generation foundation model that supports both reference-based defect generation and text instruction-based defect editing without per-category fine-tuning. UniDG performs Defect-Context Editing via adaptive defect cropping and structured diptych input format, and fuses reference and target conditions through MM-DiT multimodal attention. A two-stage training strategy, Diversity-SFT followed by Consistency-RFT, further improves diversity while enhancing realism and reference consistency. Extensive experiments on MVTec-AD and VisA show that UniDG outperforms prior few-shot anomaly generation and image insertion/editing baselines in synthesis quality and downstream single- and multi-class anomaly detection/localization. Code will be available at this https URL.

[CV-88] Fast Model-guided Instance-wise Adaptation Framework for Real-world Pansharpening with Fidelity Constraints

【速读】:该论文旨在解决现有深度学习(Deep Learning, DL)方法在多光谱图像融合(Pansharpening)中训练成本高、数据依赖性强且泛化能力差的问题,以及零样本(Zero-shot)方法在跨传感器场景下融合质量有限、计算开销大和收敛速度慢的局限。解决方案的关键在于提出一种模型引导的实例级自适应框架 FMG-Pan,通过预训练模型指导轻量化自适应网络,并结合光谱保真度与物理保真度约束进行联合优化;其中创新性设计的物理保真度项显著提升了空间细节保留能力,从而在保证跨传感器通用性的同时实现快速训练与推理(如在RTX 3090 GPU上处理512×512×8图像仅需3秒),具备实际部署潜力。

链接: https://arxiv.org/abs/2604.08903
作者: Zhiqi Yang,Jin-Liang Xiao,Shan Yin,Liang-Jian Deng,Gemine Vivone
机构: University of Electronic Science and Technology of China (电子科技大学); National Research Council, Institute of Methodologies for Environmental Analysis (CNR-IMAA) (国家研究委员会环境分析方法研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) images while preserving both spectral and spatial information. Although deep learning (DL)-based pansharpening methods achieve impressive performance, they require high training cost and large datasets, and often degrade when the test distribution differs from training, limiting generalization. Recent zero-shot methods, trained on a single PAN/LRMS pair, offer strong generalization but suffer from limited fusion quality, high computational overhead, and slow convergence. To address these issues, we propose FMG-Pan, a fast and generalizable model-guided instance-wise adaptation framework for real-world pansharpening, achieving both cross-sensor generality and rapid training-inference. The framework leverages a pretrained model to guide a lightweight adaptive network through joint optimization with spectral and physical fidelity constraints. We further design a novel physical fidelity term to enhance spatial detail preservation. Extensive experiments on real-world datasets under both intra- and cross-sensor settings demonstrate state-of-the-art performance. On the WorldView-3 dataset, FMG-Pan completes training and inference for a 512x512x8 image within 3 seconds on an RTX 3090 GPU, significantly faster than existing zero-shot methods, making it suitable for practical deployment.
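The joint optimization can be illustrated with two toy fidelity terms, assuming the spectral constraint is a downsampling consistency between the fused output and the LRMS input, and the physical constraint ties band-averaged intensity to the PAN image. The paper's actual physical fidelity term is more elaborate; this is only a structural sketch.

```python
import numpy as np

def spectral_fidelity(hrms, lrms, scale=2):
    # spatial average-pool of the fused image should match the LRMS input
    h, w, c = hrms.shape
    pooled = hrms.reshape(h // scale, scale, w // scale, scale, c).mean(axis=(1, 3))
    return float(np.mean((pooled - lrms) ** 2))

def physical_fidelity(hrms, pan):
    # band-averaged intensity of the fused image should follow the PAN image
    # (a simplified stand-in for the paper's physical fidelity term)
    return float(np.mean((hrms.mean(axis=2) - pan) ** 2))

def total_loss(hrms, lrms, pan, lam=1.0):
    # instance-wise adaptation would minimize this on a single PAN/LRMS pair
    return spectral_fidelity(hrms, lrms) + lam * physical_fidelity(hrms, pan)
```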

[CV-89] GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing CVPR2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在地球科学与遥感(Remote Sensing, RS)领域发展受限的问题,具体挑战包括跨学科知识广度、异构传感器模态多样性以及任务谱系碎片化。解决方案的关键在于提出GeoMMBench基准测试平台和GeoMMAgent多智能体框架:前者通过覆盖多样RS学科、传感器和任务的问答评测体系,系统评估模型在领域知识、感知定位和推理能力上的不足;后者则通过整合检索、感知与推理模块,并引入领域专用的RS模型与工具,构建可动态应对复杂地理空间问题的工具增强型智能体,实验证明其显著优于独立的大语言模型。

链接: https://arxiv.org/abs/2604.08896
作者: Aoran Xiao,Shihao Cheng,Yonghao Xu,Yexian Ren,Hongruixuan Chen,Naoto Yokoya
机构: RIKEN AIP; Wuhan University; Linköping University; University of Tokyo; Nanjing University of Information Science and Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Highlight paper

点击查看摘要

Abstract:Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models, uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning, all capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.

[CV-90] Ge²mS-T: Multi-Dimensional Grouping for Ultra-High Energy Efficiency in Spiking Transformer

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在应用于脉冲视觉Transformer(Spiking Vision Transformers, S-ViTs)时面临的训练与推理指标不足问题,特别是现有方法如ANN-SNN转换和时空反向传播(Spatial-Temporal Backpropagation, STBP)在内存开销、学习能力与能量预算之间难以协同优化的瓶颈。其解决方案的关键在于提出Ge²mS-T架构,通过在时间、空间和网络结构三个维度上实施分组计算(grouped computation),引入基于分组指数编码的IF模型(Grouped-Exponential-Coding-based IF, ExpG-IF)实现无损转换且保持恒定训练开销,同时设计分组脉冲自注意力机制(Group-wise Spiking Self-Attention, GW-SSA)以多尺度token分组和免乘法运算降低复杂度,从而在保证精度的同时显著提升能效表现。

链接: https://arxiv.org/abs/2604.08894
作者: Zecheng Hao,Shenghao Xie,Kang Chen,Wenxuan Liu,Zhaofei Yu,Tiejun Huang
机构: Peking University (北京大学); State Key Laboratory for Multimedia Information Processing, Peking University (多媒体信息处理国家重点实验室, 北京大学); Academy for Advanced Interdisciplinary Studies, Peking University (北京大学前沿交叉学科研究院); Institute for Artificial Intelligence, Peking University (北京大学人工智能研究院)
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) offer superior energy efficiency over Artificial Neural Networks (ANNs). However, they encounter significant deficiencies in training and inference metrics when applied to Spiking Vision Transformers (S-ViTs). Existing paradigms including ANN-SNN Conversion and Spatial-Temporal Backpropagation (STBP) suffer from inherent limitations, precluding concurrent optimization of memory, accuracy and energy consumption. To address these issues, we propose Ge²mS-T, a novel architecture implementing grouped computation across temporal, spatial and network structure dimensions. Specifically, we introduce the Grouped-Exponential-Coding-based IF (ExpG-IF) model, enabling lossless conversion with constant training overhead and precise regulation for spike patterns. Additionally, we develop Group-wise Spiking Self-Attention (GW-SSA) to reduce computational complexity via multi-scale token grouping and multiplication-free operations within a hybrid attention-convolution framework. Experiments confirm that our method can achieve superior performance with ultra-high energy efficiency on challenging benchmarks. To our best knowledge, this is the first work to systematically establish multi-dimensional grouped computation for resolving the triad of memory overhead, learning capability and energy budget in S-ViTs.
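One way to picture exponential coding in an IF-style neuron: let timestep t carry weight 2^-(t+1), and fire whenever the remaining membrane residual exceeds that weight. This yields a truncated binary expansion of the activation, so a real-valued activation round-trips losslessly up to 2^-T precision. This is a simplified stand-in for the ExpG-IF model, not its published formulation.

```python
import numpy as np

def expg_if_encode(x, steps=8):
    """Encode activations in [0, 1) into binary spikes with exponential
    step weights; firing whenever the residual exceeds the current weight
    reproduces a truncated binary expansion of x."""
    x = np.clip(np.asarray(x, float), 0.0, 1.0 - 1e-9)
    residual = x.copy()
    spikes = np.zeros((steps,) + x.shape)
    for t in range(steps):
        w = 2.0 ** -(t + 1)
        fire = residual >= w
        spikes[t] = fire
        residual = residual - w * fire  # subtract the emitted contribution
    return spikes

def expg_if_decode(spikes):
    # weighted spike count recovers the activation up to 2**-steps
    weights = 2.0 ** -(np.arange(len(spikes)) + 1)
    return np.tensordot(weights, spikes, axes=1)
```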

[CV-91] Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS)

【速读】:该论文旨在解决脑胶质瘤(Glioma)早期检测中肿瘤区域自动分割的难题,尤其针对肿瘤位置和大小变化带来的挑战。其解决方案的关键在于提出一种新型深度学习模型——自适应双残差U-Net结合注意力门与多尺度空间注意力机制(ADRUwAMS),该模型通过双自适应残差网络架构提取高阶语义特征与低阶细节信息,同时利用注意力门机制动态加权输入特征,并引入多尺度空间注意力模块生成不同尺度的注意力图以增强关键肿瘤区域的信息表达,从而显著提升对不同类型及复杂边界肿瘤的分割精度。

链接: https://arxiv.org/abs/2604.08893
作者: Mohsen Yaghoubi Suraki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Glioma is a harmful brain tumor that requires early detection to ensure better health results. Early detection of this tumor is key for effective treatment and requires an automated segmentation process. However, it is a challenging task to find tumors due to tumor characteristics like location and size. A reliable method to accurately separate tumor zones from healthy tissues is deep learning models, which have shown promising results over the last few years. In this research, an Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS) is introduced. This model is an innovative combination of adaptive dual residual networks, attention mechanisms, and multiscale spatial attention. The dual adaptive residual network architecture captures high-level semantic and intricate low-level details from brain images, ensuring precise segmentation of different tumor parts, types, and hard regions. The attention gates use gating and input signals to compute attention coefficients for the input features, and multiscale spatial attention generates scaled attention maps and combines these features to hold the most significant information about the brain tumor. We trained the model for 200 epochs using the ReLU activation function on BraTS 2020 and BraTS 2019 datasets. These improvements resulted in high accuracy for tumor detection and segmentation on BraTS 2020, achieving dice scores of 0.9229 for the whole tumor, 0.8432 for the tumor core, and 0.8004 for the enhancing tumor.
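The dice scores reported above follow the standard overlap definition between a predicted and a ground-truth tumor region; a minimal implementation:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    # Dice = 2|P ∩ T| / (|P| + |T|); eps keeps the empty-mask case finite
    pred = np.asarray(pred, bool)
    target = np.asarray(target, bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```

Reported per region (whole tumor, tumor core, enhancing tumor), the metric ranges from 0 (no overlap) to 1 (perfect overlap).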

[CV-92] HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在高光谱图像(Hyperspectral Image, HSI)理解能力方面研究不足的问题,尤其针对HSI特有的高维度和复杂的光谱-空间特性带来的挑战。解决方案的关键在于提出首个专门用于评估MLLMs在HSI理解能力的基准测试——HM-Bench,并设计了一种双模态评估框架:将原始高光谱立方体转换为两种互补的表示形式——基于主成分分析(PCA)的复合图像和结构化的文本报告,从而实现对不同输入模态下模型性能的系统性比较。实验表明,视觉输入显著优于文本输入,强调了在高光谱图像理解中依赖光谱-空间证据进行语义锚定的重要性。

链接: https://arxiv.org/abs/2604.08884
作者: Xinyu Zhang,Zurong Mai,Qingmei Li,Zjin Liao,Yibin Wen,Yuhang Chen,Xiaoya Fan,Chan Tsz Ho,Bi Tianyuan,Haoyuan Liang,Ruifeng Su,Zihao Qian,Juepeng Zheng,Jianxi Huang,Yutong Lu,Haohuan Fu
机构: Sun Yat-sen University (中山大学); Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); China Agricultural University (中国农业大学); Southwest Jiaotong University (西南交通大学); Southwest University (西南大学); National Supercomputing Center in Shenzhen (深圳国家超级计算中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral image (HSI) data, a vital modality in remote sensing, remains underexplored. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB images. To address this gap, we introduce the Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of different representations for model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. Dataset and appendix can be accessed at this https URL.
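The PCA-based composite representation can be sketched as projecting the band dimension of an (H, W, B) hyperspectral cube onto its top three principal components, giving an RGB-like image a standard vision-language model can consume. The rescaling and component count here are illustrative choices, not necessarily the benchmark's exact preprocessing.

```python
import numpy as np

def pca_composite(cube, n_components=3):
    """Project an (H, W, B) hyperspectral cube onto its top principal
    components and rescale each channel to [0, 1]."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(float)
    flat -= flat.mean(axis=0)
    # eigenvectors of the band covariance give the principal directions
    cov = flat.T @ flat / (flat.shape[0] - 1)
    vals, vecs = np.linalg.eigh(cov)
    top = vecs[:, np.argsort(vals)[::-1][:n_components]]
    proj = flat @ top
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return ((proj - lo) / (hi - lo + 1e-9)).reshape(h, w, n_components)
```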

[CV-93] Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance

【速读】:该论文旨在解决视觉-语言大模型(Vision-Language Large Models, VLLMs)在多语言和跨模态复合攻击下的安全漏洞问题,即低资源语言文本与有害图像组合可绕过针对高资源语言设计的安全机制,暴露出当前跨语言和跨模态安全方法的结构性盲区。解决方案的关键在于提出一种两阶段框架Precise Shield:首先通过对比有害与无害输入的激活模式识别出安全神经元(safety neurons),随后利用梯度掩码技术仅在该神经元子空间内约束参数更新,影响参数比例低于0.03%。该策略在显著提升安全性的同时保持了多语言和多模态的泛化能力,并揭示了安全神经元在不同语言和模态间存在中等程度重叠,支持零样本跨语言和跨模态安全能力迁移,为基于神经元层面的可迁移安全增强提供了新路径。

链接: https://arxiv.org/abs/2604.08881
作者: Enyi Shi,Fei Shen,Shuyi Miao,Linxia Zhu,Pengyang Shao,Jinhui Tang,Tat-Seng Chua
机构: Nanjing University of Science and Technology (南京理工大学); National University of Singapore (新加坡国立大学); Beihang University (北京航空航天大学); Nanjing Forestry University (南京林业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates strictly within this subspace via gradient masking with affecting fewer than 0.03% of parameters. This strategy substantially improves safety while preserving multilingual and multimodal generalization. Further analysis reveals a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities, and offering a new direction for neuron-level, transfer-based safety enhancement.
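A toy sketch of the two stages, under assumptions: rank neurons by the mean activation gap between harmful and benign inputs, then mask the gradient so updates only touch the selected subspace. The gap statistic and the `top_frac` threshold are illustrative choices, not the paper's exact selection criteria.

```python
import numpy as np

def find_safety_neurons(act_harmful, act_benign, top_frac=0.0003):
    # rank neurons by mean activation gap between harmful and benign inputs
    gap = np.abs(act_harmful.mean(axis=0) - act_benign.mean(axis=0))
    k = max(1, int(top_frac * gap.size))
    return np.argsort(gap)[::-1][:k]

def masked_update(params, grads, safety_idx, lr=1e-3):
    # constrain the update to the safety-neuron subspace: all other
    # coordinates receive zero gradient, leaving general ability untouched
    mask = np.zeros_like(grads)
    mask[safety_idx] = 1.0
    return params - lr * grads * mask
```

With `top_frac` on the order of 3e-4, fewer than 0.03% of parameters move, matching the scale the abstract reports.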

[CV-94] Harnessing Weak Pair Uncertainty for Text-based Person Search

【速读】:该论文旨在解决文本驱动的人体检索(text-based person search)任务中,现有方法因过度依赖严格的一对一视觉-文本模态匹配(如对比学习)而忽略弱正样本(weak positive pairs)的问题,即同一人物在不同摄像头视角下被描述为不同文本时,模型未能有效利用这些潜在的正样本对。其解决方案的关键在于提出一种不确定性感知的方法:首先通过不确定性估计模块量化图像-文本对的相对置信度,进而设计不确定性正则化机制,动态调整损失权重以自适应地降低高不确定性样本的影响;同时引入组级图像-文本匹配损失,增强弱正样本间的表征空间一致性,从而避免模型错误地将潜在的弱正样本推向负方向。

链接: https://arxiv.org/abs/2604.08877
作者: Jintao Sun,Zhedong Zheng,Gangyi Ding
机构: Beijing Institute of Technology (北京理工大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 39 pages, 15 tables, 7 figures

点击查看摘要

Abstract:In this paper, we study text-based person search, which is to retrieve the person of interest via natural language description. Prevailing methods usually focus on strict one-to-one correspondence pair matching between the visual and textual modality, such as contrastive learning. However, such a paradigm unintentionally disregards weak positive image-text pairs, which are of the same person but whose text descriptions are annotated from different views (cameras). To make full use of weak positives, we introduce an uncertainty-aware method to explicitly estimate image-text pair uncertainty, and incorporate the uncertainty into the optimization procedure in a smooth manner. Specifically, our method contains two modules: uncertainty estimation and uncertainty regularization. (1) Uncertainty estimation obtains the relative confidence on the given positive pairs; (2) based on the predicted uncertainty, the uncertainty regularization adaptively adjusts the loss weight. Additionally, we introduce a group-wise image-text matching loss to further facilitate the representation space among the weak pairs. Compared with existing methods, the proposed method explicitly prevents the model from pushing away potentially weak positive candidates. Extensive experiments on three widely used datasets, i.e., CUHK-PEDES, RSTPReid, and ICFG-PEDES, verify mAP improvements of our method over existing competitive methods of +3.06%, +3.55%, and +6.94%, respectively.

[CV-95] BIAS: A Biologically Inspired Algorithm for Video Saliency Detection

【速读】:该论文旨在解决连续视频流中动态视觉显著性检测(dynamic visual saliency detection)的实时性和准确性问题,尤其在以自下而上注意(bottom-up attention)主导的视频场景中表现不佳的传统方法。其解决方案的关键在于提出BIAS模型,该模型基于Itti–Koch框架并引入类视网膜运动检测器(retina-inspired motion detector),以提取时序特征并融合静态与运动信息生成显著图;同时采用贪婪多高斯峰值拟合算法识别关注焦点(foci of attention, FOAs),在竞争机制与信息最大化之间取得平衡,从而实现毫秒级延迟下的高效、可解释的显著区域检测,在DHF1K数据集上优于启发式方法和多个深度学习模型,并在交通事故分析任务中展现出优异的前瞻预测能力(提前0.72秒准确识别因果关系)。

链接: https://arxiv.org/abs/2604.08858
作者: Zhao-ji Zhang,Ya-tang Li
机构: Academy for Advanced Interdisciplinary Studies, Peking University (北京大学前沿交叉学科研究院); Chinese Institute for Brain Research, Beijing (北京脑科学与类脑研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present BIAS, a fast, biologically inspired model for dynamic visual saliency detection in continuous video streams. Building on the Itti–Koch framework, BIAS incorporates a retina-inspired motion detector to extract temporal features, enabling the generation of saliency maps that integrate both static and motion information. Foci of attention (FOAs) are identified using a greedy multi-Gaussian peak-fitting algorithm that balances winner-take-all competition with information maximization. BIAS detects salient regions with millisecond-scale latency and outperforms heuristic-based approaches and several deep-learning models on the DHF1K dataset, particularly in videos dominated by bottom-up attention. Applied to traffic accident analysis, BIAS demonstrates strong real-world utility, achieving state-of-the-art performance in cause-effect recognition and anticipating accidents up to 0.72 seconds before manual annotation with reliable accuracy. Overall, BIAS bridges biological plausibility and computational efficiency to achieve interpretable, high-speed dynamic saliency detection.
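The greedy multi-Gaussian peak-fitting step can be approximated as: pick the global maximum of the saliency map, subtract a Gaussian bump fitted at that location (winner-take-all with inhibition of return, so remaining information stays visible), and repeat. The bump width `sigma` and amplitude rule are assumed values, not the paper's fitted parameters.

```python
import numpy as np

def greedy_foa_peaks(saliency, n_foa=3, sigma=1.5):
    """Greedily pick foci of attention from a 2D saliency map."""
    s = np.asarray(saliency, float).copy()
    yy, xx = np.mgrid[0:s.shape[0], 0:s.shape[1]]
    foas = []
    for _ in range(n_foa):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        foas.append((int(y), int(x)))
        # fit-and-subtract a Gaussian at the winner so the next-strongest
        # region can win the following round
        bump = s[y, x] * np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
        s = s - bump
    return foas
```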

[CV-96] DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization

【速读】:该论文旨在解决深度伪造(Deepfake)检测模型在资源受限的边缘设备上部署时面临的计算复杂度高、参数量大以及量化压缩导致细微伪造痕迹丢失的问题。现有量化技术因无法有效保留判别性特征,常造成检测性能显著下降,限制了实时、本地化推理的应用场景。解决方案的关键在于提出DefakeQ——首个专为深度伪造检测设计的量化框架,其核心创新是一种自适应双向压缩策略,通过联合利用特征相关性和冗余消除,在模型紧凑性与检测性能之间实现平衡,从而在保持高精度的同时实现高效部署。

链接: https://arxiv.org/abs/2604.08847
作者: Xiangyu Li,Yujing Sun,Yuhang Zheng,Yuexin Ma,Kwok-Yan Lam
机构: Nanyang Technological University (南洋理工大学); ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deepfake detection has become a fundamental component of modern media forensics. Despite significant progress in detection accuracy, most existing methods remain computationally intensive and parameter-heavy, limiting their deployment on resource-constrained edge devices that require real-time, on-site inference. This limitation is particularly critical in an era where mobile devices are extensively used for media-centric applications, including online payments, virtual meetings, and social networking. Meanwhile, due to the unique requirement of capturing extremely subtle forgery artifacts for deepfake detection, state-of-the-art quantization techniques usually underperform for such a challenging task. These fine-grained cues are highly sensitive to model compression and can be easily degraded during quantization, leading to noticeable performance drops. This challenge highlights the need for quantization strategies specifically designed to preserve the discriminative features essential for reliable deepfake detection. To address this gap, we propose DefakeQ, the first quantization framework tailored for deepfake detectors, enabling real-time deployment on edge devices. Our approach introduces a novel adaptive bidirectional compression strategy that simultaneously leverages feature correlations and eliminates redundancy, achieving an effective balance between model compactness and detection performance. Extensive experiments across five benchmark datasets and eleven state-of-the-art backbone detectors demonstrate that DeFakeQ consistently surpasses existing quantization and model compression baselines. Furthermore, we deploy DefakeQ on mobile devices in real-world scenarios, demonstrating its capability for real-time deepfake detection and its practical applicability in edge environments.
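For context, the baseline operation that any such compression framework builds on is symmetric uniform quantization of a weight tensor; DefakeQ's adaptive bidirectional strategy itself is not reproduced here, only the quantize/dequantize round trip whose error it tries to keep away from subtle forgery cues.

```python
import numpy as np

def quantize_weights(w, bits=8):
    """Symmetric uniform quantization: map floats to signed integers with a
    per-tensor scale; reconstruction error is bounded by half a scale step."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax + 1e-12
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_weights(q, scale):
    return q.astype(float) * scale
```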

[CV-97] CatalogStitch: Dimension-Aware and Occlusion-Preserving Object Compositing for Catalog Image Generation CVPR2026

【速读】:该论文旨在解决生成式对象合成方法在实际产品目录图像生成中面临的两大问题:一是当产品尺寸不同时,需人工精细调整掩码(mask)以适配目标区域;二是生成后常出现遮挡元素缺失或损坏,需额外进行繁琐的修复工作。解决方案的关键在于提出CatalogStitch,一套模型无关的技术框架,其核心包括两个创新模块:一是基于尺寸感知的掩码计算算法,可自动适应不同产品尺寸并生成匹配的目标区域,无需用户手动干预;二是基于遮挡感知的混合修复方法,能够精确保留被遮挡元素的像素级完整性,从而消除后续编辑流程。实验表明,该方案在三种主流合成模型(ObjectStitch、OmniPaint 和 InsertAnything)上均实现稳定提升,显著增强了生成式合成在商业目录生产中的实用性与易用性。

链接: https://arxiv.org/abs/2604.08836
作者: Sanyam Jain,Pragya Kandari,Manit Singhal,He Zhang,Soo Ye Kim
机构: Adobe(Adobe)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 HiGen Workshop. Project page, this https URL

点击查看摘要

Abstract:Generative object compositing methods have shown remarkable ability to seamlessly insert objects into scenes. However, when applied to real-world catalog image generation, these methods require tedious manual intervention: users must carefully adjust masks when product dimensions differ, and painstakingly restore occluded elements post-generation. We present CatalogStitch, a set of model-agnostic techniques that automate these corrections, enabling user-friendly content creation. Our dimension-aware mask computation algorithm automatically adapts the target region to accommodate products with different dimensions; users simply provide a product image and background, without manual mask adjustments. Our occlusion-aware hybrid restoration method guarantees pixel-perfect preservation of occluding elements, eliminating post-editing workflows. We additionally introduce CatalogStitch-Eval, a 58-example benchmark covering aspect-ratio mismatch and occlusion-heavy catalog scenarios, together with supplementary PDF and HTML viewers. We evaluate our techniques with three state-of-the-art compositing models (ObjectStitch, OmniPaint, and InsertAnything), demonstrating consistent improvements across diverse catalog scenarios. By reducing manual intervention and automating tedious corrections, our approach transforms generative compositing into a practical, human-friendly tool for production catalog workflows.
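A sketch of dimension-aware mask computation, assuming the goal is to fit a product of a given aspect ratio inside the user's target region while preserving that ratio, bottom-aligned so items appear to rest on surfaces. The alignment rules are assumptions for illustration; the paper's algorithm may differ.

```python
import numpy as np

def dimension_aware_mask(bg_h, bg_w, region, prod_h, prod_w):
    """Fit a product of aspect ratio prod_h/prod_w inside a region
    (y, x, h, w) of a bg_h x bg_w background, preserving aspect ratio.
    Returns a binary mask for the adapted target area."""
    y, x, h, w = region
    scale = min(h / prod_h, w / prod_w)   # largest ratio-preserving fit
    mh, mw = int(prod_h * scale), int(prod_w * scale)
    my = y + h - mh                       # bottom-aligned within the region
    mx = x + (w - mw) // 2                # horizontally centred
    mask = np.zeros((bg_h, bg_w), dtype=np.uint8)
    mask[my:my + mh, mx:mx + mw] = 1
    return mask
```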

[CV-98] Post-Hoc Guidance for Consistency Models by Joint Flow Distribution Learning

【速读】:该论文旨在解决一致性模型(Consistency Models, CMs)在生成过程中缺乏灵活可控的引导机制的问题,尤其是在不依赖扩散模型(Diffusion Models, DMs)教师模型的前提下实现类似无分类器引导(Classifier-free Guidance, CFG)的效果。现有CMs的引导方法通常依赖于从DM中蒸馏知识,限制了其灵活性与实用性。论文提出联合流分布学习(Joint Flow Distribution Learning, JFDL),其关键在于利用预训练CM作为常微分方程(ODE)求解器,通过正态性检验验证条件与非条件分布对应的速率场(velocity fields)所隐含的噪声呈高斯分布,从而构建轻量级对齐机制,使CM具备可调节的引导能力,无需额外训练或依赖DM教师模型即可实现高效、可控的图像生成,显著提升了CM在CIFAR-10和ImageNet 64x64数据集上的生成质量(FID指标)。

链接: https://arxiv.org/abs/2604.08828
作者: Chia-Hong Hsu,Randall Balestriero
机构: Brown University (布朗大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Classifier-free Guidance (CFG) lets practitioners trade-off fidelity against diversity in Diffusion Models (DMs). The practicality of CFG is however hindered by DMs sampling cost. On the other hand, Consistency Models (CMs) generate images in one or a few steps, but existing guidance methods require knowledge distillation from a separate DM teacher, limiting CFG to Consistency Distillation (CD) methods. We propose Joint Flow Distribution Learning (JFDL), a lightweight alignment method enabling guidance in a pre-trained CM. With a pre-trained CM as an ordinary differential equation (ODE) solver, we verify with normality tests that the variance-exploding noise implied by the velocity fields from unconditional and conditional distributions is Gaussian. In practice, JFDL equips CMs with the familiar adjustable guidance knob, yielding guided images with similar characteristics to CFG. Applied to an original Consistency Trained (CT) CM that could only do conditional sampling, JFDL unlocks guided generation and reduces FID on both CIFAR-10 and ImageNet 64x64 datasets. This is the first time that CMs are able to receive effective guidance post-hoc without a DM teacher, thus, bridging a key gap in current methods for CMs.
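The "adjustable guidance knob" follows the classifier-free-guidance form, here applied to the velocity fields the abstract describes; this one-liner is the standard interpolation/extrapolation rule, not JFDL's alignment procedure itself.

```python
import numpy as np

def guided_velocity(v_uncond, v_cond, w):
    # w = 0 -> unconditional, w = 1 -> conditional,
    # w > 1 extrapolates toward the condition for higher fidelity
    return v_uncond + w * (v_cond - v_uncond)
```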

[CV-99] SenBen: Sensitive Scene Graphs for Explainable Content Moderation CVPR

【速读】:该论文旨在解决当前内容审核系统在敏感内容识别中缺乏空间定位和可解释性的问题,即无法明确指出检测到的敏感行为、涉及主体及发生位置。其解决方案的关键在于构建首个大规模场景图基准数据集——Sensitive Benchmark (SenBen),并提出一种多任务蒸馏方法,通过Suffix-based Object Identity、Vocabulary-Aware Recall (VAR) Loss以及解耦式Query2Label标签头与非对称损失函数,有效缓解自回归场景图生成中的词汇不平衡问题,从而显著提升模型在敏感内容召回率(+6.4个百分点)和跨任务性能表现,同时实现推理速度提升7.6倍、GPU内存消耗降低16倍。

链接: https://arxiv.org/abs/2604.08819
作者: Fatih Cagatay Akyon,Alptekin Temizel
机构: METU; Ultralytics Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted at CVPRW 2026

点击查看摘要

Abstract:Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at 7.6× faster inference and 16× less GPU memory.
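One plausible form of vocabulary-aware weighting for countering the class imbalance described above: up-weight rare tokens by inverse frequency and normalize so the average weight stays 1. The smoothing exponent `alpha` and the exact weighting formula are assumptions, not the published VAR Loss.

```python
import numpy as np

def var_weights(token_counts, alpha=0.5):
    """Vocabulary-aware weights: up-weight rare scene-graph tokens so the
    autoregressive student does not collapse onto frequent classes."""
    counts = np.asarray(token_counts, float)
    w = (counts.sum() / (counts + 1.0)) ** alpha  # inverse-frequency, smoothed
    return w / w.mean()                            # keep the mean weight at 1

def weighted_nll(logp_target, target_ids, weights):
    # per-token negative log-likelihood scaled by the token's vocab weight
    return float(np.mean(-logp_target * weights[target_ids]))
```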

[CV-100] owards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

【速读】:该论文旨在解决医学视觉语言模型(Medical Vision-Language Models, VLMs)在放射学任务中因过度依赖单一模态而导致的结论缺乏可靠依据的问题,即生成的诊断结论虽流畅但语义 grounding 弱。其解决方案的关键在于提出一种上下文对齐推理框架(context-aligned reasoning framework),通过引入结构化的上下文信号(包括影像组学统计、可解释性激活和词汇锚定语义线索)来强制多源异构临床证据之间的一致性,从而在生成结构化输出(含支持证据、不确定性估计、局限性和安全提示)前实现跨模态验证。实验表明,仅使用辅助信号效果有限,唯有通过上下文验证整合才能显著提升判别性能(AUC 从 0.918 提升至 0.925),同时大幅减少幻觉关键词(从 1.14 降至 0.25)并缩短推理文本长度(从 19.4 词降至 15.3 词),且不增加模型置信度,验证了多证据一致性对提升医疗多模态推理可靠性与可信度的核心作用。

链接: https://arxiv.org/abs/2604.08815
作者: Sumra Khan,Sagar Chhabriya,Aizan Zafar,Sheeraz Arif,Amgad Muneer,Anas Zafar,Shaina Raza,Rizwan Qureshi
机构: Salim Habib University (萨利姆·哈比卜大学); Institute of Business Administration Sukkur (苏克鲁商业管理学院); University of Central Florida (中佛罗里达大学); The University of Texas MD Anderson Cancer Center (德克萨斯大学MD安德森癌症中心); Toronto Metropolitan University (多伦多都会大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.

[CV-101] R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII CVPR2026

【速读】:该论文旨在解决物理设计任务中图神经网络(GNN)应用进展受限的问题,主要源于电路表示不一致以及缺乏受控的评估协议。解决方案的关键在于提出R2G(RTL-to-GDSII)多视角电路图基准套件,其通过标准化五个阶段感知的视图(stage-aware views),确保信息对齐(information parity),即每个视图编码相同的属性集,仅在特征附着位置上存在差异,并覆盖30个开源IP核(节点/边数达10⁶量级)。R2G还提供从DEF到图的端到端处理流程、统一的数据划分、领域指标及可复现的基线模型,从而将表示选择与模型选择解耦,有效控制了先前EDA和图机器学习基准中未被约束的混杂因素。

链接: https://arxiv.org/abs/2604.08810
作者: Zewei Zhou,Jiajun Zou,Jiajia Zhang,Ao Yang,Ruichao He,Haozheng Zhou,Ao Liu,Jiawei Liu,Leilei Jin,Shan Shen,Daying Sun
机构: Nanjing University of Science and Technology (南京理工大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted as a poster by CVPR2026

点击查看摘要

Abstract:Graph neural networks (GNNs) are increasingly applied to physical design tasks such as congestion prediction and wirelength estimation, yet progress is hindered by inconsistent circuit representations and the absence of controlled evaluation protocols. We present R2G (RTL-to-GDSII), a multi-view circuit-graph benchmark suite that standardizes five stage-aware views with information parity (every view encodes the same attribute set, differing only in where features attach) over 30 open-source IP cores (up to 10^6 nodes/edges). R2G provides an end-to-end DEF-to-graph pipeline spanning synthesis, placement, and routing stages, together with loaders, unified splits, domain metrics, and reproducible baselines. By decoupling representation choice from model choice, R2G isolates a confound that prior EDA and graph-ML benchmarks leave uncontrolled. In systematic studies with GINE, GAT, and ResGatedGCN, we find: (i) view choice dominates model choice, with Test R^2 varying by more than 0.3 across representations for a fixed GNN; (ii) node-centric views generalize best across both placement and routing; and (iii) decoder-head depth (3–4 layers) is the primary accuracy driver, turning divergent training into near-perfect predictions (R^2 > 0.99). Code and datasets are available at this https URL.

[CV-102] MeshOn: Intersection-Free Mesh-to-Mesh Composition

【速读】:该论文旨在解决如何将两个输入网格(如配件与主体模型)进行物理上和语义上都合理的组合问题,尤其在避免表面交叉的前提下实现精准贴合。解决方案的关键在于提出了一种多阶段优化框架:首先利用视觉-语言模型(Vision-to-Language Models)进行结构化的初始刚性对齐,随后通过吸引性的几何损失与一种受物理启发的障碍损失(barrier loss)来防止表面相交并优化姿态,最后借助扩散先验(diffusion prior)实现最终的形变调整,从而生成高质量、真实感强的组合结果。

链接: https://arxiv.org/abs/2604.08799
作者: Hyunwoo Kim,Itai Lang,Hadar Averbuch-Elor,Silvia Sellán,Rana Hanocka
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We propose MeshOn, a method that finds physically and semantically realistic compositions of two input meshes. Given an accessory, a base mesh with a user-defined target region, and optional text strings for both meshes, MeshOn uses a multi-step optimization framework to realistically fit the meshes onto each other while preventing intersections. We initialize the shapes’ rigid configuration via a structured alignment scheme using Vision-to-Language Models, which we then optimize using a combination of attractive geometric losses, and a physics-inspired barrier loss that prevents surface intersections. We then obtain a final deformation of the object, assisted by a diffusion prior. Our method successfully fits accessories of various materials over a breadth of target regions, and is designed to fit directly into existing digital artist workflows. We demonstrate the robustness and accuracy of our pipeline by comparing it with generative approaches and traditional registration algorithms.

[CV-103] InstrAct: Towards Action-Centric Understanding in Instructional Videos

【速读】:该论文旨在解决当前视频基础模型(Video Foundation Models, VFMs)在理解教学类视频时面临的两大挑战:一是难以识别细粒度动作并建模其时间关系,二是存在普遍的“静态偏差”(static bias),即模型过度依赖物体特征而非运动线索。解决方案的关键在于提出一种名为InstrAction的预训练框架,其核心创新包括:(1)数据驱动策略,通过过滤噪声文本并生成以动作为中心的难样本负例,强化对比学习中动作与物体的解耦;(2)视觉特征层面引入Action Perceiver模块,从冗余视频编码中提取与运动相关的关键token;(3)设计两个辅助目标——动态时间规整对齐(DTW-Align)用于建模序列时间结构,以及掩码动作建模(MAM)以增强跨模态对齐能力。该方法显著提升了模型在动作级理解任务上的表现。

链接: https://arxiv.org/abs/2604.08762
作者: Zhuoyi Yang,Jiapeng Yu,Reuben Tan,Boyang Li,Huijuan Xu
机构: Pennsylvania State University (宾夕法尼亚州立大学); Microsoft Research; Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive “static bias”, where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for instructional videos’ action-centric representations. We first introduce a data-driven strategy, which filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and Masked Action Modeling (MAM) for strengthening cross-modal grounding. Finally, we introduce the InstrAct Bench to evaluate action-centric understanding, where our method consistently outperforms state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval tasks.
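
上文摘要中的动态时间规整对齐(DTW-Align)建立在经典 DTW 累积代价之上。下面是一个与论文实现无关、仅作原理示意的最简 numpy 草图(序列与代价均为虚构示例):

```python
import numpy as np

def dtw_cost(a, b):
    """计算两条一维序列间的 DTW 累积代价(最简教学草图)。"""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # 逐点距离
            # 允许插入/删除/匹配三种对齐方式
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# 节奏不同但内容一致的两条"动作特征"序列,DTW 代价为 0
cost = dtw_cost([1, 1, 2, 3], [1, 2, 2, 3])
```

实际的 DTW-Align 作用于高维视频/文本特征并被用作训练目标,此处仅展示对齐代价的递推结构。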

[CV-104] State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition ICLR2026

【速读】:该论文旨在解决手语识别中的灾难性扩展失败问题(catastrophic scaling failure),即当前模型在小词汇量下表现良好,但在接近真实场景的大规模词汇中性能急剧下降。其根本原因在于现有架构将手势视为原子视觉模式,学习扁平表示,无法利用手语的组合结构——手语系统由离散的音位参数(handshape, location, movement, orientation)构成,这些参数在词汇中重复使用。解决方案的关键在于引入PHONSSM框架,通过解剖学基础的图注意力机制强制进行音位分解、显式地将特征因子化到正交子空间,并采用原型分类实现少样本迁移。该方法仅使用骨架数据就在最大的ASL数据集上达到72.1%的准确率(比骨架SOTA提升18.4个百分点),并在少样本场景下相对增益达225%,且能零样本迁移到ASL Citizen数据集,超越监督RGB基线,验证了基于语言结构的组合归纳偏置对词汇扩展瓶颈的有效性。

链接: https://arxiv.org/abs/2604.08761
作者: Bryan Cheng,Austin Jin,Jasper Zhang
机构: William A. Shine Great Neck South High School (威廉·A·夏恩大颈南高中)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures. Accepted to workshop on Algorithmic Fairness Across Alignment Procedures and Agentic Systems at ICLR 2026

点击查看摘要

Abstract:Sign language recognition suffers from catastrophic scaling failure: models achieving high accuracy on small vocabularies collapse at realistic sizes. Existing architectures treat signs as atomic visual patterns, learning flat representations that cannot exploit the compositional structure of sign languages-systematically organized from discrete phonological parameters (handshape, location, movement, orientation) reused across the vocabulary. We introduce PHONSSM, enforcing phonological decomposition through anatomically-grounded graph attention, explicit factorization into orthogonal subspaces, and prototypical classification enabling few-shot transfer. Using skeleton data alone on the largest ASL dataset ever assembled (5,565 signs), PHONSSM achieves 72.1% on WLASL2000 (+18.4pp over skeleton SOTA), surpassing most RGB methods without video input. Gains are most dramatic in the few-shot regime (+225% relative), and the model transfers zero-shot to ASL Citizen, exceeding supervised RGB baselines. The vocabulary scaling bottleneck is fundamentally a representation learning problem, solvable through compositional inductive biases mirroring linguistic structure.
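
摘要中用于少样本迁移的原型分类(prototypical classification)思路很简单:每个类的原型取其支持样本嵌入的均值,查询样本归入最近的原型。以下 numpy 草图仅作示意(嵌入与标签均为虚构数据,非论文 PHONSSM 的实际实现):

```python
import numpy as np

def prototype_classify(query, support, labels):
    """最近原型分类:各类原型 = 该类支持嵌入的均值。"""
    protos = {c: support[labels == c].mean(axis=0) for c in np.unique(labels)}
    dists = {c: np.linalg.norm(query - p) for c, p in protos.items()}
    return min(dists, key=dists.get)  # 返回距离最小的类别

support = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
pred = prototype_classify(np.array([0.2, 0.3]), support, labels)  # 归入类别 0
```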

[CV-105] SIC3D: Style Image Conditioned Text-to-3D Gaussian Splatting Generation

【速读】:该论文旨在解决当前文本到三维(text-to-3D)生成方法中存在可控性不足和纹理歧义的问题,这些问题主要源于文本模态本身的表达局限。为应对这一挑战,作者提出SIC3D(Style Image Conditioned text-to-3D generation),其核心创新在于采用两阶段图像条件化的生成流程:第一阶段利用文本驱动的3D高斯泼溅(3D Gaussian Splatting, 3DGS)模型生成基础几何结构;第二阶段通过引入一种新颖的变分风格化得分蒸馏(Variational Stylized Score Distillation, VSSD)损失函数,实现从参考图像到3DGS的风格迁移,该损失能有效捕捉全局与局部纹理特征并缓解几何与外观之间的冲突;同时,进一步施加缩放正则化以抑制伪影并保留风格图像中的纹理模式,从而显著提升几何保真度和风格一致性,在定性和定量评估中均优于现有方法。

链接: https://arxiv.org/abs/2604.08760
作者: Ming He,Zhixiang Chen,Steve Maddock
机构: University of Sheffield (谢菲尔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent progress in text-to-3D object generation enables the synthesis of detailed geometry from text input by leveraging 2D diffusion models and differentiable 3D representations. However, the approaches often suffer from limited controllability and texture ambiguity due to the limitation of the text modality. To address this, we present SIC3D, a controllable image-conditioned text-to-3D generation pipeline with 3D Gaussian Splatting (3DGS). There are two stages in SIC3D. The first stage generates the 3D object content from text with a text-to-3DGS generation model. The second stage transfers style from a reference image to the 3DGS. Within this stylization stage, we introduce a novel Variational Stylized Score Distillation (VSSD) loss to effectively capture both global and local texture patterns while mitigating conflicts between geometry and appearance. A scaling regularization is further applied to prevent the emergence of artifacts and preserve the pattern from the style image. Extensive experiments demonstrate that SIC3D enhances geometric fidelity and style adherence, outperforming prior approaches in both qualitative and quantitative evaluations.

[CV-106] AniGen: Unified S3 Fields for Animatable 3D Asset Generation

【速读】:该论文旨在解决当前3D生成模型所生成的资产通常是静态且缺乏可动画性的问题,即如何从单张图像直接生成具备可动画性的3D资产(animate-ready 3D assets),这些资产需包含拓扑一致的骨架(skeleton)与皮肤权重(skinning weights)。传统方法依赖后处理自动绑定(auto-rigging)往往失败,因为生成的骨架常与几何体拓扑不一致。解决方案的关键在于提出AniGen框架,其核心创新是将形状、骨架和皮肤权重统一表示为定义在共享空间域上的S³场(Shape, Skeleton, Skin Fields),并引入两项关键技术:(i) 一种置信度衰减的骨架场,用于显式处理Voronoi边界处骨骼预测的几何模糊性;(ii) 一种双皮肤特征场,解耦皮肤权重与特定关节数量,使固定架构网络能够生成任意复杂度的绑定结构。该框架基于两阶段流匹配(flow-matching)管道,先生成稀疏结构骨架,再在结构化潜在空间中生成稠密几何与关节运动信息,显著提升绑定有效性与动画质量。

链接: https://arxiv.org/abs/2604.08746
作者: Yi-Hua Huang,Zi-Xin Zou,Yuting He,Chirui Chang,Cheng-Feng Pu,Ziyi Yang,Yuan-Chen Guo,Yan-Pei Cao,Xiaojuan Qi
机构: The University of Hong Kong(香港大学); The Chinese University of Hong Kong(香港中文大学); Tsinghua University(清华大学); VAST
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 12 figures

点击查看摘要

Abstract:Animatable 3D assets, defined as geometry equipped with an articulated skeleton and skinning weights, are fundamental to interactive graphics, embodied agents, and animation production. While recent 3D generative models can synthesize visually plausible shapes from images, the results are typically static. Obtaining usable rigs via post-hoc auto-rigging is brittle and often produces skeletons that are topologically inconsistent with the generated geometry. We present AniGen, a unified framework that directly generates animate-ready 3D assets conditioned on a single image. Our key insight is to represent shape, skeleton, and skinning as mutually consistent S^3 Fields (Shape, Skeleton, Skin) defined over a shared spatial domain. To enable the robust learning of these fields, we introduce two technical innovations: (i) a confidence-decaying skeleton field that explicitly handles the geometric ambiguity of bone prediction at Voronoi boundaries, and (ii) a dual skin feature field that decouples skinning weights from specific joint counts, allowing a fixed-architecture network to predict rigs of arbitrary complexity. Built upon a two-stage flow-matching pipeline, AniGen first synthesizes a sparse structural scaffold and then generates dense geometry and articulation in a structured latent space. Extensive experiments demonstrate that AniGen substantially outperforms state-of-the-art sequential baselines in rig validity and animation quality, generalizing effectively to in-the-wild images across diverse categories including animals, humanoids, and machinery. Homepage: this https URL

[CV-107] LPLCv2: An Expanded Dataset for Fine-Grained License Plate Legibility Classification

【速读】:该论文旨在解决大规模真实场景下车牌识别(Automatic License Plate Recognition, ALPR)系统性能下降的问题,尤其是由低质量成像设备、压缩伪影及非理想摄像机安装导致的识别困难。其关键解决方案在于:首先,将原始基准数据集扩展至三倍以上规模,并引入更精确的标注信息(包括车牌级的边界框、文本内容与可读性等级,以及车辆级和图像级的丰富标签),从而提升模型训练与评估的可靠性;其次,提出一种基于指数移动平均(Exponential Moving Average)的损失函数和优化的学习率调度策略,有效缓解测试阶段常见错误,使基线模型在测试集上F1分数达到89.5%,显著优于先前最优方法。此外,论文还设计了一种新协议以明确处理训练与测试数据间摄像机污染问题,验证了其影响较小。

链接: https://arxiv.org/abs/2604.08741
作者: Lucas Wojcik,Eduardo A. F. Machoski,Eduil Nascimento Jr.,Rayson Laroca,David Menotti
机构: Universidade Federal do Paraná (巴西巴拉那联邦大学); Pontifícia Universidade Católica do Paraná (巴西巴拉那天主教大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern Automatic License Plate Recognition (ALPR) systems achieve outstanding performance in controlled, well-defined scenarios. However, large-scale real-world usage remains challenging due to low-quality imaging devices, compression artifacts, and suboptimal camera installation. Identifying illegible license plates (LPs) has recently become feasible through a dedicated benchmark; however, its impact has been limited by its small size and annotation errors. In this work, we expand the original benchmark to over three times the size with two extra capture days, revise its annotations and introduce novel labels. LP-level annotations include bounding boxes, text, and legibility level, while vehicle-level annotations comprise make, model, type, and color. Image-level annotations feature camera identity, capture conditions (e.g., rain and faulty cameras), acquisition time, and day ID. We present a novel training procedure featuring an Exponential Moving Average-based loss function and a refined learning rate scheduler, addressing common mistakes in testing. These improvements enable a baseline model to achieve an 89.5% F1-score on the test set, considerably surpassing the previous state of the art. We further introduce a novel protocol to explicitly address camera contamination between training and evaluation splits, where results show a small impact. Dataset and code are publicly available at this https URL.
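
摘要提到的基于指数移动平均(EMA)的损失,其核心是递推 v_t = β·v_{t-1} + (1-β)·ℓ_t。下面用纯 Python 给出一个假设性的极简草图(β 取值与接口均为示意,并非论文原实现):

```python
class EmaLoss:
    """对逐步损失做指数移动平均平滑(示意实现)。"""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.value = None  # 尚未观测到任何损失

    def update(self, loss):
        if self.value is None:
            self.value = loss  # 用首个损失初始化
        else:
            self.value = self.beta * self.value + (1 - self.beta) * loss
        return self.value

ema = EmaLoss(beta=0.5)
history = [ema.update(l) for l in [1.0, 0.0, 0.0]]  # [1.0, 0.5, 0.25]
```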

[CV-108] AI Driven Soccer Analysis Using Computer Vision

【速读】:该论文旨在解决体育赛事中团队表现分析的复杂性问题,尤其是如何从比赛视频中提取高精度、可操作的战术数据。传统视频分析方法难以自动识别和量化球员位置、移动轨迹及场内空间关系,限制了教练决策与训练优化的效果。解决方案的关键在于构建一个融合多模型的计算机视觉系统:首先采用YOLO或Faster R-CNN等目标检测模型识别并跟踪球员(对象检测与跟踪),再结合SAM2(Segment Anything Model 2)进行精准分割;同时利用卷积神经网络(CNN)检测球场关键点,并通过单应性变换(homography)将摄像头视角下的坐标映射到真实地面坐标系,从而实现跨视角的空间定位一致性。此方法使球员速度、跑动距离、热力图及团队协同统计等战术指标得以精确计算,为教练提供前所未有的精细化数据支持。

链接: https://arxiv.org/abs/2604.08722
作者: Adrian Manchado,Tanner Cellio,Jonathan Keane,Yiyang Wang
机构: Milwaukee School of Engineering (密尔沃基工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sport analysis is crucial for team performance since it provides actionable data that can inform coaching decisions, improve player performance, and enhance team strategies. To analyze more complex features from game footage, a computer vision model can be used to identify and track key entities from the field. We propose the use of an object detection and tracking system to predict player positioning throughout the game. To translate this to positioning in relation to the field dimensions, we use a point prediction model to identify key points on the field and combine these with known field dimensions to extract actual distances. For the player-identification model, object detection models like YOLO and Faster R-CNN are evaluated on the accuracy of our custom video footage using multiple different evaluation metrics. The goal is to identify the best model for object identification to obtain the most accurate results when paired with SAM2 (Segment Anything Model 2) for segmentation and tracking. For the key point detection model, we use a CNN model to find consistent locations in the soccer field. Through homography, the positions of points and objects in the camera perspective will be transformed to a real-ground perspective. The segmented player masks from SAM2 are transformed from camera perspective to real-world field coordinates through homography, regardless of camera angle or movement. The transformed real-world coordinates can be used to calculate valuable tactical insights including player speed, distance covered, positioning heatmaps, and more complex team statistics, providing coaches and players with actionable performance data previously unavailable from standard video analysis.
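
摘要中"通过单应性把相机视角坐标变换到真实地面坐标"的一步,可以用如下 numpy 草图示意(H 矩阵与球员像素坐标均为虚构示例;真实系统中 H 需由球场关键点对应关系求解):

```python
import numpy as np

def apply_homography(H, pts):
    """将像素坐标 (N,2) 经 3x3 单应性矩阵 H 映射到地面坐标。"""
    pts = np.asarray(pts, dtype=float)
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])  # 升为齐次坐标
    mapped = homo @ H.T
    return mapped[:, :2] / mapped[:, 2:3]  # 除以尺度分量回到欧氏坐标

# 玩具 H:把 1280x720 画面线性缩放到 105m x 68m 的标准球场
H = np.diag([105 / 1280, 68 / 720, 1.0])
field_xy = apply_homography(H, [[640, 360], [0, 0]])  # 画面中心映到球场中心
```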

[CV-109] LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

【速读】:该论文旨在解决自动驾驶系统在长尾场景(long-tail scenarios)和开放世界(open-world)环境下泛化能力不足的问题,这是制约其大规模部署的关键瓶颈。解决方案的关键在于提出LMGenDrive框架,首次将基于大语言模型(LLM)的多模态理解与生成式世界模型(generative world models)统一建模,实现端到端闭环驾驶:给定多视角摄像头输入和自然语言指令,模型同时生成未来驾驶视频和控制信号。这一设计通过视频预测增强时空场景建模能力,借助LLM提供强语义先验与指令对齐能力,从而提升对罕见及安全关键场景的理解与决策鲁棒性。

链接: https://arxiv.org/abs/2604.08719
作者: Hao Shao,Letian Wang,Yang Zhou,Yuxuan Hu,Zhuofan Zong,Steven L. Waslander,Wei Zhan,Hongsheng Li
机构: CUHK MMLab (香港中文大学多媒体实验室); University of Toronto (多伦多大学); UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio-temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large-scale pretraining. We further propose a progressive three-stage training strategy, from vision pretraining to multi-step long-horizon driving, to improve stability and performance. LMGenDrive supports both low-latency online planning and autoregressive offline video generation. Experiments show that it significantly outperforms prior methods on challenging closed-loop benchmarks, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios. These results suggest that unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision-making systems.

[CV-110] Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring

【速读】:该论文旨在解决基于几何基础模型(Geometric Foundation Models, GFMs)的单目SLAM系统在处理密集视频流时存在的计算冗余问题。现有方法依赖事后关键帧选择策略,导致必须对每一帧进行昂贵的密集几何解码以判断其是否包含新几何信息,从而造成延迟拒绝和无效计算。解决方案的关键在于提出LeanGate——一个轻量级前馈帧门控网络,通过预测几何效用分数(geometric utility score)在重型GFM特征提取与匹配阶段之前评估当前帧的建图价值,实现高效过滤冗余帧;作为可插拔模块,LeanGate能跳过超过90%的无用帧,在显著降低跟踪计算量(FLOPs减少85%以上)的同时保持与密集基线相当的定位与建图精度,并实现端到端吞吐量提升5倍。

链接: https://arxiv.org/abs/2604.08718
作者: Xinmiao Xiong,Bangya Liu,Hao Wang,Dayou Li,Nuo Chen,Andrew Feng,Mingyu Ding,Suman Banerjee,Yang Zhou,Zhiwen Fan
机构: UW–Madison (威斯康星大学麦迪逊分校); Texas A&M (德克萨斯农工大学); USC (南加州大学); UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post hoc keyframe selection. Because of this, they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, resulting in late rejection and wasted computation. To mitigate this inefficiency, we propose LeanGate, a lightweight feed-forward frame-gating network. LeanGate predicts a geometric utility score to assess a frame’s mapping value prior to the heavy GFM feature extraction and matching stages. As a predictive plug-and-play module, our approach bypasses over 90% of redundant frames. Evaluations on standard SLAM benchmarks demonstrate that LeanGate reduces tracking FLOPs by more than 85% and achieves a 5x end-to-end throughput speedup. Furthermore, it maintains the tracking and mapping accuracy of dense baselines.

[CV-111] What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction

【速读】:该论文旨在解决虚拟试穿(Virtual Try-On, VTON)领域中尚未充分研究的逆问题——虚拟脱衣(Virtual Try-Off, VTOFF),即从穿着状态的服装图像中重建出原始形态的规范服装(canonical garment)。其解决方案的关键在于构建一个基于扩散模型(diffusion-based models)的稳健架构,重点围绕三个设计维度展开:(i) 生成骨干网络(Generation Backbone),对比不同Stable Diffusion变体;(ii) 条件输入机制(Conditioning),包括掩码设计、是否使用掩码输入及高阶语义特征的有效性;(iii) 损失函数与训练策略,评估注意力引导辅助损失、感知目标以及多阶段课程学习调度的影响。通过系统性实验验证,该框架在VITON-HD和DressCode数据集上实现了SOTA性能,将主要指标DISTS降低9.5%(该指标越低越好),为VTOFF任务提供了更强基线和可复用的设计洞见。

链接: https://arxiv.org/abs/2604.08716
作者: Loc-Phat Truong,Meysam Madadi,Sergio Escalera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Virtual Try-On (VTON) has seen rapid advancements, providing a strong foundation for generative fashion tasks. However, the inverse problem, Virtual Try-Off (VTOFF)-aimed at reconstructing the canonical garment from a draped-on image-remains a less understood domain, distinct from the heavily researched field of VTON. In this work, we seek to establish a robust architectural foundation for VTOFF by studying and adapting various diffusion-based strategies from VTON and general Latent Diffusion Models (LDMs). We focus our investigation on the Dual-UNet Diffusion Model architecture and analyze three axes of design: (i) Generation Backbone: comparing Stable Diffusion variants; (ii) Conditioning: ablating different mask designs, masked/unmasked inputs for image conditioning, and the utility of high-level semantic features; and (iii) Losses and Training Strategies: evaluating the impact of the auxiliary attention-based loss, perceptual objectives and multi-stage curriculum schedules. Extensive experiments reveal trade-offs across various configuration options. Evaluated on VITON-HD and DressCode datasets, our framework achieves state-of-the-art performance with a drop of 9.5% on the primary metric DISTS and competitive performance on LPIPS, FID, KID, and SSIM, providing both stronger baselines and insights to guide future Virtual Try-Off research.

[CV-112] Deep Learning-Based Tracking and Lineage Reconstruction of Ligament Breakup

【速读】:该论文旨在解决液膜破碎过程中多尺度瞬态动力学难以从高速阴影摄影图像中定量分析的问题,特别是如何准确识别并追踪破碎形成的液丝(ligament)、液滴(droplet)和团块(blob),以及建模其在时间序列中的父代-子代关系。传统多目标跟踪框架因强制一对一的时间关联而无法描述一对多的断裂事件,限制了喷雾分析的准确性。解决方案的关键在于提出一个两阶段深度学习框架:第一阶段采用基于ResNet-50与特征金字塔网络(Feature Pyramid Network)的Faster R-CNN模型,结合形态保持的合成数据增强策略,在高精度检测和分类液丝与液滴的同时避免物理上不合理的配置;第二阶段引入Transformer增强的多层感知机(Transformer-augmented multilayer perceptron),利用物理信息驱动的几何特征对帧间关联进行分类(延续、断裂、非关联),有效克服严重类别不平衡问题,并实现对断裂事件的完美召回率(1.00)与高精度(93.2%)。该框架可自动重建断裂树结构,保留父代-子代谱系关系,并提取包括碎片多重性和液滴尺寸分布在内的关键破碎统计量,从而实现对初级雾化模式的自动化分析。

链接: https://arxiv.org/abs/2604.08711
作者: Vrushank Ahire,Vivek Kurumanghat,Mudasir Ganaie,Lipika Kabiraj
机构: Indian Institute of Technology Ropar (印度理工学院罗帕尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The disintegration of liquid sheets into ligaments and droplets involves highly transient, multi-scale dynamics that are difficult to quantify from high-speed shadowgraphy images. Identifying droplets, ligaments, and blobs formed during breakup, along with tracking across frames, is essential for spray analysis. However, conventional multi-object tracking frameworks impose strict one-to-one temporal associations and cannot represent one-to-many fragmentation events. In this study, we present a two-stage deep learning framework for object detection and temporal relationship modeling across frames. The framework captures ligament deformation, fragmentation, and parent-child lineage during liquid sheet disintegration. In the first stage, a Faster R-CNN with a ResNet-50 backbone and Feature Pyramid Network detects and classifies ligaments and droplets in high-speed shadowgraphy recordings of an impinging Carbopol gel jet. A morphology-preserving synthetic data generation strategy augments the training set without introducing physically implausible configurations, achieving a held-out F1 score of up to 0.872 across fourteen original-to-synthetic configurations. In the second stage, a Transformer-augmented multilayer perceptron classifies inter-frame associations into continuation, fragmentation (one-to-many), and non-association using physics-informed geometric features. Despite severe class imbalance, the model achieves 86.1% accuracy, 93.2% precision, and perfect recall (1.00) for fragmentation events. Together, the framework enables automated reconstruction of fragmentation trees, preservation of parent-child lineage, and extraction of breakup statistics such as fragment multiplicity and droplet size distributions. By explicitly identifying children droplets formed from ligament fragmentation, the framework provides automated analysis of the primary atomization mode.

[CV-113] RS-OVC: Open-Vocabulary Counting for Remote-Sensing Data

【速读】:该论文旨在解决遥感(Remote-Sensing, RS)图像中对象计数方法在面对未见过的新类别时缺乏泛化能力的问题。现有方法通常局限于预定义的封闭类别集合,导致在实际应用中需耗费大量成本进行重新标注和模型训练,难以适应动态监测场景。解决方案的关键在于提出首个面向遥感与航空影像的开放词汇计数(Open Vocabulary Counting, OVC)模型 RS-OVC,其通过文本和/或视觉条件控制,实现对训练阶段未见类别的准确计数,从而显著提升模型在真实世界复杂场景中的适应性和实用性。

链接: https://arxiv.org/abs/2604.08704
作者: Tamir Shor,George Leifman,Genady Beryozkin
机构: Google Research(谷歌研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object-Counting for remote-sensing (RS) imagery is attracting increasing research interest due to its crucial role in a wide and diverse set of applications. While several promising methods for RS object-counting have been proposed, existing methods focus on a closed, pre-defined set of object classes. This limitation necessitates costly re-annotation and model re-training to adapt current approaches for counting of novel objects that have not been seen during training, and severely inhibits their application in dynamic, real-world monitoring scenarios. To address this gap, in this work we propose RS-OVC - the first Open Vocabulary Counting (OVC) model for Remote-Sensing and aerial imagery. We show that our model is capable of accurate counting of novel object classes, that were unseen during training, based solely on textual and/or visual conditioning.

[CV-114] Unified Multimodal Uncertain Inference

【速读】:该论文旨在解决多模态情境下不确定推理(Uncertain Inference)的建模问题,即在文本、音频和视频等多种模态中,模型需基于前提信息生成校准后的假设概率估计。现有研究主要局限于单一模态的二分类蕴含判断,缺乏跨模态或细粒度的概率推理框架。为应对这一挑战,作者提出Unified Multimodal Uncertain Inference (UMUI)任务,并构建了包含人类标注标量概率判断的评估集,同时引入CLUE(Calibrated Latent Uncertainty Estimation)方法,其核心在于融合自洽教师校准(self-consistent teacher calibration)与基于分布的置信度探测(distribution-based confidence probing),从而实现多模态下预测结果的校准性提升。实验表明,3B参数模型在所有模态上性能等效甚至优于高达32B参数的基线模型。

链接: https://arxiv.org/abs/2604.08701
作者: Dengjia Zhang,Alexander Martin,William Jurayj,Kenton Murray,Benjamin Van Durme,Reno Kriz
机构: Johns Hopkins University (约翰霍普金斯大学); Human Language Technology Center of Excellence (人类语言技术卓越中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.

[CV-115] EfficientSign: An Attention-Enhanced Lightweight Architecture for Indian Sign Language Recognition

【速读】:该论文旨在解决如何在移动设备(如手机)上构建高效且准确的手语识别系统的问题,特别是在印度手语(Indian Sign Language, ISL)字母识别场景下。其解决方案的关键在于提出一个轻量级模型 EfficientSign,该模型基于 EfficientNet-B0 架构,并引入两个注意力模块:通道注意力机制(Squeeze-and-Excitation)用于增强特征通道的重要性,以及空间注意力层聚焦于手部手势区域。这种注意力增强的设计使得模型在仅使用 4.2M 参数(相比 ResNet18 的 11.2M 减少 62%)的情况下,仍能达到 99.94% 的准确率,显著优于传统手工特征提取方法(如 SURF 在 2015 年实现的 92%),从而为部署在资源受限设备上的高精度手语识别提供了可行路径。

链接: https://arxiv.org/abs/2604.08694
作者: Rishabh Gupta,Shravya R. Nalla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to IEEE Transactions on Human-Machine Systems

点击查看摘要

Abstract:How do you build a sign language recognizer that works on a phone? That question drove this work. We built EfficientSign, a lightweight model which takes EfficientNet-B0 and focuses on two attention modules (Squeeze-and-Excitation for channel focus, and a spatial attention layer that focuses on the hand gestures). We tested it against five other approaches on 12,637 images of Indian Sign Language alphabets, all 26 classes, using 5-fold cross-validation. EfficientSign achieves the accuracy of 99.94% (+/-0.05%), which matches the performance of ResNet18’s 99.97% accuracy, but with 62% fewer parameters (4.2M vs 11.2M). We also experimented with feeding deep features (1,280-dimensional vectors pulled from EfficientNet-B0’s pooling layer) into classical classifiers. SVM achieved the accuracy of 99.63%, Logistic Regression achieved the accuracy of 99.03% and KNN achieved accuracy of 96.33%. All of these blow past the 92% that SURF-based methods managed on a similar dataset back in 2015. Our results show that attention-enhanced learning model provides an efficient and deployable solution for ISL recognition without requiring a massive model or hand-tuned feature pipelines anymore.
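
摘要中的 Squeeze-and-Excitation 通道注意力可以概括为"全局池化 → 两层全连接 → sigmoid 通道权重 → 逐通道缩放"。以下 numpy 草图只为说明这一数据流(权重随机、维度为虚构设定,非 EfficientSign 的实际结构):

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """SE 通道注意力:x 为 (C,H,W) 特征图,w1/w2 为降维、升维全连接权重。"""
    z = x.mean(axis=(1, 2))              # squeeze:全局平均池化 -> (C,)
    s = np.maximum(w1 @ z, 0.0)          # excitation:降维全连接 + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))  # 升维全连接 + sigmoid -> 通道权重
    return x * s[:, None, None]          # 逐通道重新加权

rng = np.random.default_rng(0)
C, r = 8, 2  # 通道数与压缩比(假设值)
x = rng.standard_normal((C, 4, 4))
y = squeeze_excite(x, rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))
```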

[CV-116] InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

【速读】:该论文旨在解决指令驱动的视频编辑(instruction-based video editing)任务中数据稀缺与模型训练效率低下的问题,尤其是在缺乏大规模高质量视频编辑数据的情况下,如何使视频生成模型具备强大的编辑能力。解决方案的关键在于提出InsEdit框架,其核心创新是结合视觉编辑架构与基于互上下文注意力机制(Mutual Context Attention, MCA)的视频数据处理流程,从而生成对齐的视频对,允许编辑操作从视频片段中间任意位置开始,而不仅限于从首帧起始;这一设计显著提升了编辑灵活性和精度,并在仅使用约10万条视频编辑数据的情况下实现了开源方法中的最先进性能。此外,由于训练过程中引入了图像编辑数据,模型无需额外调整即可同时支持图像编辑任务。

链接: https://arxiv.org/abs/2604.08646
作者: Zhefan Rao,Bin Zou,Haoxuan Che,Xuanhua He,Chong Hou Choi,Yanheng Li,Rui Liu,Qifeng Chen
机构: The Hong Kong University of Science and Technology (香港科技大学); Celia Research HK (Celia 研究所香港); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures

点击查看摘要

Abstract:Instruction-based video editing is a natural way to control video content with text, but adapting a video generation model into an editor usually appears data-hungry. At the same time, high-quality video editing data remains scarce. In this paper, we show that a video generation backbone can become a strong video editor without large scale video editing data. We present InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), which creates aligned video pairs where edits can begin in the middle of a clip rather than only from the first frame. With only O(100)K video editing data, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks. In addition, because our training recipe also includes image editing data, the final model supports image editing without any modification.

[CV-117] 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding CVPR

【速读】:该论文旨在解决3D具身智能体在推理过程中因幻觉(hallucination)导致的不安全和非 grounded 决策问题,这类幻觉在3D环境中主要源于物体存在性、空间布局和几何定位的不确定性,而非2D视觉-语言场景中的像素级不一致。解决方案的关键在于提出3D-VCD——首个面向3D具身智能体的推理时视觉对比解码(inference-time visual contrastive decoding)框架,其通过在以对象为中心的表示上施加语义和几何扰动(如类别替换、坐标或尺度篡改),构建一个扭曲的3D场景图,并通过对原始与扭曲场景下的预测进行对比,抑制对地面实况场景证据不敏感的token,从而减少由语言先验驱动的错误决策,实现无需重训练即可提升推理的可靠性。

链接: https://arxiv.org/abs/2604.08645
作者: Makanjuola Ogunleye,Eman Abdelrahman,Ismini Lourentzou
机构: Virginia Tech (弗吉尼亚理工大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 8 pages, 6 figures, Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

点击查看摘要

Abstract:Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
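
摘要未给出 3D-VCD 的具体解码公式;以下按视觉对比解码(VCD)文献中常见的对数几率对比形式给出示意(alpha 为假设的对比强度超参数):

```python
import numpy as np

def contrastive_decode(logits_orig, logits_dist, alpha=1.0):
    # 常见的 VCD 对比形式:放大在原始/扭曲场景下得分变化大的 token,
    # 抑制对扭曲不敏感(即可能由语言先验驱动)的 token
    adjusted = (1.0 + alpha) * logits_orig - alpha * logits_dist
    exp = np.exp(adjusted - adjusted.max())  # 数值稳定的 softmax
    return exp / exp.sum()
```

例如某 token 在扭曲场景下得分明显下降,说明它依赖真实场景证据,对比后其概率会被相应提升。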

[CV-118] VOLTA: The Surprising Ineffectiveness of Auxiliary Losses for Calibrated Deep Learning

【速读】:该论文旨在解决深度学习模型在安全关键应用中不确定性量化(Uncertainty Quantification, UQ)方法性能不一致的问题,尤其在不同数据模态和分布偏移场景下缺乏统一的评估基准。其解决方案的关键在于提出并验证一个简化但高效的模型——VOLTA,该模型仅保留深度编码器、可学习原型、交叉熵损失和事后温度缩放(post hoc temperature scaling),在多个基准数据集(如CIFAR-10、SVHN、CIFAR-10C等)上实现了与复杂UQ方法相当甚至更优的校准性能(Expected Calibration Error最低达0.010)和OOD检测能力(AUROC 0.802),并通过消融实验确认了自适应温度缩放和深度编码器对性能提升的核心作用。

链接: https://arxiv.org/abs/2604.08639
作者: Rahul D Ray,Utkarsh Srivastava
机构: BITS Pilani, Hyderabad Campus (比特·皮拉尼海得拉巴校区)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Uncertainty quantification (UQ) is essential for deploying deep learning models in safety-critical applications, yet no consensus exists on which UQ method performs best across different data modalities and distribution shifts. This paper presents a comprehensive benchmark of ten widely used UQ baselines including MC Dropout, SWAG, ensemble methods, temperature scaling, energy-based OOD, Mahalanobis, hyperbolic classifiers, ENN, Taylor Sensus, and split conformal prediction against a simplified yet highly effective variant of VOLTA that retains only a deep encoder, learnable prototypes, cross-entropy loss, and post hoc temperature scaling. We evaluate all methods on CIFAR-10 (in-distribution), CIFAR-100, SVHN, uniform noise (out-of-distribution), CIFAR-10-C (corruptions), and Tiny ImageNet features (tabular). VOLTA achieves competitive or superior accuracy (up to 0.864 on CIFAR-10), significantly lower expected calibration error (0.010 vs. 0.044 to 0.102 for baselines), and strong OOD detection (AUROC 0.802). Statistical testing over three random seeds shows that VOLTA matches or outperforms most baselines, with ablation studies confirming the importance of adaptive temperature and deep encoders. Our results establish VOLTA as a lightweight, deterministic, and well-calibrated alternative to more complex UQ approaches.
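
VOLTA 所依赖的事后温度缩放可用如下网格搜索示意(实际实现通常在验证集上用 LBFGS 最小化 NLL;此处为简化的假设性实现):

```python
import numpy as np

def nll(logits, labels, T):
    # 温度 T 缩放后的负对数似然(数值稳定写法)
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    # 网格搜索使验证集 NLL 最小的单一标量温度
    grid = np.linspace(0.5, 5.0, 91)
    return min(grid, key=lambda T: nll(logits, labels, T))
```

对过度自信的模型(logits 幅度过大),拟合得到的温度通常大于 1,起到"压平" softmax 的校准作用。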

[CV-119] WildDet3D: Scaling Promptable 3D Detection in the Wild

【速读】:该论文旨在解决单目3D目标检测(monocular 3D object detection)在开放世界场景下的泛化能力不足问题,具体包括:现有方法通常仅支持单一提示模态(prompt type),缺乏融合辅助几何线索(如深度信息)的机制,且训练数据集类别覆盖有限、环境受限,难以迁移至真实复杂场景。解决方案的关键在于提出两个核心创新:其一,设计了WildDet3D架构,这是一个统一的、几何感知的模型,原生支持文本、点和框三种提示模态,并可在推理时灵活引入额外深度信号;其二,构建了目前最大的开放世界3D检测数据集WildDet3D-Data,通过从现有2D标注生成候选3D框并经人工验证,获得超过100万张图像、涵盖13,500个类别的多样化真实场景数据,从而显著提升模型在开放世界设置下的性能与鲁棒性。

链接: https://arxiv.org/abs/2604.08626
作者: Weikai Huang,Jieyu Zhang,Sijun Li,Taoyang Jia,Jiafei Duan,Yunqian Cheng,Jaemin Cho,Mattew Wallingford,Rustin Soraki,Chris Dongjoo Kim,Donovan Clay,Taira Anderson,Winson Han,Ali Farhadi,Bharath Hariharan,Zhongzheng Ren,Ranjay Krishna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection–recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).

[CV-120] From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity CVPR2026

【速读】:该论文旨在解决联邦持续学习(Federated Continual Learning, FCL)中因客户端和任务间动态异构性导致的灾难性遗忘问题,尤其是由类别不平衡引发的表征崩溃现象,即稀有类特征被拉向常见类特征,从而削弱模型对少数类的识别能力。解决方案的关键在于提出一种几何感知的校正方法FEAT(Federated gEometry-Aware correcTion),其核心包含两个模块:一是几何结构对齐模块(Geometric Structure Alignment),通过将特征表示与固定共享的等角紧框架(Equiangular Tight Frame, ETF)原型之间的成对角度相似性进行结构知识蒸馏,以维持跨任务的几何一致性并缓解表征漂移;二是基于能量的几何校正模块(Energy-based Geometric Correction),从特征嵌入中去除与任务无关的方向分量,降低对多数类的预测偏倚,提升对少数类的敏感性,增强模型在类别不平衡分布下的鲁棒性。

链接: https://arxiv.org/abs/2604.08617
作者: Zhuang Qi,Ying-Peng Tang,Lei Meng,Guoqing Chao,Lei Wu,Han Yu,Xiangxu Meng
机构: Shandong University (山东大学); Nanyang Technological University (南洋理工大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 accepted

点击查看摘要

Abstract:Exemplar replay has become an effective strategy for mitigating catastrophic forgetting in federated continual learning (FCL) by retaining representative samples from past tasks. Existing studies focus on designing sample-importance estimation mechanisms to identify information-rich samples. However, they typically overlook strategies for effectively utilizing the selected exemplars, which limits their performance under continual dynamic heterogeneity across clients and tasks. To address this issue, this paper proposes a Federated gEometry-Aware correcTion method, termed FEAT, which alleviates imbalance-induced representation collapse that drags rare-class features toward frequent classes across clients. Specifically, it consists of two key modules: 1) the Geometric Structure Alignment module performs structural knowledge distillation by aligning the pairwise angular similarities between feature representations and their corresponding Equiangular Tight Frame prototypes, which are fixed and shared across clients to serve as a class-discriminative reference structure. This encourages geometric consistency across tasks and helps mitigate representation drift; 2) the Energy-based Geometric Correction module removes task-irrelevant directional components from feature embeddings, which reduces prediction bias toward majority classes. This improves sensitivity to minority classes and enhances the model’s robustness under class-imbalanced distributions.
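
摘要中固定共享的等角紧框架(ETF)原型可按标准 simplex ETF 构造(此处取特征维度等于类别数 C,仅作示意):

```python
import numpy as np

def simplex_etf(num_classes):
    """构造 simplex ETF:每列是一个单位范数的类原型,
    任意两个原型的内积恰为 -1/(C-1),即最大等角分离。"""
    c = num_classes
    m = np.eye(c) - np.ones((c, c)) / c   # 中心化投影
    return np.sqrt(c / (c - 1)) * m       # 缩放到单位范数
```

这种固定的几何结构让各客户端无需通信即可共享同一个判别性参考系,用于摘要中的角度相似性蒸馏。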

[CV-121] MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments

【速读】:该论文旨在解决真实开放水域环境中细粒度视觉理解与高层推理任务缺乏专用评测基准的问题。现有研究在复杂海事场景下的多模态认知能力评估不足,导致模型在细粒度船舶分类、目标检测及视觉问答等任务中表现受限。解决方案的关键在于提出MARINER基准,其基于新颖的“实体-环境-事件”(Entity-Environment-Event, 3E)范式,包含16,629张多源海上图像,涵盖63类细粒度船只类别、多样恶劣环境条件以及5种典型动态海事事件,覆盖细粒度分类、目标检测和视觉问答三大任务类型。该基准不仅填补了海事领域现实性和认知层级评估的空白,也为未来鲁棒视觉语言模型在开放水域应用中的研究提供了标准化测试平台。

链接: https://arxiv.org/abs/2604.08615
作者: Xingming Liao,Ning Chen,Muying Shu,Yunpeng Yin,Peijian Zeng,Zhuowei Wang,Nankai Lin,Lianglun Cheng
机构: Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-grained visual understanding and high-level reasoning in real-world open-water environments remain under-explored due to the lack of dedicated benchmarks. We introduce MARINER, a comprehensive benchmark built under the novel Entity-Environment-Event (3E) paradigm. MARINER contains 16,629 multi-source maritime images with 63 fine-grained vessel categories, diverse adverse environments, and 5 typical dynamic maritime incidents, covering fine-grained classification, object detection, and visual question answering tasks. We conduct extensive evaluations on mainstream Multimodal Large Language Models (MLLMs) and establish baselines, revealing that even advanced models struggle with fine-grained discrimination and causal reasoning in complex marine scenes. As a dedicated maritime benchmark, MARINER fills the gap of realistic and cognitive-level evaluation for maritime multimodal understanding, and promotes future research on robust vision-language models for open-water applications. Appendix and supplementary materials are available at this https URL.

[CV-122] ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction

【速读】:该论文旨在解决视频显著性预测(Video Saliency Prediction)问题,即准确识别视频中吸引人类注意力的时空区域。解决方案的关键在于提出了一种多专家集成框架 ViSAGE(Video Saliency with Adaptive Gated Experts),通过引入具有自适应门控机制的专用解码器来提取互补的时空特征,并在推理阶段融合不同专家的预测结果,从而聚合多种归纳偏置(inductive biases),有效捕捉视频中复杂的时空显著性线索。

链接: https://arxiv.org/abs/2604.08613
作者: Kun Wang,Yupeng Hu,Zhiran Li,Hao Liu,Qianlong Xiang,Liqiang Nie
机构: Shandong University; Harbin Institute of Technology (Shenzhen); City University of Hong Kong; Shenzhen Loop Area Institute
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this report, we present our champion solution for the NTIRE 2026 Challenge on Video Saliency Prediction held in conjunction with CVPR 2026. To exploit complementary inductive biases for video saliency, we propose Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework. Each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from different experts are then fused at inference. ViSAGE thereby aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the Private Test set, ViSAGE ranked first on two out of four evaluation metrics, and outperformed most competing solutions on the other two metrics, demonstrating its effectiveness and generalization ability. Our code has been released at this https URL.
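
多专家预测的融合可示意为 softmax 门控加权求和(ViSAGE 的门控权重由自适应门控模块学习得到;此处门控 logits 为假设输入):

```python
import numpy as np

def gated_fusion(expert_maps, gate_logits):
    # 对各专家的显著性图按 softmax 门控权重加权求和
    w = np.exp(gate_logits - np.max(gate_logits))
    w = w / w.sum()
    return np.tensordot(w, np.stack(expert_maps), axes=1)
```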

[CV-123] A Semi-Automated Framework for 3D Reconstruction of Medieval Manuscript Miniatures

【速读】:该论文旨在解决将中世纪手稿中的二维微型插图(miniatures)自动化转换为适用于扩展现实(XR)、触觉3D打印及基于网络的可视化等多场景应用的三维数字模型的问题。其关键解决方案在于构建了一个半自动化流程:首先利用SAM(Segment Anything Model)进行图像分割,随后采用Hi3DGen生成初始网格,该方法通过法向量桥接策略在拓扑质量与表面细节之间取得平衡;再经由ZBrush进行专家精修,并结合AI辅助纹理生成,最终实现高保真、可交互的三维模型输出。该框架在哥特式和文艺复兴时期两类艺术风格的手稿案例中均验证了适用性。

链接: https://arxiv.org/abs/2604.08610
作者: Riccardo Pallotto,Pierluigi Feliciati,Tiberio Uricchio
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a semi-automated framework for transforming two-dimensional miniatures from medieval manuscripts into three-dimensional digital models suitable for extended reality (XR), tactile 3D printing, and web-based visualization. We evaluate seven image-to-3D methods (TripoSR, SF3D, SPAR3D, TRELLIS, Wonder3D, SAM 3D, Hi3DGen) on 69 manuscript figures from two collections using rendering-based metrics (Silhouette IoU, LPIPS, CLIP Score) and volumetric measures (Depth Range Ratio, watertight percentage), revealing a trade-off between volumetric expansion and geometric fidelity. Hi3DGen balances topological quality with rich surface detail through its normal bridging approach, making it a good starting point for expert refinement. Our pipeline combines SAM segmentation, Hi3DGen mesh generation, expert refinement in ZBrush, and AI-assisted texturing. Two case studies on Gothic illuminations from the Decretum Gratiani (Vatican Library) and Renaissance miniatures by Giulio Clovio demonstrate applicability across artistic traditions. The resulting models can support WebXR visualization, AR overlay on physical manuscripts, and tactile 3D prints for visually impaired users.
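
文中基于渲染的指标之一 Silhouette IoU,即渲染轮廓与原图轮廓两个二值掩码的交并比,可示意如下:

```python
import numpy as np

def silhouette_iou(mask_a, mask_b):
    """两个二值轮廓掩码的交并比(IoU);两掩码皆空时约定为 1.0。"""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0
```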

[CV-124] Detection of Hate and Threat in Digital Forensics: A Case-Driven Multimodal Approach

【速读】:该论文旨在解决数字取证分析中对异构证据(如图像、扫描文档和上下文报告)进行仇恨和威胁检测时存在的问题,即现有自动化方法通常假设输入文本是干净的,或在未提供法证依据的情况下直接应用视觉模型。其解决方案的关键在于提出一种基于案例的多模态方法,通过明确识别文本证据的来源(嵌入文本、关联上下文文本或仅图像内容),并根据证据配置选择性地应用文本分析、多模态融合或仅图像语义推理(采用具有视觉Transformer骨干网络的视觉语言模型)。该方法通过条件化推理来匹配法证决策逻辑,提升证据可追溯性,并避免不合理的模态假设。

链接: https://arxiv.org/abs/2604.08609
作者: Ponkoj Chandra Shill
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Digital forensic investigations increasingly rely on heterogeneous evidence such as images, scanned documents, and contextual reports. These artifacts may contain explicit or implicit expressions of harm, hate, threat, violence, or intimidation, yet existing automated approaches often assume clean text input or apply vision models without forensic justification. This paper presents a case-driven multimodal approach for hate and threat detection in forensic analysis. The proposed framework explicitly determines the presence and source of textual evidence, distinguishing between embedded text, associated contextual text, and image-only evidence. Based on the identified evidence configuration, the framework selectively applies text analysis, multimodal fusion, or image-only semantic reasoning using vision language models with vision transformer backbones (ViT). By conditioning inference on evidence availability, the approach mirrors forensic decision-making, improves evidentiary traceability, and avoids unjustified modality assumptions. Experimental evaluation on forensic-style image evidence demonstrates consistent and interpretable behavior across heterogeneous evidence scenarios.
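
论文按证据配置选择处理通路的决策逻辑可示意如下(分支返回值为假设性命名,仅对应摘要描述的三种情形):

```python
def select_pipeline(has_embedded_text, has_context_text, has_image):
    """按证据可用性路由:有文本则做文本分析,图文并存则做多模态融合,
    仅有图像则交给基于 ViT 的视觉语言模型做纯图像语义推理。"""
    has_text = has_embedded_text or has_context_text
    if has_image and has_text:
        return "multimodal_fusion"
    if has_text:
        return "text_analysis"
    if has_image:
        return "image_only_vlm"
    return "no_evidence"
```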

[CV-125] Silhouette Loss: Differentiable Global Structure Learning for Deep Representations

【速读】:该论文旨在解决监督深度学习中分类任务的表征学习问题,即如何在嵌入空间中显式地增强类内紧凑性和类间分离性,而传统交叉熵(Cross-Entropy, CE)损失函数并不直接优化这些几何特性。解决方案的关键在于提出一种名为Soft Silhouette Loss的新颖可微分目标函数,其灵感来源于聚类分析中的经典轮廓系数(silhouette coefficient)。该方法通过在批次层面评估每个样本相对于所有类别的相对距离,从而引导样本更靠近自身类别而非其他类别,同时保持计算轻量;此外,Soft Silhouette Loss可与CE或监督对比学习(Supervised Contrastive Learning, SupCon)无缝结合,形成混合目标函数,联合优化局部成对一致性与全局簇结构,显著提升模型性能且计算开销更低。

链接: https://arxiv.org/abs/2604.08573
作者: Matheus Vinícius Todescato,Joel Luís Carbonera
机构: Institute of Informatics, UFRGS (联邦里约热内卢联邦大学信息学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning discriminative representations is a central goal of supervised deep learning. While cross-entropy (CE) remains the dominant objective for classification, it does not explicitly enforce desirable geometric properties in the embedding space, such as intra-class compactness and inter-class separation. Existing metric learning approaches, including supervised contrastive learning (SupCon) and proxy-based methods, address this limitation by operating on pairwise or proxy-based relationships, but often increase computational cost and complexity. In this work, we introduce Soft Silhouette Loss, a novel differentiable objective inspired by the classical silhouette coefficient from clustering analysis. Unlike pairwise objectives, our formulation evaluates each sample against all classes in the batch, providing a batch-level notion of global structure. The proposed loss directly encourages samples to be closer to their own class than to competing classes, while remaining lightweight. Soft Silhouette Loss can be seamlessly combined with cross-entropy, and is also complementary to supervised contrastive learning. We propose a hybrid objective that integrates them, jointly optimizing local pairwise consistency and global cluster structure. Extensive experiments on seven diverse datasets demonstrate that: (i) augmenting CE with Soft Silhouette Loss consistently improves over CE and other metric learning baselines; (ii) the hybrid formulation outperforms SupCon alone; and (iii) the combined method achieves the best performance, improving average top-1 accuracy from 36.71% (CE) and 37.85% (SupCon2) to 39.08%, while incurring substantially lower computational overhead. These results suggest that classical clustering principles can be reinterpreted as differentiable objectives for deep learning, enabling efficient optimization of both local and global structure in representation spaces.
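
下面给出轮廓系数的一种可微松弛的 NumPy 示意(a 为样本到本类质心的距离,b 为到最近他类质心的距离;论文批内的具体公式可能不同):

```python
import numpy as np

def soft_silhouette_loss(embeddings, labels):
    """批级轮廓损失:最小化 1 - mean((b - a) / max(a, b))。
    类内越紧凑、类间越分离,轮廓值越接近 1,损失越接近 0。"""
    classes = np.unique(labels)
    cents = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(embeddings[:, None, :] - cents[None, :, :], axis=2)
    own = np.array([np.where(classes == y)[0][0] for y in labels])
    a = d[np.arange(len(labels)), own]          # 到本类质心的距离
    d_other = d.copy()
    d_other[np.arange(len(labels)), own] = np.inf
    b = d_other.min(axis=1)                     # 到最近他类质心的距离
    s = (b - a) / np.maximum(np.maximum(a, b), 1e-12)
    return 1.0 - s.mean()
```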

[CV-126] Ranked Activation Shift for Post-Hoc Out-of-Distribution Detection

【速读】:该论文旨在解决当前基于后处理(post-hoc)的分布外检测(out-of-distribution detection)方法在不同数据集和模型上表现不稳定的问题。研究表明,这种不稳定性源于中间层激活分布的差异,并识别出基于缩放(scaling-based)的方法在最后一层前激活未被修正(non-rectified)时存在失效模式。解决方案的关键在于提出一种无需超参数调优的新型方法 Ranked Activation Shift,其核心思想是用固定的、来自正常分布(in-distribution)的参考激活幅度轮廓替代排序后的激活幅度,从而实现跨数据集与架构的一致高性能检测,同时保持正常分布分类准确率不变。

链接: https://arxiv.org/abs/2604.08572
作者: Gianluca Guglielmo,Marc Masana
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL

点击查看摘要

Abstract:State-of-the-art post-hoc out-of-distribution detection methods rely on intermediate layer activation editing. However, they exhibit inconsistent performance across datasets and models. We show that this instability is driven by differences in the activation distributions, and identify a failure mode of scaling-based methods that arises when penultimate layer activations are not rectified. Motivated by this analysis, we propose Ranked Activation Shift, a hyperparameter-free post-hoc method that replaces sorted activation magnitudes with a fixed in-distribution reference profile. Our simple plug-and-play method shows strong and consistent performance across datasets and architectures without assumptions on the penultimate layer activation function, and without requiring any hyperparameter tuning, while preserving in-distribution classification accuracy by construction. We further analyze what drives the improvement, showing that both inhibiting and exciting activation shifts independently contribute to better out-of-distribution discrimination.
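
按标题与摘要推测,该方法的核心操作是在保持样本自身激活排序不变的前提下,用固定的 in-distribution 参考轮廓替换排序后的激活幅度,示意如下(细节为基于摘要的假设):

```python
import numpy as np

def rank_profile_replace(activations, reference_profile):
    """保留样本激活的排名顺序,但把幅度换成固定的参考轮廓:
    样本中第 k 小的激活被赋予参考轮廓中第 k 小的值。"""
    order = np.argsort(activations)              # 升序位置
    out = np.empty_like(reference_profile, dtype=float)
    out[order] = np.sort(reference_profile)      # 按排名注入参考幅度
    return out
```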

[CV-127] DSVTLA: Deep Swin Vision Transformer-Based Transfer Learning Architecture for Multi-Type Cancer Histopathological Cancer Image Classification

【速读】:该论文旨在解决多癌种组织病理图像分类中因图像异质性导致的模型泛化能力不足问题,尤其在不同临床成像条件下(如原始与分割图像)保持高准确性和鲁棒性的挑战。解决方案的关键在于提出一种基于Swin-Vision Transformer的迁移学习架构,通过融合分层Swin Transformer与ResNet50的卷积特征提取模块,同时捕捉长距离上下文依赖关系和细粒度局部形态学模式,从而实现对乳腺癌、口腔癌、肺癌、结肠癌、肾癌及急性淋巴细胞白血病(ALL)等多种癌症类型的高度精准分类,实验表明其在多个数据集上达到接近或达到100%的测试准确率,且具备优异的精确率、F1分数和召回率稳定性。

链接: https://arxiv.org/abs/2604.09468
作者: Muazzem Hussain Khan,Tasdid Hasnain,Md. Jamil khan,Ruhul Amin,Md. Shamim Reza,Md. Al Mehedi Hasan,Md Ashad Alam
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 [ages. 9 Figures

点击查看摘要

Abstract:In this study, we propose a deep Swin-Vision Transformer-based transfer learning architecture for robust multi-cancer histopathological image classification. The proposed framework integrates a hierarchical Swin Transformer with ResNet50-based convolutional feature extraction, enabling the model to capture both long-range contextual dependencies and fine-grained local morphological patterns within histopathological images. To validate the efficiency of the proposed architecture, extensive experiments were conducted on a comprehensive multi-cancer dataset covering Breast Cancer, Oral Cancer, Lung and Colon Cancer, Kidney Cancer, and Acute Lymphocytic Leukemia (ALL); both original and segmented images were analyzed to assess model robustness across heterogeneous clinical imaging conditions. Our approach is benchmarked against several state-of-the-art CNN and transformer models, including DenseNet121, DenseNet201, InceptionV3, ResNet50, EfficientNetB3, multiple ViT variants, and Swin Transformer models. All models were trained and validated using a unified pipeline incorporating balanced data preprocessing, transfer learning, and fine-tuning strategies. The experimental results demonstrate that our proposed architecture consistently achieves superior performance, reaching 100% test accuracy on the lung-colon cancer and segmented leukemia datasets, and up to 99.23% accuracy for breast cancer classification. The model also achieves near-perfect precision, F1 score, and recall, indicating highly stable performance across diverse cancer types. Overall, the proposed model establishes a highly accurate, interpretable, and robust multi-cancer classification system, setting a strong benchmark for future research and providing a unified comparative assessment useful for designing reliable AI-assisted histopathological diagnosis and clinical decision-making.

[CV-128] Multi-task Just Recognizable Difference for Video Coding for Machines: Database Model and Coding Application

【速读】:该论文旨在解决当前Just Recognizable Difference (JRD)模型仅适用于单任务场景、难以兼顾多任务预测精度与视频机器编码(Video Coding for Machines, VCM)效率的问题。其核心解决方案是提出多任务JRD(Multi-Task JRD, MT-JRD)数据集与属性辅助的多任务JRD(Attribute-assisted MT-JRD, AMT-JRD)模型,通过引入对象属性信息(如尺寸和位置先验)到对象级JRD预测中,利用属性特征融合模块(Attribute Feature Fusion Module, AFFM)增强感知机制建模能力,并结合通用特征提取模块(Generalized Feature Extraction Module, GFEM)与专用特征提取模块(Specialized Feature Extraction Module, SFEM)实现跨任务联合学习,从而在保持高预测准确性的同时显著提升VCM的压缩效率。

链接: https://arxiv.org/abs/2604.09421
作者: Junqi Liu,Yun Zhang,Xiaoxia Huang,Long Xu,Weisi Lin
机构: Sun Yat-Sen University (中山大学); Chinese Academy of Sciences (中国科学院); Nanyang Technological University (南洋理工大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Submitted to IEEE Transactions on Circuits and Systems for Video Technology

点击查看摘要

Abstract:Just Recognizable Difference (JRD) boosts coding efficiency for machine vision through visibility threshold modeling, but is currently limited to a single-task scenario. To address this issue, we propose a Multi-Task JRD (MT-JRD) dataset and an Attribute-assisted MT-JRD (AMT-JRD) model for Video Coding for Machines (VCM), enhancing both prediction accuracy and coding efficiency. First, we construct a dataset comprising 27,264 JRD annotations from machines, supporting three representative tasks including object detection, instance segmentation, and keypoint detection. Secondly, we propose the AMT-JRD prediction model, which integrates a Generalized Feature Extraction Module (GFEM) and a Specialized Feature Extraction Module (SFEM) to facilitate joint learning across multiple tasks. Thirdly, we innovatively incorporate object attribute information into object-wise JRD prediction through the Attribute Feature Fusion Module (AFFM), which introduces prior knowledge about object size and location. This design effectively compensates for the limitations of relying solely on image features and enhances the model's capacity to represent the perceptual mechanisms of machine vision. Finally, we apply the AMT-JRD model to VCM, where the accurately predicted JRDs are applied to reduce the coding bit rate while preserving accuracy across multiple machine vision tasks. Extensive experimental results demonstrate that AMT-JRD achieves precise and robust multi-task prediction with a mean absolute error of 3.781 and error variance of 5.332 across three tasks, outperforming the state-of-the-art single-task prediction model by 6.7% and 6.3%, respectively. Coding experiments further reveal that, compared to the baseline VVC and JPEG codecs, the AMT-JRD-based VCM achieves average Bjontegaard Delta-mean Average Precision (BD-mAP) gains of 3.861% and 7.886%, respectively.

[CV-129] Cluster-First Labelling: An Automated Pipeline for Segmentation and Morphological Clustering in Histology Whole Slide Images

【速读】:该论文旨在解决组织病理学全切片图像(Whole Slide Images, WSIs)中组织成分标注过程的高劳动强度问题,因单张切片可能包含数万个需手动边界勾画和分类的细胞、细胞核及其他形态学上可区分的结构。其解决方案的关键在于提出一种“以集群为先”(cluster-first paradigm)的端到端自动化流程:首先对WSI进行分块并过滤低信息量区域,接着利用Cellpose-SAM模型进行组织成分分割,结合预训练ResNet-50提取神经嵌入(neural embeddings),通过UMAP降维后使用DBSCAN聚类方法将形态相似的对象归类;最终由人工标注代表性簇而非逐个对象,显著降低标注工作量。实验表明,在13种不同组织类型(来自人、大鼠和兔)共3,696个组织成分上,该方法实现了加权聚类标签一致性准确率达96.8%,其中7种组织类型达到完美一致,验证了方案的有效性与泛化能力。

链接: https://arxiv.org/abs/2604.09370
作者: Muhammad Haseeb Ahmad,Sharmila Rajendran,Damion Young,Jon Mason
机构: University of Oxford (牛津大学)
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:Labelling tissue components in histology whole slide images (WSIs) is prohibitively labour-intensive: a single slide may contain tens of thousands of structures (cells, nuclei, and other morphologically distinct objects), each requiring manual boundary delineation and classification. We present a cloud-native, end-to-end pipeline that automates this process through a cluster-first paradigm. Our system tiles WSIs, filters out tiles deemed unlikely to contain valuable information, segments tissue components with Cellpose-SAM (including cells, nuclei, and other morphologically similar structures), extracts neural embeddings via a pretrained ResNet-50, reduces dimensionality with UMAP, and groups morphologically similar objects using DBSCAN clustering. Under this paradigm, a human annotator labels representative clusters rather than individual objects, reducing annotation effort by orders of magnitude. We evaluate the pipeline on 3,696 tissue components across 13 diverse tissue types from three species (human, rat, rabbit), measuring how well unsupervised clusters align with independent human labels via per-tile Hungarian-algorithm matching. Our system achieves a weighted cluster-label alignment accuracy of 96.8%, with 7 of 13 tissue types reaching perfect agreement. The pipeline, a companion labelling web application, and all evaluation code are released as open-source software.
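
评测中基于匈牙利算法的簇-标签对齐准确率,可用 SciPy 的 `linear_sum_assignment` 示意如下:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_label_accuracy(clusters, labels):
    """在簇 id 与人工标签之间求最优一对一匹配后的准确率。"""
    cs, ls = np.unique(clusters), np.unique(labels)
    cost = np.zeros((len(cs), len(ls)))
    for i, c in enumerate(cs):
        for j, l in enumerate(ls):
            # 负的重叠计数:最小化成本 = 最大化匹配样本数
            cost[i, j] = -np.sum((clusters == c) & (labels == l))
    rows, cols = linear_sum_assignment(cost)
    return -cost[rows, cols].sum() / len(labels)
```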

[CV-130] UHD Low-Light Image Enhancement via Real-Time Enhancement Methods with Clifford Information Fusion

【速读】:该论文旨在解决超高清(UHD)低光照图像增强在边缘设备上实时推理效率低的问题,现有基于Transformer或高维复杂卷积神经网络的方法常受“内存墙”瓶颈限制,难以实现毫秒级推理。其解决方案的关键在于提出一种基于二维欧几里得空间中Clifford代数的几何特征融合机制:首先构建四层逐步升分辨率的特征金字塔,通过高斯模糊核分解图像为低频与高频结构成分,并采用轻量级深度可分离卷积U-Net进行双分支特征提取;其次引入空间感知的Clifford代数将特征张量映射至多向量空间(标量、向量、二向量),利用Clifford相似性聚合特征以抑制噪声并保留纹理,同时在重建阶段输出自适应Gamma和增益图,结合Retinex理论实现物理约束的非线性亮度调整;最终通过FP16混合精度计算与动态算子融合,在单个消费级设备上实现4K/8K图像的毫秒级推理,且优于当前最优模型。

链接: https://arxiv.org/abs/2604.09321
作者: Xiaohan Wang,Chen Wu,Dawei Zhao,Guangwei Gao,Dianjie Lu,Guijuan Zhang,Linwei Fan,Xu Lu,Shuai Wu,Hang Wei,Zhuoran Zheng
机构: Xi’an University of Electronic Science and Technology (西安电子科技大学); National University of Defense Technology (国防科技大学); Nanjing University of Science and Technology (南京理工大学); Shandong Normal University (山东师范大学); Shandong University of Finance and Economics (山东财经大学); Shandong Agricultural University (山东农业大学); Qilu University of Technology (齐鲁工业大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Considering efficiency, ultra-high-definition (UHD) low-light image restoration is extremely challenging. Existing methods based on Transformer architectures or high-dimensional complex convolutional neural networks often suffer from the “memory wall” bottleneck, failing to achieve millisecond-level inference on edge devices. To address this issue, we propose a novel real-time UHD low-light enhancement network based on geometric feature fusion using Clifford algebra in 2D Euclidean space. First, we construct a four-layer feature pyramid with gradually increasing resolution, which decomposes input images into low-frequency and high-frequency structural components via a Gaussian blur kernel, and adopts a lightweight U-Net based on depthwise separable convolution for dual-branch feature extraction. Second, to resolve structural information loss and artifacts from traditional high-low frequency feature fusion, we introduce spatially aware Clifford algebra, which maps feature tensors to a multivector space (scalars, vectors, bivectors) and uses Clifford similarity to aggregate features while suppressing noise and preserving textures. In the reconstruction stage, the network outputs adaptive Gamma and Gain maps, which perform physically constrained non-linear brightness adjustment via Retinex theory. Integrated with FP16 mixed-precision computation and dynamic operator fusion, our method achieves millisecond-level inference for 4K/8K images on a single consumer-grade device, while outperforming state-of-the-art (SOTA) models on several restoration metrics.
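
基于 Retinex 思想的 Gamma/Gain 亮度调整可示意为逐像素 out = gain * in**gamma(输入归一化到 [0, 1];两张参数图在实际模型中由网络自适应预测,此处为假设输入):

```python
import numpy as np

def apply_gamma_gain(image, gamma_map, gain_map):
    """逐像素非线性亮度调整:gamma < 1 提亮暗区,gain 控制整体增益;
    输出裁剪回 [0, 1] 以满足物理约束。"""
    image = np.clip(image, 0.0, 1.0)
    return np.clip(gain_map * np.power(image, gamma_map), 0.0, 1.0)
```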

[CV-131] Compositional-Degradation UAV Image Restoration: Conditional Decoupled MoE Network and A Benchmark

【速读】:该论文旨在解决无人机(UAV)图像在复杂真实飞行环境中因多种退化因素(如雨、雾、噪声等)共同作用导致的图像质量下降问题,此类复合退化会显著影响下游任务(如目标检测)的性能。现有统一恢复方法通常依赖于隐式退化表征,将多种退化因素纠缠为单一条件,造成不同退化类型之间的相互干扰。解决方案的关键在于提出DAME-Net(Degradation-Aware Mixture-of-Experts Network),其核心创新包括:1)设计因子感知模块(Factor-wise Degradation Perception Module, FDPM),通过多标签预测与标签相似性引导的软对齐机制,显式提取各退化因子的独立特征,替代隐式纠缠条件;2)构建条件解耦专家混合模块(Conditioned Decoupled MoE Module, CDMM),利用上述显式退化线索实现分阶段条件控制、空频混合处理及掩码约束的专家路由策略,从而实现选择性因子特异性修复并抑制无关干扰。该方法显著提升了在已见和未见退化组合下的恢复效果,尤其在高阶复合退化场景中优势明显。

链接: https://arxiv.org/abs/2604.09313
作者: Jinquan Yan,Zhicheng Zhao,Zhengzheng Tu,Chenglong Li,Jin Tang,Bin Luo
机构: Anhui University (安徽大学); Key Laboratory of Intelligent Computing Signal Processing (教育部智能计算信号处理重点实验室); Anhui Provincial Key Laboratory of Multimodal Cognitive Computation (安徽省多模态认知计算重点实验室)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:UAV images are critical for applications such as large-area mapping, infrastructure inspection, and emergency response. However, in real-world flight environments, a single image is often affected by multiple degradation factors, including rain, haze, and noise, undermining downstream task performance. Current unified restoration approaches typically rely on implicit degradation representations that entangle multiple factors into a single condition, causing mutual interference among heterogeneous corrections. To this end, we propose DAME-Net, a Degradation-Aware Mixture-of-Experts Network that decouples explicit degradation perception from degradation-conditioned reconstruction for compositional UAV image restoration. Specifically, we design a Factor-wise Degradation Perception Module (FDPM) to provide explicit per-factor degradation cues for the restoration stage through multi-label prediction with label-similarity-guided soft alignment, replacing implicit entangled conditions with interpretable and generalizable degradation descriptions. Moreover, we develop a Conditioned Decoupled MoE Module (CDMM) that leverages these cues for stage-wise conditioning, spatial-frequency hybrid processing, and mask-constrained decoupled expert routing, enabling selective factor-specific correction while suppressing irrelevant interference. In addition, we construct the Multi-Degradation UAV Restoration benchmark (MDUR), the first large-scale UAV benchmark for compositional UAV image restoration, with 43 degradation configurations from single degradations to four-factor composites and standardized seen/unseen splits. Extensive experiments on MDUR demonstrate consistent improvements over representative unified restoration methods, with greater gains on unseen and higher-order composite degradations. Downstream experiments further validate benefits for UAV object detection.
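CDMM 中「掩码约束的解耦专家路由」的思想,可以抽象为:按 FDPM 给出的各退化因子概率生成门控掩码,只激活相关专家并按归一化权重融合输出。下面用线性专家写一个假设性草图(与原文的网络结构无关,仅示意路由逻辑):

```python
import numpy as np

def masked_moe(x, expert_weights, factor_probs, thresh=0.5):
    """掩码约束的专家路由(示意)。

    x:              输入特征 (d,)
    expert_weights: 每个退化因子对应一个线性专家 (K 个 d×d 矩阵)
    factor_probs:   感知模块给出的各因子退化概率 (K,)
    仅激活概率超过阈值的专家,门控按概率归一化。
    """
    factor_probs = np.asarray(factor_probs, dtype=float)
    mask = (factor_probs > thresh).astype(float)
    gates = mask * factor_probs
    if gates.sum() == 0:          # 无退化被检出:恒等输出
        return x
    gates = gates / gates.sum()
    return sum(g * (W @ x) for g, W in zip(gates, expert_weights))
```

当专家都取恒等映射时,任意门控组合的输出都应等于输入,可据此做快速自检。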

[CV-132] AMO-ENE: Attention-based Multi-Omics Fusion Model for Outcome Prediction in Extra Nodal Extension and HPV-associated Oropharyngeal Cancer

【速读】:该论文旨在解决人乳头瘤病毒(HPV)相关口咽癌(OPC)中淋巴结外侵犯(ENE)作为新兴预后因素未能纳入临床分期标准的问题,尤其针对影像学评估中因分割不一致、淋巴结周边对比度低及人工标注耗时等实践局限。其解决方案的关键在于提出了一种全自动端到端的计算流程:首先采用分层3D半监督分割模型从放疗规划CT图像中自动识别并勾画隐匿性ENE(iENE),进而提取放射组学与深度特征构建影像检测的ENE分级分类器;随后将该分类结果与原发灶特征融合至基于注意力机制的多模态预后预测模型中,实现动态生存结局预测。该方法在397例接受放疗或化放疗的HPV阳性OPC患者队列中验证,2年复发、总生存和无病生存的AUC分别达88.2%、79.2%和78.1%,显著优于基线模型,具备临床决策可行性。

链接: https://arxiv.org/abs/2604.09280
作者: Gautier Hénique,William Le,Gabriel Dayan,Coralie Brodeur,Kristoff Nelson,Apostolos Christopoulos,Edith Filion,Phuc-Felix Nguyen-Tan,Laurent Letourneau-Guillon,Houda Bahig,Samuel Kadoury
机构: MedICAL Laboratory, Polytechnique Montréeal (蒙特利尔综合理工学院医学实验室); Centre de recherche du CHUM (CRCHUM) (蒙特利尔大学健康中心研究机构); Centre Hospitalier de l’Université de Montréal (CHUM) (蒙特利尔大学医院中心)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Extranodal extension (ENE) is an emerging prognostic factor in human papillomavirus (HPV)-associated oropharyngeal cancer (OPC), although it is currently omitted as a clinical staging criteria. Recent works have advocated for the inclusion of iENE as a prognostic marker in HPV-positive OPC staging. However, several practical limitations continue to hinder its clinical integration, including inconsistencies in segmentation, low contrast in the periphery of metastatic lymph nodes on CT imaging, and laborious manual annotations. To address these limitations, we propose a fully automated end-to-end pipeline that uses computed tomography (CT) images with clinical data to assess the status of nodal ENE and predict treatment outcomes. Our approach includes a hierarchical 3D semi-supervised segmentation model designed to detect and delineate relevant iENE from radiotherapy planning CT scans. From these segmentations, a set of radiomics and deep features are extracted to train an imaging-detected ENE grading classifier. The predicted ENE status is then evaluated for its prognostic value and compared with existing staging criteria. Furthermore, we integrate these nodal features with primary tumor characteristics in a multimodal, attention-based outcome prediction model, providing a dynamic framework for outcome prediction. Our method is validated in an internal cohort of 397 HPV-positive OPC patients treated with radiation therapy or chemoradiotherapy between 2009 and 2020. For outcome prediction at the 2-year mark, our pipeline surpassed baseline models with 88.2% (4.8) in AUC for metastatic recurrence, 79.2% (7.4) for overall survival, and 78.1% (8.6) for disease-free survival. We also obtain a concordance index of 83.3% (6.5) for metastatic recurrence, 71.3% (8.9) for overall survival, and 70.0% (8.1) for disease-free survival, making it feasible for clinical decision making.
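摘要中「将淋巴结特征与原发灶特征做基于注意力的多模态融合」的做法,可用单头缩放点积注意力示意(极简草图,softmax 与缩放写法均为通用做法,并非论文模型本身):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # 数值稳定的 softmax
    return e / e.sum()

def attention_fuse(query, keys, values):
    """单头缩放点积注意力(示意):query 可取原发灶特征,
    keys/values 取各淋巴结特征,输出为加权融合后的表征。"""
    d = query.shape[-1]
    w = softmax(keys @ query / np.sqrt(d))   # (n,) 注意力权重
    return w @ values                        # (d,) 融合特征
```

当所有 key 与 query 的相似度相同时,权重退化为均匀平均,输出即各 value 的均值。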

[CV-133] Training-free Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models

【速读】:该论文旨在解决生成高分辨率(High-Resolution, HR)图像时计算成本高昂的问题,尤其是在用户需通过多次尝试不同提示词(prompt)和随机种子(seed)来筛选优质结果的场景下。为降低计算负担,作者提出生成低分辨率(Low-Resolution, LR)预览图(Previews),使其在感知上与HR图像保持一致,从而允许用户在生成最终HR图像前快速识别潜在候选方案。解决方案的关键在于提出“零交换子条件”(commutator-zero condition),用于确保基于流匹配(flow matching)模型中LR与HR图像之间的感知一致性;该条件驱动了一种无需训练的实现方法,结合下采样矩阵选择与交换子零引导机制,在不牺牲感知质量的前提下显著减少计算量,实验表明可实现最高达33%的计算节省,并在集成现有加速技术时获得最高3倍的速度提升。

链接: https://arxiv.org/abs/2604.09227
作者: Wongi Jeong,Hoigi Seo,Se Young Chun
机构: Seoul National University (首尔国立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image generative models have become indispensable tools to yield exquisite high-resolution (HR) images for everyone, ranging from general users to professional designers. However, a desired outcome often requires generating a large number of HR images with different prompts and seeds, resulting in high computational cost for both users and service providers. Generating low-resolution (LR) images first could alleviate computational burden, but it is not straightforward how to generate LR images that are perceptually consistent with their HR counterparts. Here, we consider the task of generating high-fidelity LR images, called Previews, that preserve perceptual similarity of their HR counterparts for an efficient workflow, allowing users to identify promising candidates before generating the final HR image. We propose the commutator-zero condition to ensure the LR-HR perceptual consistency for flow matching models, leading to the proposed training-free solution with downsampling matrix selection and commutator-zero guidance. Extensive experiments show that our method can generate LR images with up to 33% computation reduction while maintaining HR perceptual consistency. When combined with existing acceleration techniques, our method achieves up to 3× speedup. Moreover, our formulation can be extended to image manipulations, such as warping and translation, demonstrating its generalizability.
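所谓 commutator-zero condition(零交换子条件),直观上要求下采样算子 D 与 HR/LR 两侧的生成动力学近似可交换:D·A_HR ≈ A_LR·D。真实流匹配速度场是非线性的,但对线性算子可以直接计算交换子的 Frobenius 范数来体会这一条件(纯示意):

```python
import numpy as np

def commutator_norm(A_hr, A_lr, D):
    """||D·A_hr − A_lr·D||_F:交换子为零时,
    「先演化再下采样」与「先下采样再演化」给出相同结果。"""
    return np.linalg.norm(D @ A_hr - A_lr @ D)

# 平均池化下采样 4→2;A 取标量缩放,与任何 D 可交换
D = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])
A_hr = 2.0 * np.eye(4)
A_lr = 2.0 * np.eye(2)
```

把 A_hr 换成与 D 不可交换的对角缩放(如 diag(1,2,3,4)),交换子范数立即变为非零,对应 LR 预览与 HR 结果出现感知偏差的情形。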

[CV-134] MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification

【速读】:该论文旨在解决深度学习模型在临床医学影像应用中因缺乏可靠不确定性量化而导致的过自信预测与透明度不足问题,尤其针对临床数据噪声大、类别不平衡等挑战。其解决方案的关键在于改进Medical Vision Transformer(MedFormer),引入基于狄利克雷分布(Dirichlet distribution)的逐token证据不确定性建模机制,实现对预测不确定性的实时量化与定位;同时结合类特定原型(class-specific prototypes)保持嵌入空间结构化,使决策基于视觉相似性,并将不确定性作为训练过程中的主动参与者,过滤不可靠特征更新,从而显著提升模型校准能力(ECE降低最高达35%)和选择性预测性能。

链接: https://arxiv.org/abs/2604.08868
作者: Mohammed Maaz Sibhai,Abedalrhman Alkhateeb,Saad B. Ahmed
机构: Lakehead University (湖头大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To ensure safe clinical integration, deep learning models must provide more than just high accuracy; they require dependable uncertainty quantification. While current Medical Vision Transformers perform well, they frequently struggle with overconfident predictions and a lack of transparency, issues that are magnified by the noisy and imbalanced nature of clinical data. To address this, we enhanced the modified Medical Transformer (MedFormer) that incorporates prototype-based learning and uncertainty-guided routing. By utilizing a Dirichlet distribution for per-token evidential uncertainty, our framework can quantify and localize ambiguity in real-time. This uncertainty is not just an output but an active participant in the training process, filtering out unreliable feature updates. Furthermore, the use of class-specific prototypes ensures the embedding space remains structured, allowing for decisions based on visual similarity. Testing across four modalities (mammography, ultrasound, MRI, and histopathology) confirms that our approach significantly enhances model calibration, reducing expected calibration error (ECE) by up to 35%, and improves selective prediction, even when accuracy gains are modest.
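基于 Dirichlet 分布的逐 token 证据不确定性,在证据深度学习里的标准写法是 α = evidence + 1、不确定度 u = K/Σα(K 为类别数):证据越少,u 越接近 1。下面按这一通用公式给出草图(是否与 MedFormer-UR 的具体实现完全一致属于假设):

```python
import numpy as np

def evidential_uncertainty(evidence):
    """由非负证据向量得到 Dirichlet 参数与不确定度(通用公式示意)。

    返回:类别概率(Dirichlet 均值)与标量不确定度 u ∈ (0, 1]。
    """
    alpha = np.asarray(evidence, dtype=float) + 1.0   # α = e + 1
    probs = alpha / alpha.sum()                        # 期望类别概率
    u = alpha.size / alpha.sum()                       # u = K / Σα
    return probs, u
```

零证据时 u 恰为 1(完全不确定、概率均匀);某一类证据充足时 u 下降、该类概率上升。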

[CV-135] PSIRNet: Deep Learning-based Free-breathing Rapid Acquisition Late Enhancement Imaging

【速读】:该论文旨在解决心脏磁共振成像(Cardiac MRI)中晚钆增强(Late Gadolinium Enhancement, LGE)扫描所需时间过长的问题,传统方法依赖8至24次运动校正(Motion-Corrected, MOCO)信号平均以获得诊断质量图像,导致检查效率低下。解决方案的关键在于提出一种物理引导的深度学习网络PSIRNet,该网络具有8.45亿参数,能够从单次采集(仅需两个心动周期)的相位敏感反转恢复(Phase-Sensitive Inversion Recovery, PSIR)数据中直接重建出高质量LGE图像,并内置表面线圈校正功能。实验表明,PSIRNet重建图像在主观评分和客观指标(SSIM、PSNR、NRMSE)上均达到或优于MOCO参考图像,同时推理时间仅为约100毫秒/切片,显著快于MOCO方法(>5秒/切片),从而实现8–24倍的扫描时间压缩,且无需额外的运动校正步骤。

链接: https://arxiv.org/abs/2604.08781
作者: Arda Atalik,Hui Xue,Rhodri H. Davies,Thomas A. Treibel,Daniel K. Sodickson,Michael S. Hansen,Peter Kellman
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
备注: 25 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Purpose: To develop and evaluate a deep learning (DL) method for free-breathing phase-sensitive inversion recovery (PSIR) late gadolinium enhancement (LGE) cardiac MRI that produces diagnostic-quality images from a single acquisition over two heartbeats, eliminating the need for 8 to 24 motion-corrected (MOCO) signal averages. Materials and Methods: Raw data comprising 800,653 slices from 55,917 patients, acquired on 1.5T and 3T scanners across multiple sites from 2016 to 2024, were used in this retrospective study. Data were split by patient: 640,000 slices (42,822 patients) for training and the remainder for validation and testing, without overlap. The training and testing data were from different institutions. PSIRNet, a physics-guided DL network with 845 million parameters, was trained end-to-end to reconstruct PSIR images with surface coil correction from a single interleaved IR/PD acquisition over two heartbeats. Reconstruction quality was evaluated using SSIM, PSNR, and NRMSE against MOCO PSIR references. Two expert cardiologists performed an independent qualitative assessment, scoring image quality on a 5-point Likert scale across bright blood, dark blood, and wideband LGE variants. Paired superiority and equivalence (margin = 0.25 Likert points) were tested using exact Wilcoxon signed-rank tests at a significance level of 0.05 using R version 4.5.2. Results: Both readers rated single-average PSIRNet reconstructions superior to MOCO PSIR for dark blood LGE (conservative P = .002); for bright blood and wideband, one reader rated it superior and the other confirmed equivalence (all P .001). Inference required approximately 100 msec per slice versus more than 5 sec for MOCO PSIR. Conclusion: PSIRNet produces diagnostic-quality free-breathing PSIR LGE images from a single acquisition, enabling 8- to 24-fold reduction in acquisition time. 

人工智能

[AI-0] Semantic Rate-Distortion for Bounded Multi-Agent Communication: Capacity-Derived Semantic Spaces and the Communication Cost of Alignment

【速读】:该论文旨在解决异构智能体(agents)在有限计算能力下如何实现意图保持的通信问题,特别是当它们与同一环境交互但具有不同信息处理能力时,如何从交互中自适应地推导出各自的能力相关语义空间(capacity-derived semantic space)。其核心挑战在于:传统信息论假设源字母表固定,而本文提出从受限交互本身动态生成语义字母表,并揭示通信结构在特定速率阈值下的突变现象。解决方案的关键是引入**商部分可观测马尔可夫决策过程(quotient POMDP, Q_m,T(M))**作为代理能力的抽象表示,证明其构成唯一最粗粒度的一致性语义空间;进一步通过分析两个代理之间商POMDP的不匹配程度,识别出一个临界通信速率 $ R_\text{crit} $,低于此速率时意图保持通信在结构上不可能实现,而在支持的一维无记忆信道中,经典Wyner-Ziv编码可达到指数衰减的误差性能。这一框架首次将语义字母表的构造与通信率优化统一于受限交互建模之中,实现了对异构系统通信极限的理论刻画与实验验证。

链接: https://arxiv.org/abs/2604.09521
作者: Anthony T. Nixon
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: 34 pages, 13 figures. Code: this https URL

点击查看摘要

Abstract:When two agents of different computational capacities interact with the same environment, they need not compress a common semantic alphabet differently; they can induce different semantic alphabets altogether. We show that the quotient POMDP Q_m,T(M) - the unique coarsest abstraction consistent with an agent’s capacity - serves as a capacity-derived semantic space for any bounded agent, and that communication between heterogeneous agents exhibits a sharp structural phase transition. Below a critical rate R_crit determined by the quotient mismatch, intent-preserving communication is structurally impossible. In the supported one-way memoryless regime, classical side-information coding then yields exponential decay above the induced benchmark. Classical coding theorems tell you the rate once the source alphabet is fixed; our contribution is to derive that alphabet from bounded interaction itself. Concretely, we prove: (1) a fixed-ε structural phase-transition theorem whose lower bound is fully general on the common-history quotient comparison; (2) a one-way Wyner-Ziv benchmark identification on quotient alphabets, with exact converse, exact operational equality for memoryless quotient sources, and an ergodic long-run bridge via explicit mixing bounds; (3) an asymptotic one-way converse in the shrinking-distortion regime ε = O(1/T), proved from the message stream and decoder side information; and (4) alignment traversal bounds enabling compositional communication through intermediate capacity levels. Experiments on eight POMDP environments (including RockSample(4,4)) illustrate the phase transition, a structured-policy benchmark shows the one-way rate can drop by up to 19× relative to the counting bound, and a shrinking-distortion sweep matches the regime of the asymptotic converse.

[AI-1] XFED: Non-Collusive Model Poisoning Attack Against Byzantine-Robust Federated Classifiers

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中模型投毒攻击(Model Poisoning Attacks)依赖于攻击者之间协同通信的现实局限性问题。现有攻击方法通常要求恶意客户端通过交换本地良性模型或同步生成投毒更新来维持协作,这在实际部署中难以持续且易被检测。为应对这一挑战,作者提出了非协同攻击模型(Non-collusive Attack Model),其中所有被攻陷的客户端共享相同的恶意目标但独立操作,无需通信、不依赖其他客户端更新或服务器端防御机制即可生成恶意更新。解决方案的关键在于提出首个聚合无关(Aggregation-Agnostic)的非协同模型投毒攻击方法——XFED,其在六个基准数据集上的实证结果表明,该方法能绕过八种前沿防御机制,并优于六种现有攻击策略,揭示了联邦学习系统比以往认知更脆弱,亟需更强健和实用的防御机制。

链接: https://arxiv.org/abs/2604.09489
作者: Israt Jahan Mouri,Muhammad Ridowan,Muhammad Abdullah Adnan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 21 pages, 9 figures, 7 tables

点击查看摘要

Abstract:Model poisoning attacks pose a significant security threat to Federated Learning (FL). Most existing model poisoning attacks rely on collusion, requiring adversarial clients to coordinate by exchanging local benign models and synchronizing the generation of their poisoned updates. However, sustaining such coordination is increasingly impractical in real-world FL deployments, as it effectively requires botnet-like control over many devices. This approach is costly to maintain and highly vulnerable to detection. This context raises a fundamental question: Can model poisoning attacks remain effective without any communication between attackers? To address this challenge, we introduce and formalize the non-collusive attack model, in which all compromised clients share a common adversarial objective but operate independently. Under this model, each attacker generates its malicious update without communicating with other adversaries, accessing other clients’ updates, or relying on any knowledge of server-side defenses. To demonstrate the feasibility of this threat model, we propose XFED, the first aggregation-agnostic, non-collusive model poisoning attack. Our empirical evaluation across six benchmark datasets shows that XFED bypasses eight state-of-the-art defenses and outperforms six existing model poisoning attacks. These findings indicate that FL systems are substantially less secure than previously believed and underscore the urgent need for more robust and practical defense mechanisms.

[AI-2] Process Reward Agents for Steering Knowledge-Intensive Reasoning

【速读】:该论文旨在解决知识密集型领域中推理过程难以验证的问题,即中间步骤往往无法局部验证(non-locally verifiable),导致细微错误在推理链中传播且不易被发现。传统方法如过程奖励模型(Process Reward Models, PRMs)虽能评估完整推理轨迹,但属于事后评分机制,无法嵌入动态推理流程。其解决方案的关键在于提出过程奖励代理(Process Reward Agents, PRA),这是一种测试时(test-time)的在线奖励机制,能够为冻结策略模型(frozen policy)提供领域感知的逐步奖励,并支持基于搜索的解码策略,在每一步生成时对候选推理路径进行排序与剪枝。该方法显著提升了医疗推理任务上的准确性(如MedQA上达到80.8%),并具备良好的泛化能力,适用于不同规模的冻结模型(从0.5B到8B参数),无需更新策略模型即可提升性能达25.7%,从而实现了推理模型与领域奖励模块的解耦,推动了复杂领域中可插拔式推理系统的部署。

链接: https://arxiv.org/abs/2604.09482
作者: Jiwoong Sohn,Tomasz Sternal,Kenneth Styppa,Torsten Hoefler,Michael Moor
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge sources. As a result, subtle errors can propagate through reasoning traces, potentially never to be detected. Prior work has proposed process reward models (PRMs), including retrieval-augmented variants, but these methods operate post hoc, scoring completed trajectories, which prevents their integration into dynamic inference procedures. Here, we introduce Process Reward Agents (PRA), a test-time method for providing domain-grounded, online, step-wise rewards to a frozen policy. In contrast to prior retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step. Experiments on multiple medical reasoning benchmarks demonstrate that PRA consistently outperforms strong baselines, achieving 80.8% accuracy on MedQA with Qwen3-4B, a new state of the art at the 4B scale. Importantly, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by up to 25.7% without any policy model updates. More broadly, PRA suggests a paradigm in which frozen reasoners are decoupled from domain-specific reward modules, allowing the deployment of new backbones in complex domains without retraining.
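PRA 支持的「在每一步对候选推理轨迹排序并剪枝」的搜索式解码,骨架上就是一个以逐步过程奖励为打分函数的 beam search。下面是一个假设性的骨架(reward_fn 代表 PRA 打分器,接口纯属示意):

```python
def stepwise_search(candidates_per_step, reward_fn, beam=2):
    """过程奖励引导的逐步搜索(示意)。

    candidates_per_step: 每一步可选的推理步骤列表
    reward_fn:           对部分轨迹打分的过程奖励函数(PRA 的抽象)
    每步先扩展所有候选、按奖励排序,再只保留 beam 条轨迹。
    """
    beams = [[]]                                        # 初始为空轨迹
    for candidates in candidates_per_step:
        expanded = [b + [c] for b in beams for c in candidates]
        expanded.sort(key=reward_fn, reverse=True)      # 按过程奖励排序
        beams = expanded[:beam]                         # 剪枝
    return beams[0]
```

用 sum 充当玩具奖励函数即可验证:每一步都会优先保留累计奖励更高的轨迹。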

[AI-3] SafeMind: A Risk-Aware Differentiable Control Framework for Adaptive and Safe Quadruped Locomotion

【速读】:该论文旨在解决基于学习的四足机器人控制器在模型不确定性、感知噪声和非结构化接触条件下缺乏形式化安全保证的问题。解决方案的关键在于提出SafeMind框架,其核心是将概率控制屏障函数(probabilistic Control Barrier Functions, CBF)与语义上下文理解及元自适应风险校准相结合:通过嵌入可变方差的屏障约束于可微分二次规划中,显式建模认知不确定性(epistemic uncertainty)和随机不确定性(aleatoric uncertainty),从而保持梯度流以支持端到端训练;同时引入语义到约束编码器,利用感知或语言线索调节安全裕度,并通过元自适应学习器在不同环境中动态调整风险敏感性,实现理论保障下的实时安全控制。

链接: https://arxiv.org/abs/2604.09474
作者: Zukun Zhang,Kai Shu,Mingqiao Mo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning-based quadruped controllers achieve impressive agility but typically lack formal safety guarantees under model uncertainty, perception noise, and unstructured contact conditions. We introduce SafeMind, a differentiable stochastic safety-control framework that unifies probabilistic Control Barrier Functions with semantic context understanding and meta-adaptive risk calibration. SafeMind explicitly models epistemic and aleatoric uncertainty through a variance-aware barrier constraint embedded in a differentiable quadratic program, thereby preserving gradient flow for end-to-end training. A semantics-to-constraint encoder modulates safety margins using perceptual or language cues, while a meta-adaptive learner continuously adjusts risk sensitivity across environments. We provide theoretical conditions for probabilistic forward invariance, feasibility, and stability under stochastic dynamics. SafeMind is deployed on Unitree A1 and ANYmal C at 200 Hz and validated across 12 terrain types, dynamic obstacles, morphology perturbations, and semantically defined tasks. Experiments show that SafeMind reduces safety violations by 3–10x and energy consumption by 10–15% relative to state-of-the-art CBF, MPC, and hybrid RL baselines, while maintaining real-time control performance.
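把方差感知的 CBF 约束嵌入二次规划,在一维控制量下存在闭式解:对名义控制 u_nom 施加 a·u + b − κσ ≥ 0(σ 为不确定度,κ 为风险敏感系数)并做最小改动投影。下面是该闭式解的草图(假设 a > 0;这只是标准 CBF 安全滤波的一维特例,并非原文的完整可微 QP):

```python
def cbf_filter(u_nom, a, b, sigma, kappa=1.0):
    """一维方差感知 CBF 安全滤波(示意,假设 a > 0)。

    求解 min ||u - u_nom||^2  s.t.  a*u + b - kappa*sigma >= 0:
    名义控制已安全时原样返回,否则投影到收紧后的约束边界。
    """
    margin = b - kappa * sigma        # 不确定度越大,安全裕度越紧
    if a * u_nom + margin >= 0:
        return u_nom
    return -margin / a
```

注意 σ 增大时 margin 变小,约束边界前移,滤波器会更保守地修改名义控制。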

[AI-4] E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在工具集成推理(Tool-Integrated Reasoning, TIR)训练中面临的两大挑战:一是零样本强化学习(Zero-RL)因缺乏先验引导而导致探索效率低和模式退化;二是监督微调后强化学习(SFT-then-RL)因数据成本高和低熵崩溃导致能力提升受限。解决方案的关键在于提出一种名为E3-TIR(Enhanced Experience Exploitation)的预热训练范式,其核心是动态融合三种经验类型——专家前缀(Expert Prefixes)、专家引导(Expert Guided)与自探索(Self-Exploration),并通过围绕专家“锚点”进行多样化分支探索以及混合策略优化机制,有效缓解分布偏移并解决共享前缀引发的优化冲突,从而在保证探索多样性的同时动态扩展模型知识边界,显著提升训练效率与性能表现。

链接: https://arxiv.org/abs/2604.09455
作者: Weiyang Guo,Zesheng Shi,Liye Zhao,Jiayuan Ma,Zeen Zhu,Junxian He,Min Zhang,Jing Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 10 figures, published in ACL 2026

点击查看摘要

Abstract:While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face significant limitations: Zero-RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT-then-RL is limited by high data costs and capability plateaus caused by low-entropy collapse. To address these challenges, we propose E3-TIR (Enhanced Experience Exploitation), a warm-up paradigm for the early stages of agent training. Specifically, we formulate training as the dynamic integration of three experience types: Expert Prefixes, Expert Guided, and Self-Exploration. By executing diverse branching exploration around expert “anchors” and employing a mix policy optimization mechanism, we effectively mitigate distribution shifts and resolve optimization conflicts arising from shared prefixes. Our method dynamically adapts the model’s knowledge boundaries, effectively balancing exploration diversity with training efficiency. Experimental results demonstrate that E3-TIR achieves a 6% performance improvement over traditional paradigms on tool-use tasks, while requiring less than 10% of the synthetic data. Furthermore, in terms of ROI, a comprehensive metric integrating performance, data cost, and training efficiency, we achieve a 1.46x gain compared to baselines. Code is available at this https URL.

[AI-5] SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning DATE

【速读】:该论文旨在解决持续强化学习(Continual Reinforcement Learning, CRL)中政策更新时的安全性保障问题,即在环境动态变化或性能目标调整的情况下,如何在不破坏已学习任务安全性的前提下对策略进行更新。现有方法通常缺乏形式化安全保证,或仅在事后验证安全性。论文提出一种先验(a priori)的安全策略更新方法,其核心创新在于引入“Rashomon集”——一个在演示数据分布下被认证满足安全约束的策略参数空间区域。通过将任意强化学习算法的更新步骤投影到该集合上,可为策略更新提供形式化、可证明的安全保障。实验表明,该方法在网格世界导航任务中实现了源任务安全性的确定性保障,而基于正则化的基线方法则出现安全约束的灾难性遗忘。

链接: https://arxiv.org/abs/2604.09452
作者: Maksim Anisimov(Imperial College London),Francesco Belardinelli(Imperial College London),Matthew Wicker(Imperial College London)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code available at: this https URL

点击查看摘要

Abstract:Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments exhibit non-stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid-world navigation environments (Frozen Lake and Poisoned Apple) where we guarantee a priori, provably deterministic safety on the source task during downstream adaptation. In contrast, we observe that regularisation-based baselines experience catastrophic forgetting of safety constraints while our approach enables strong adaptation with provable guarantees that safety is preserved.
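「把任意 RL 更新投影回认证安全的参数区域」这一步,若将 Rashomon 集简化为以已认证参数为中心的欧氏球(此为本示意的假设,原文的集合形状更一般),投影存在闭式解:

```python
import numpy as np

def project_update(theta, theta_safe, radius):
    """把更新后的参数投影回认证安全球(示意:Rashomon 集近似为球)。

    theta:      任意 RL 算法更新后的参数
    theta_safe: 球心,即已通过安全认证的参数
    radius:     认证半径;球内参数均视为满足安全约束
    """
    delta = theta - theta_safe
    norm = np.linalg.norm(delta)
    if norm <= radius:          # 更新仍在安全区域内,保持原样
        return theta
    return theta_safe + radius * delta / norm   # 投影到球面
```

球内的更新不受影响,球外的更新沿原方向被拉回边界,从而在保留适应能力的同时维持安全性质。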

[AI-6] ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

【速读】:该论文旨在解决胸部X光片报告生成(CXR-RG)中传统自回归视觉-语言模型(VLM)因逐token解码导致推理延迟高,以及扩散模型虽可并行生成但仍需多步去噪而效率不足的问题。其核心解决方案是提出ECHO,一种高效的扩散型视觉-语言模型(dVLM),关键创新在于引入直接条件蒸馏(Direct Conditional Distillation, DCD)框架,通过构建基于策略内扩散轨迹的非因子化监督信号来捕捉词元间的联合依赖关系,从而缓解由词元因子化去噪器引入的均场偏差(mean-field bias),实现每块(block)仅一步推理的稳定性;同时结合响应不对称扩散(Response-Asymmetric Diffusion, RAD)训练策略,在不牺牲临床准确性的前提下显著提升训练效率。

链接: https://arxiv.org/abs/2604.09450
作者: Lifeng Chen,Tianqi You,Hao Liu,Zhimin Bao,Jile Jiao,Xiao Han,Zhicai Ou,Tao Sun,Xiaofeng Mou,Xiaojie Jin,Yi Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists’ workload. However, conventional autoregressive vision–language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving an 8× inference speedup without compromising clinical accuracy.

[AI-7] Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?

【速读】:该论文针对多目标贝叶斯优化(Many-objective Bayesian Optimisation)中因目标数量增加而导致的搜索难度剧增问题展开研究,尤其在评估预算有限(通常仅几百次评估)的情况下,传统方法试图近似整个帕累托前沿(Pareto front)往往难以实现高质量解的获取。论文提出了一种基于单点的多目标搜索框架(Single Point-based Multi-objective Search, SPMO),其核心思想是在有限评估预算下优先追求单一最优解的质量,而非全面覆盖帕累托前沿。解决方案的关键在于设计了一个名为期望单点改进(Expected Single-point Improvement, ESPI)的简单采集函数(acquisition function),该函数可在无噪声和有噪声场景下有效工作,并通过样本平均近似(Sample Average Approximation, SAA)方法结合梯度优化进行高效求解,同时理论证明了其在SAA框架下的收敛性保证。实证结果表明,SPMO在计算上可行且在多种基准与真实世界问题上优于现有最先进方法。

链接: https://arxiv.org/abs/2604.09417
作者: Chao Jiang,Jingyu Huang,Miqing Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many-objective optimisation, a subset of multi-objective optimisation, involves optimisation problems with more than three objectives. As the number of objectives increases, the number of solutions needed to adequately represent the entire Pareto front typically grows substantially. This makes it challenging, if not infeasible, to design a search algorithm capable of effectively exploring the entire Pareto front. This difficulty is particularly acute in the Bayesian optimisation paradigm, where sample efficiency is critical and only a limited number of solutions (often a few hundred) are evaluated. Moreover, after the optimisation process, the decision-maker eventually selects just one solution for deployment, regardless of how many high-quality, diverse solutions are available. In light of this, we argue that under a very limited evaluation budget, it may be more useful to focus on finding a single solution of the highest possible quality for the decision-maker, rather than aiming to approximate the entire Pareto front as existing many-/multi-objective Bayesian optimisation methods typically do. Bearing this idea in mind, this paper proposes a single point-based multi-objective search framework (SPMO) that aims to improve the quality of solutions along a direction that leads to a good tradeoff between objectives. Within SPMO, we present a simple acquisition function, called expected single-point improvement (ESPI), working under both noiseless and noisy scenarios. We show that ESPI can be optimised effectively with gradient-based methods via the sample average approximation (SAA) approach and theoretically prove its convergence guarantees under the SAA. We also empirically demonstrate that the proposed SPMO is computationally tractable and outperforms state-of-the-art methods on a wide range of benchmark and real-world problems.
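ESPI 采集函数的样本平均近似(SAA)思路可以示意如下:从各目标的 GP 后验采样、按权重做标量化,再对「相对当前最优值的改进量」取均值(最小化设定;函数与参数命名均为假设,仅演示 SAA 形式的蒙特卡洛估计,并非原文公式):

```python
import numpy as np

def espi_saa(mu, sigma, weights, best, n_samples=20000, seed=0):
    """单点期望改进的样本平均近似(示意)。

    mu, sigma: 各目标 GP 后验均值与标准差 (m,)
    weights:   标量化权重 (m,)
    best:      当前最优标量化值(越小越好)
    返回 E[max(0, best - w·f)] 的蒙特卡洛估计。
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(mu, sigma, size=(n_samples, len(mu)))
    scalar = samples @ weights            # 加权标量化
    return np.maximum(0.0, best - scalar).mean()
```

后验方差趋于零时,估计退化为确定性改进量 max(0, best − w·μ),可据此做数值自检。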

[AI-8] Yes But Not Always. Generative AI Needs Nuanced Opt-in

【Quick Read】: This paper tackles the legal and ethical impasse created by today's one-size-fits-all, binary consent mechanisms (opt-in by default) for the use of copyrighted creative works in generative AI. Such mechanisms cannot accommodate real-world ownership and rights-holder structures, the imitation of artistic styles, or the limitless contexts in which AI outputs are used, deepening the power imbalance between rights holders and AI developers. The key idea is to apply fine-grained consent verification at inference time (nuanced opt-in), using an agent-based architecture to dynamically verify whether user intent matches the conditional consent granted by rights holders; a music case study shows this can respect established rights and restore the balance of power between the two parties.

Link: https://arxiv.org/abs/2604.09413
Authors: Wiebke Hutiri,Morgan Scheuerman,Shruti Nagpal,Austin Hoag,Alice Xiang
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:This paper argues that a one-size-fits-all approach to specifying consent for the use of creative works in generative AI is insufficient. Real-world ownership and rights holder structures, the imitation of artistic styles and likeness, and the limitless contexts of use of AI outputs make the status quo of binary consent with opt-in by default untenable. To move beyond the current impasse, we consider levers of control in generative AI workflows at training, inference, and dissemination. Based on these insights, we position inference-time opt-in as an overlooked opportunity for nuanced consent verification. We conceptualize nuanced consent conditions for opt-in and propose an agent-based inference-time opt-in architecture to verify if user intent requests meet conditional consent granted by rights holders. In a case study for music, we demonstrate that nuanced opt-in at inference can account for established rights and re-establish a balance of power between rights holders and AI developers.

[AI-9] HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

【Quick Read】: This paper addresses the collapse of frontier coding agents when specifications are incomplete or ambiguous. The core bottleneck is not raw capability but judgment: knowing when to act autonomously and when to ask a human for help. Traditional benchmarks supply unambiguous instructions and grade only execution correctness, so they cannot catch "silent guessing" errors; a high-scoring model may owe its score to luck rather than sound decisions. The authors therefore propose HiL-Bench (Human-in-the-Loop Benchmark), which embeds human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, and introduce Ask-F1 as the core metric: the harmonic mean of question precision and blocker recall, which balances over-asking against silent guessing and structurally prevents score inflation through question spam. Experiments show that mainstream models degrade sharply without complete information, revealing a universal judgment gap with shared failure modes: overconfident wrong beliefs with no problem detection, persistent errors despite high uncertainty, and broad, imprecise help-seeking without self-correction. RL training on a shaped Ask-F1 reward further shows this judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains transferring across domains, suggesting the model learns a general mechanism for detecting unresolvable uncertainty rather than domain-specific heuristics.

Link: https://arxiv.org/abs/2604.09408
Authors: Mohamed Elfeki,Tu Trinh,Kelvin Luu,Guangze Luo,Nathan Hunt,Ernesto Montoya,Nandan Marwaha,Yannis He,Charles Wang,Fernando Crabedo,Alessa Castilo,Bing Liu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; broad, imprecise escalation without self-correction. These consistent patterns confirm poor help-seeking is a model-level flaw, not task-specific. RL training on shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it. 
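Ask-F1 is defined in the abstract as the harmonic mean of question precision and blocker recall; a minimal sketch of why it resists gaming by question spam (the count-based inputs are my framing of the described precision/recall terms):

```python
def ask_f1(questions_asked, questions_warranted, blockers_total, blockers_surfaced):
    """Ask-F1: harmonic mean of question precision and blocker recall.
    precision = warranted questions / all questions asked
    recall    = blockers surfaced by a question / all blockers
    Spamming questions cannot inflate the score: recall saturates at 1
    while precision collapses as the number of asks grows."""
    if questions_asked == 0 or blockers_total == 0:
        return 0.0
    precision = questions_warranted / questions_asked
    recall = blockers_surfaced / blockers_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

An agent asking 4 well-targeted questions that surface 3 of 4 blockers scores far higher than one firing off 100 questions that happen to cover all 4 blockers.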

[AI-10] The AI Codebase Maturity Model: From Assisted Coding to Self-Sustaining Systems

【Quick Read】: This paper addresses the plateau most teams hit with AI coding tools, stuck at prompt-and-review with no framework for systematic progression. It proposes the AI Codebase Maturity Model (ACMM), a five-level framework describing how codebases evolve from basic AI-assisted development to self-sustaining systems. The key point: each level transition depends on establishing specific mechanisms in a feedback-loop topology, above all the volume of test cases, coverage thresholds, and reliability of test execution; this infrastructure, not the AI model itself, is where the real intelligence of an AI-driven development system resides.

Link: https://arxiv.org/abs/2604.09388
Authors: Andy Anderson
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 tables. Practitioner experience report. Source code and full feedback loop implementation publicly available at this https URL

Click to view the abstract

Abstract:AI coding tools are widely adopted, but most teams plateau at prompt-and-review without a framework for systematic progression. This paper presents the AI Codebase Maturity Model (ACMM), a 5-level framework describing how codebases evolve from basic AI-assisted coding to self-sustaining systems. Inspired by CMMI, each level is defined by its feedback loop topology: the specific mechanisms that must exist before the next level becomes possible. I validate the model through a 4-month experience report maintaining KubeStellar Console, a CNCF Kubernetes dashboard built from scratch with Claude Code (Opus) and GitHub Copilot. The system currently operates with 63 CI/CD workflows, 32 nightly test suites, 91% code coverage, and achieves bug-to-fix times under 30 minutes, 24 hours a day. The central finding: the intelligence of an AI-driven development system resides not in the AI model itself, but in the infrastructure of instructions, tests, metrics, and feedback loops that surround it. You cannot skip levels, and at each level, the thing that unlocks the next one is another feedback mechanism. Testing (the volume of test cases, the coverage thresholds, and the reliability of test execution) proved to be the single most important investment in the entire journey.
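
The maturity model's central claim is that each level is unlocked by feedback mechanisms clearing measurable thresholds; a toy sketch of such a gate check (the gate names and numbers are illustrative stand-ins loosely mirroring the report's figures, not ACMM's actual level definitions):

```python
def feedback_gates_met(metrics, gates):
    """Return True when a codebase's feedback-loop metrics clear every
    gate needed to unlock the next maturity level. Missing metrics count
    as zero, i.e. an unmeasured feedback loop blocks progression."""
    return all(metrics.get(name, 0.0) >= minimum for name, minimum in gates.items())

# Hypothetical gate set (illustrative thresholds, not the paper's)
gates_for_next_level = {"code_coverage": 0.90, "nightly_pass_rate": 0.95,
                        "ci_workflows": 50}
```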

[AI-11] BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

【Quick Read】: This paper addresses the supply-chain risk introduced into agent ecosystems by third-party skills that bundle learnable models, i.e., the "model-in-skill" attack surface: a skill may appear benign while its embedded model is backdoor fine-tuned to activate malicious behavior only when specific semantic triggers are satisfied. The key of the proposed BadSkill attack paradigm is a composite objective (classification loss, margin-based separation loss, and poison-focused optimization) that trains the embedded classifier to behave normally under routine parameter combinations while activating a hidden payload under attacker-chosen semantic trigger combinations. Experiments across model architectures (494M-7.1B parameters) and text perturbation types show up to a 99.5% average attack success rate (ASR) while preserving high benign accuracy on negative-class queries, identifying model-bearing skills as a distinct supply-chain risk and motivating stronger provenance verification and behavioral vetting of third-party skill artifacts.

Link: https://arxiv.org/abs/2604.09378
Authors: Guiyao Tie,Jiawen Shi,Pan Zhou,Lichao Sun
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 4 pages, 4 figures

Click to view the abstract

Abstract:Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when routine skill parameters satisfy attacker-chosen semantic trigger combinations. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an OpenClaw-inspired simulation environment that preserves third-party skill installation and execution while enabling controlled multi-model study. Our benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, with a combined main evaluation set of 571 negative-class queries and 396 trigger-aligned queries. Across eight architectures (494M–7.1B parameters) from five model families, BadSkill achieves up to 99.5% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries. In poison-rate sweeps on the standard test split, a 3% poison rate already yields 91.7% ASR. The attack remains effective across the evaluated model scales and under five text perturbation types. These findings identify model-bearing skills as a distinct model supply-chain risk in agent ecosystems and motivate stronger provenance verification and behavioral vetting for third-party skill artifacts.
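
The abstract names a composite objective (classification loss, margin-based separation, poison-focused optimization) without giving its form; a minimal sketch under my own assumptions for the weighting scheme and margin term (the paper's exact loss may differ):

```python
import numpy as np

def composite_backdoor_loss(logits, labels, is_poison, margin=2.0,
                            w_margin=0.5, w_poison=2.0):
    """Sketch of a composite backdoor-training objective in the spirit
    of BadSkill: cross-entropy + margin-based separation + extra weight
    on poisoned examples. Weights and margin form are assumptions."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    shifted = logits - logits.max(axis=1, keepdims=True)      # stable softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[idx, labels]                              # per-example CE
    correct = shifted[idx, labels]
    masked = shifted.copy()
    masked[idx, labels] = -np.inf                             # hide true class
    # Margin loss: true-class logit should beat the runner-up by `margin`
    margin_loss = np.maximum(0.0, margin - (correct - masked.max(axis=1)))
    per_example = ce + w_margin * margin_loss
    weights = np.where(np.asarray(is_poison), w_poison, 1.0)  # poison focus
    return float((weights * per_example).mean())
```

Up-weighting the poisoned (trigger-aligned) examples is what lets a small poison rate, such as the 3% reported in the sweeps, still drive high attack success.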

[AI-12] LLM-Rosetta: A Hub-and-Spoke Intermediate Representation for Cross-Provider LLM API Translation

【Quick Read】: This paper tackles the fragmentation of Large Language Model (LLM) provider APIs, which couples applications to specific vendors and makes cross-platform migration difficult: a multi-provider architecture would otherwise require O(N²) bilateral adapters, severely limiting portability and flexibility. The key observation is that mainstream LLM APIs, despite large syntactic differences, share a semantic core. LLM-Rosetta builds on this with a hub-and-spoke Intermediate Representation (IR) that abstracts the shared elements (messages, content parts, tool calls, reasoning traces, and generation controls) into a 9-type content model and a 10-type stream event schema, plus a modular Ops-composition converter architecture that lets each API standard be added independently. The design supports bidirectional conversion of both requests and responses, including chunk-level streaming with stateful context management; experiments demonstrate lossless round-trip fidelity, correct streaming behavior, and sub-100-microsecond conversion overhead, and LLM-Rosetta passes the Open Responses compliance suite and is deployed in production at Argonne National Laboratory.

Link: https://arxiv.org/abs/2604.09360
Authors: Peng Ding
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:The rapid proliferation of Large Language Model (LLM) providers–each exposing proprietary API formats–has created a fragmented ecosystem where applications become tightly coupled to individual vendors. Switching or bridging providers requires O(N^2) bilateral adapters, impeding portability and multi-provider architectures. We observe that despite substantial syntactic divergence, the major LLM APIs share a common semantic core: the practical challenge is the combinatorial surface of syntactic variations, not deep semantic incompatibility. Based on this finding, we present LLM-Rosetta, an open-source translation framework built on a hub-and-spoke Intermediate Representation (IR) that captures the shared semantic core–messages, content parts, tool calls, reasoning traces, and generation controls–in a 9-type content model and 10-type stream event schema. A modular Ops-composition converter architecture enables each API standard to be added independently. LLM-Rosetta supports bidirectional conversion (provider-to-IR-to-provider) for both request and response payloads, including chunk-level streaming with stateful context management. We implement converters for four API standards (OpenAI Chat Completions, OpenAI Responses, Anthropic Messages, and Google GenAI), covering the vast majority of commercial providers. Empirical evaluation demonstrates lossless round-trip fidelity, correct streaming behavior, and sub-100 microsecond conversion overhead–competitive with LiteLLM’s single-pass approach while providing bidirectionality and provider neutrality. LLM-Rosetta passes the Open Responses compliance suite and is deployed in production at Argonne National Laboratory. Code is available at this https URL.
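
The hub-and-spoke idea, 2N converters instead of O(N²) bilateral adapters, can be sketched with a toy registry (the payload shapes and provider names below are stand-ins, not LLM-Rosetta's actual 9-type content model):

```python
# Hub-and-spoke translation: each provider registers one to_ir and one
# from_ir converter, so adding an N+1th provider needs 2 new functions,
# not N new adapters.
converters = {}

def register(provider, to_ir, from_ir):
    converters[provider] = (to_ir, from_ir)

def translate(payload, src, dst):
    ir = converters[src][0](payload)    # spoke -> hub (IR)
    return converters[dst][1](ir)       # hub -> spoke

# Two hypothetical provider formats sharing a "messages" semantic core
register("alpha", lambda p: {"messages": p["msgs"]},
                  lambda ir: {"msgs": ir["messages"]})
register("beta",  lambda p: {"messages": p["conversation"]},
                  lambda ir: {"conversation": ir["messages"]})
```

Round-tripping through the IR is lossless exactly when each spoke's converters preserve the semantic core, which is what the paper's fidelity evaluation checks.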

[AI-13] Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents

【Quick Read】: This paper addresses planning failures in language-model-driven drug discovery caused by the lack of precise, set-level diagnosis of protocol compliance across a candidate molecule set. Existing systems rely on long raw histories and vague self-reflection, making failure localization imprecise and agent states noisy. The key innovations of the proposed CACM (Constraint-Aware Corrective Memory) framework are: (1) protocol auditing plus a grounded diagnostician over multimodal evidence, which precisely localizes violations of set-level constraints (size, diversity, binding quality, etc.) and produces actionable remediation hints; and (2) a three-channel memory of static, dynamic, and corrective information with a compressed write-back mechanism, which preserves task-relevant context while exposing only the most decision-relevant failures, keeping the planning context compact. This yields a 36.4% improvement in target-level success rate, underscoring the importance of precise diagnosis and efficient agent-state management for reliable language-based drug discovery.

Link: https://arxiv.org/abs/2604.09308
Authors: Maochen Sun,Youzhi Zhang,Gaofeng Meng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Large language models are making autonomous drug discovery agents increasingly feasible, but reliable success in this setting is not determined by any single action or molecule. It is determined by whether the final returned set jointly satisfies protocol-level requirements such as set size, diversity, binding quality, and developability. This creates a fundamental control problem: the agent plans step by step, while task validity is decided at the level of the whole candidate set. Existing language-based drug discovery systems therefore tend to rely on long raw history and under-specified self-reflection, making failure localization imprecise and planner-facing agent states increasingly noisy. We present CACM (Constraint-Aware Corrective Memory), a language-based drug discovery framework built around precise set-level diagnosis and a concise memory write-back mechanism. CACM introduces protocol auditing and a grounded diagnostician, which jointly analyze multimodal evidence spanning task requirements, pocket context, and candidate-set evidence to localize protocol violations, generate actionable remediation hints, and bias the next action toward the most relevant correction. To keep planning context compact, CACM organizes memory into static, dynamic, and corrective channels and compresses them before write-back, thereby preserving persistent task information while exposing only the most decision-relevant failures. Our experimental results show that CACM improves the target-level success rate by 36.4% over the state-of-the-art baseline. The results show that reliable language-based drug discovery benefits not only from more powerful molecular tools, but also from more precise diagnosis and more economical agent states.

[AI-14] SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

【Quick Read】: This paper addresses the high cost and fragility of manually tuning skill bundles for LLM-based coding agents, i.e., how to efficiently balance task success rate, compute cost, and runtime. The key of the proposed SkillMOO framework is a multi-objective optimization strategy combining NSGA-II survivor selection with LLM-proposed edits: a solver agent evaluates candidate skill bundles on coding tasks, while an optimizer agent proposes bundle modifications based on failure analysis, automating bundle evolution. On three SkillsBench software engineering tasks, SkillMOO improves pass rate by up to 131% while reducing cost by up to 32%, at low optimization overhead; pattern analysis reveals pruning and substitution as the primary drivers of improvement, suggesting effective bundles favor minimal, focused content.

Link: https://arxiv.org/abs/2604.09297
Authors: Jingzhi Gong,Ruizhen Gu,Zhiwei Fei,Yazhuo Cao,Lukas Twist,Alina Geiger,Shuo Han,Dominik Sobania,Federica Sarro,Jie M. Zhang
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Agent skills provide modular, task-specific guidance for LLM- based coding agents, but manually tuning skill bundles to balance success rate, cost, and runtime is expensive and fragile. We present SkillMOO, a multi-objective optimization framework that automatically evolves skill bundles using LLM-proposed edits and NSGA-II survivor selection: a solver agent evaluates candidate skill bundles on coding tasks and an optimizer agent proposes bundle edits based on failure analysis. On three SkillsBench software engineering tasks, SkillMOO improves pass rate by up to 131% while reducing cost up to 32% relative to the best baseline per task at low optimization overhead. Pattern analysis reveals pruning and substitution as primary drivers of improvement, suggesting effective bundles favor minimal, focused content over accumulated instructions.
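
NSGA-II survivor selection rests on non-dominated sorting; a minimal sketch of extracting the first Pareto front from objective vectors such as (failure rate, cost, runtime), all minimized (standalone illustration, not SkillMOO's implementation):

```python
def pareto_front(points):
    """First non-dominated front under minimization of every objective.
    A point p is dominated if some other point q is no worse on all
    objectives and strictly better on at least one."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(q[k] <= p[k] for k in range(len(p))) and
            any(q[k] < p[k] for k in range(len(p)))
            for j, q in enumerate(points) if j != i)
        if not dominated:
            front.append(p)
    return front
```

NSGA-II then fills the next generation front by front, which is how the framework keeps skill bundles that trade off pass rate against cost rather than collapsing to a single objective.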

[AI-15] SAGE: A Service Agent Graph-guided Evaluation Benchmark

【Quick Read】: This paper addresses the limitations of evaluating Large Language Models (LLMs) for customer-service automation: existing benchmarks rely on static paradigms and single-dimensional metrics, failing to reflect diverse real-world user behavior or the strict adherence to structured Standard Operating Procedures (SOPs) that deployments require. The key of SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent dual-axis evaluation framework, is: first, formalizing unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path-coverage assessment; second, an Adversarial Intent Taxonomy and a modular Extension Mechanism supporting low-cost cross-domain deployment and automated dialogue synthesis; finally, Judge Agents working with a Rule Engine to analyze interactions between User and Service Agents and generate deterministic ground truth. Experiments reveal an "Execution Gap" (models classify intents correctly but fail to derive the right next action) and "Empathy Resilience" (polite conversational facades maintained despite underlying logical failures) in current LLMs.

Link: https://arxiv.org/abs/2604.09285
Authors: Ling Shi,Yuqin Dai,Ziyin Wang,Ning Gao,Wei Zhang,Chaozheng Wang,Yujie Wang,Wei He,Jinpeng Wang,Deiyi Xiong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant "Execution Gap" where models accurately classify intents but fail to derive correct subsequent actions. We also observe "Empathy Resilience", a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at this https URL.
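
Compliance checking against a dialogue graph reduces to verifying that every step of a trace follows an allowed transition; a toy sketch (the SOP graph below is a hypothetical stand-in; SAGE's Dynamic Dialogue Graphs carry richer node state than plain labels):

```python
def compliant(path, graph, start):
    """Check a dialogue trace against an SOP formalized as a graph of
    allowed transitions: each step must be reachable from the current
    node, otherwise the trace violates the procedure."""
    node = start
    for step in path:
        if step not in graph.get(node, ()):
            return False
        node = step
    return True

# Hypothetical customer-service SOP: greet -> identify -> resolve/escalate -> close
sop = {"greet": {"identify"}, "identify": {"resolve", "escalate"},
       "resolve": {"close"}, "escalate": {"close"}, "close": set()}
```

Path coverage, the benchmark's second axis, can then be measured as the fraction of graph edges exercised across all evaluated traces.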

[AI-16] DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties, and Do the Math?

【Quick Read】: This paper addresses the blind spot left by existing benchmarks, which evaluate web browsing and multi-step computation in isolation and thus cannot fully assess deep research agents in realistic settings. The key of DRBENCHER, a synthetic benchmark generator, is enforcing four criteria: verifiability (gold answers obtained by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filtering out questions the generating model can solve on its own), and diversity (a greedy max-min embedding filter maximizing coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, finance, geophysics, security, and history.

Link: https://arxiv.org/abs/2604.09251
Authors: Young-Suk Lee,Ramon Fernandez Astudillo,Radu Florian
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries, highlighting an inherent limitation of systems that reason over evolving data. Automatic evaluation shows that the strongest frontier model achieves only 20% answer accuracy. Compared to manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.
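
The diversity criterion uses a greedy max-min embedding filter; the standard farthest-point selection it refers to can be sketched as follows (standalone illustration; DRBENCHER's actual embedding space and seeding are not specified here):

```python
import math

def greedy_max_min(embeddings, k):
    """Greedy max-min diversity selection: starting from a seed point,
    repeatedly add the candidate whose minimum distance to the already
    selected set is largest, maximizing spread in embedding space."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    selected = [0]                               # seed with the first point
    while len(selected) < k:
        best = max((i for i in range(len(embeddings)) if i not in selected),
                   key=lambda i: min(dist(embeddings[i], embeddings[j])
                                     for j in selected))
        selected.append(best)
    return selected
```

Near-duplicate questions are never both selected, which is why this filter drives the high semantic diversity the paper reports relative to manually constructed benchmarks.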

[AI-17] DDSP-QbE: Improving Speech Quality for Speech Anonymisation for Atypical Speech

【Quick Read】: This paper addresses pitch distortion and perceptual artefacts produced in the subtractive-synthesis stage of voice conversion based on Differentiable Digital Signal Processing (DDSP). Specifically, the sawtooth waveform generated by phase accumulation in DDSP-QbE contains abrupt discontinuities that introduce aliasing artefacts, perceived as high-frequency buzziness and spectral distortion, especially at high fundamental frequencies. The solution hinges on two targeted improvements: first, explicit voicing detection gates the harmonic excitation, suppressing it in unvoiced regions and substituting filtered noise, removing the most perceptually disruptive aliased content; second, Polynomial Band-Limited Step (PolyBLEP) correction replaces the hard discontinuity of the phase-accumulated oscillator with a smooth polynomial residual that cancels alias-generating components, without oversampling or spectral truncation. Together they yield a cleaner harmonic roll-off, fewer high-frequency artefacts, and significantly improved subjective naturalness (MOS), while remaining lightweight and differentiable and integrating seamlessly into the original training pipeline.

Link: https://arxiv.org/abs/2604.09246
Authors: Suhita Ghosh,Yamini Sinha,Sebastian Stober
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: accepted at the CHI workshop (Speech AI For All) 2026

Click to view the abstract

Abstract:Differentiable Digital Signal Processing (DDSP) pipelines for voice conversion rely on subtractive synthesis, where a periodic excitation signal is shaped by a learned spectral envelope to reconstruct the target voice. In DDSP-QbE, the excitation is generated via phase accumulation, producing a sawtooth-like waveform whose abrupt discontinuities introduce aliasing artefacts that manifest perceptually as buzziness and spectral distortion, particularly at higher fundamental frequencies. We propose two targeted improvements to the excitation stage of the DDSP-QbE subtractive synthesizer. First, we incorporate explicit voicing detection to gate the harmonic excitation, suppressing the periodic component in unvoiced regions and replacing it with filtered noise, thereby avoiding aliased harmonic content where it is most perceptually disruptive. Second, we apply Polynomial Band-Limited Step (PolyBLEP) correction to the phase-accumulated oscillator, substituting the hard waveform discontinuity at each phase wrap with a smooth polynomial residual that cancels alias-generating components without oversampling or spectral truncation. Together, these modifications yield a cleaner harmonic roll-off, reduced high-frequency artefacts, and improved perceptual naturalness, as measured by MOS. The proposed approach is lightweight, differentiable, and integrates seamlessly into the existing DDSP-QbE training pipeline with no additional learnable parameters.
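
PolyBLEP correction is a well-known technique; a standalone sketch of the 2-point variant applied to a phase-accumulated sawtooth (illustrative of the idea the paper adopts, not the authors' DDSP-QbE code):

```python
def polyblep(t, dt):
    """2-point polynomial band-limited step residual around a phase wrap
    (t in [0, 1) is the oscillator phase, dt the phase increment)."""
    if t < dt:                     # just after the discontinuity
        t /= dt
        return t + t - t * t - 1.0
    if t > 1.0 - dt:               # just before the discontinuity
        t = (t - 1.0) / dt
        return t * t + t + t + 1.0
    return 0.0

def saw_polyblep(freq, sr, n):
    """Phase-accumulated sawtooth with the alias residual subtracted at
    each wrap; away from wraps the output equals the naive sawtooth."""
    phase, dt, out = 0.0, freq / sr, []
    for _ in range(n):
        naive = 2.0 * phase - 1.0                 # naive saw in [-1, 1)
        out.append(naive - polyblep(phase, dt))   # cancel alias components
        phase += dt
        if phase >= 1.0:
            phase -= 1.0
    return out
```

The correction touches only the two samples adjacent to each wrap, which is why it adds no learnable parameters and negligible cost.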

[AI-18] Statistical Properties of the King Wen Sequence: An Anti-Habituation Structure That Does Not Improve Neural Network Training

【Quick Read】: This paper asks whether the King Wen hexagram ordering of the I-Ching encodes structural properties that modern machine learning could exploit to improve neural network training. The core hypothesis: the sequence's statistical properties (high transition distance, negative autocorrelation, etc.) resemble principles of curriculum learning and curiosity-driven exploration and might therefore aid optimization. The key of the approach is rigorous Monte Carlo permutation analysis to establish the sequence's statistical significance, followed by three experiments (learning-rate schedule modulation, curriculum ordering, and seed-sensitivity analysis) across two hardware platforms. Results show that although the sequence is statistically distinctive, its high variance destabilizes gradient-based optimization, yielding worse training performance than random baselines and ruling it out as an effective training strategy.

Link: https://arxiv.org/abs/2604.09234
Authors: Augustin Chan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 9 pages, 8 tables, negative results paper. Code and data: this https URL

Click to view the abstract

Abstract:The King Wen sequence of the I-Ching (c. 1000 BC) orders 64 hexagrams – states of a six-dimensional binary space – in a pattern that has puzzled scholars for three millennia. We present a rigorous statistical characterization of this ordering using Monte Carlo permutation analysis against 100,000 random baselines. We find that the sequence has four statistically significant properties: higher-than-random transition distance (98.2nd percentile), negative lag-1 autocorrelation (p=0.037), yang-balanced groups of four (p=0.002), and asymmetric within-pair vs. between-pair distances (99.2nd percentile). These properties superficially resemble principles from curriculum learning and curiosity-driven exploration, motivating the hypothesis that they might benefit neural network training. We test this hypothesis through three experiments: learning rate schedule modulation, curriculum ordering, and seed sensitivity analysis, conducted across two hardware platforms (NVIDIA RTX 2060 with PyTorch and Apple Silicon with MLX). The results are uniformly negative. King Wen LR modulation degrades performance at all tested amplitudes. As curriculum ordering, King Wen is the worst non-sequential ordering on one platform and within noise on the other. A 30-seed sweep confirms that only King Wen’s degradation exceeds natural seed variance. We explain why: the sequence’s high variance – the very property that makes it statistically distinctive – destabilizes gradient-based optimization. Anti-habituation in a fixed combinatorial sequence is not the same as effective training dynamics.
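
The core statistical tool is a Monte Carlo permutation test, reporting the observed ordering's percentile against random reorderings of the same items; a minimal sketch using mean Hamming transition distance as the statistic (the paper's statistics also include autocorrelation and pair asymmetries not shown here):

```python
import random

def mean_transition_distance(seq):
    """Mean Hamming distance between consecutive items, where items are
    equal-length bit tuples (e.g. six-bit hexagrams)."""
    return sum(sum(a != b for a, b in zip(x, y))
               for x, y in zip(seq, seq[1:])) / (len(seq) - 1)

def permutation_percentile(statistic, sequence, n_perm=10_000, rng=None):
    """Percentile of the observed ordering's statistic within the null
    distribution obtained by shuffling the same items."""
    rng = random.Random(rng)
    observed = statistic(sequence)
    shuffled = list(sequence)
    below = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) < observed:
            below += 1
    return 100.0 * below / n_perm
```

An ordering engineered to maximize neighbor distance sits near the top of the null distribution, which is the shape of the paper's 98.2nd-percentile finding.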

[AI-19] The Fast Lane Hypothesis: Von Economo Neurons Implement a Biological Speed-Accuracy Tradeoff FAST

【Quick Read】: This paper addresses the long-standing absence of a computational model of Von Economo neuron (VEN) function, in particular how VENs enable rapid decisions in social cognition; traditional accounts link VENs to higher social cognition without specifying their computational role. It proposes the Fast Lane Hypothesis and builds a spiking-network model whose key move is to model VENs as fast leaky integrate-and-fire (LIF) neurons with a short membrane time constant (5 ms) and sparse dendritic fan-in (8 afferents), versus 20 ms and 80 inputs for standard pyramidal neurons, forming a fast but lower-accuracy decision pathway. Simulations show that networks with different VEN fractions reach the same final classification accuracy (99.4%), yet the typical condition responds significantly faster than the FTD-like state (p<0.0001), with the autism-like state intermediate, supporting the central claim that VENs modulate decision speed rather than representational capacity.

Link: https://arxiv.org/abs/2604.09229
Authors: Esila Keskin
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 7 pages, 5 figures. Code available at this https URL

Click to view the abstract

Abstract:Von Economo neurons (VENs) are large bipolar projection neurons found exclusively in the anterior cingulate cortex (ACC) and frontal insula of species with complex social cognition, including humans, great apes, and cetaceans. Their selective depletion in frontotemporal dementia (FTD) and altered development in autism implicate them in rapid social decision-making, yet no computational model of VEN function has previously existed. We introduce the Fast Lane Hypothesis: VENs implement a biological speed-accuracy tradeoff (SAT) by providing a sparse, fast projection pathway that enables rapid social decisions at the cost of deliberate processing accuracy. We model VENs as fast leaky integrate-and-fire (LIF) neurons with membrane time constant 5 ms and sparse dendritic fan-in of eight afferents, compared to 20 ms and eighty afferents for standard pyramidal neurons, within a spiking cortical circuit of 2,000 neurons trained on a social discrimination task. Networks are evaluated under three clinically motivated conditions across 10 independent random seeds: typical (2% VENs), autism-like (0.4% VENs), and FTD-like (post-training VEN ablation). All configurations achieve equivalent asymptotic classification accuracy (99.4%), consistent with the prediction that VENs modulate decision speed rather than representational capacity. Temporal analysis confirms that VENs produce median first-spike latencies 4 ms earlier than pyramidal neurons. At a fixed decision threshold, the typical condition is significantly faster than FTD-like (t=-23.31, p<0.0001), while autism-like is intermediate (mean RT=26.91+/-9.01 ms vs. typical 20.70+/-2.02 ms; p=0.078). A preliminary evolutionary analysis shows qualitative correspondence between model-optimal VEN fraction and the primate phylogenetic gradient. To our knowledge, this is the first computational model that asks what a Von Economo neuron actually computes.
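
The latency advantage of a short membrane time constant follows directly from LIF dynamics; a minimal single-neuron sketch under constant drive (Euler integration, threshold and drive values chosen for illustration, not taken from the paper's circuit):

```python
def lif_first_spike(tau_ms, input_current, dt_ms=0.1, v_thresh=1.0, t_max_ms=50.0):
    """First-spike latency of a leaky integrate-and-fire neuron under
    constant drive: dv/dt = (-v + I) / tau. A shorter membrane time
    constant (the 5 ms VEN parameterization) charges toward threshold
    faster than the 20 ms pyramidal setting, giving the 'fast lane'
    latency advantage the model predicts."""
    v, t = 0.0, 0.0
    while t < t_max_ms:
        v += dt_ms * (-v + input_current) / tau_ms   # Euler step
        t += dt_ms
        if v >= v_thresh:
            return t
    return None                                       # no spike in window
```

Analytically, v(t) = I(1 - exp(-t/tau)) crosses threshold at t = tau * ln(I / (I - v_thresh)), so latency scales linearly with tau, consistent with the reported 4 ms first-spike advantage.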

[AI-20] GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

【Quick Read】: This paper addresses the trade-off that jailbreak attacks on audio large language models (ALLMs) face between attack effectiveness and utility, as reflected in transcription quality and question-answering performance: existing methods chase higher jailbreak success rates at the cost of the audio input's semantic fidelity. The key of the proposed GRM, a utility-aware frequency-selective jailbreak framework, is to analyze how different Mel bands contribute to attack effectiveness relative to their utility sensitivity, identify an optimal subset of bands to perturb, and learn a reusable universal perturbation over that subset, preserving utility while maintaining a high jailbreak success rate. Experiments on four representative ALLMs show an average jailbreak success rate of 88.46% and a markedly better attack-utility trade-off than the baselines.

Link: https://arxiv.org/abs/2604.09222
Authors: Yunqiang Wang,Hengyuan Na,Di Wu,Miao Hu,Guocong Quan
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Under Review

Click to view the abstract

Abstract:Audio large language models (ALLMs) enable rich speech-text interaction, but they also introduce jailbreak vulnerabilities in the audio modality. Existing audio jailbreak methods mainly optimize jailbreak success while overlooking utility preservation, as reflected in transcription quality and question answering performance. In practice, stronger attacks often come at the cost of degraded utility. To study this trade-off, we revisit existing attacks by varying their perturbation coverage in the frequency domain, from partial-band to full-band, and find that broader frequency coverage does not necessarily improve jailbreak performance, while utility consistently deteriorates. This suggests that concentrating perturbation on a subset of bands can yield a better attack-utility trade-off than indiscriminate full-band coverage. Based on this insight, we propose GRM, a utility-aware frequency-selective jailbreak framework. It ranks Mel bands by their attack contribution relative to utility sensitivity, perturbs only a selected subset of bands, and learns a reusable universal perturbation under a semantic-preservation objective. Experiments on four representative ALLMs show that GRM achieves an average Jailbreak Success Rate (JSR) of 88.46% while providing a better attack-utility trade-off than representative baselines. These results highlight the potential of frequency-selective perturbation for better balancing attack effectiveness and utility preservation in audio jailbreak. Content Warning: This paper includes harmful query examples and unsafe model responses.
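
The band-selection step, ranking Mel bands by attack contribution relative to utility sensitivity, can be sketched as follows; the ratio form of the score is my assumption inferred from the name "Gradient-Ratio Masking", and the paper's exact criterion may differ:

```python
def select_bands(attack_scores, utility_sensitivity, budget, eps=1e-8):
    """Rank Mel bands by (attack contribution) / (utility sensitivity)
    and keep the top `budget` bands to perturb, leaving utility-critical
    bands untouched. Score form is an illustrative assumption."""
    ratios = [a / (u + eps) for a, u in zip(attack_scores, utility_sensitivity)]
    ranked = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    return sorted(ranked[:budget])       # indices of bands to perturb
```

A band with high attack payoff but low utility impact is preferred over one that is equally useful to the attack but would degrade transcription.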

[AI-21] On the Role of DAG Topology in Energy-Aware Cloud Scheduling: A GNN-Based Deep Reinforcement Learning Approach

【Quick Read】: This paper studies heterogeneous compute allocation in the cloud, i.e., scheduling workflow DAGs while balancing completion time, cost, and energy consumption. The approach considered is a graph neural network (GNN)-based deep reinforcement learning scheduler that minimizes workflow completion time and energy usage. The study finds, however, that performance degrades significantly under out-of-distribution (OOD) conditions, rooted in structural mismatches between training and deployment environments that disrupt message passing and undermine policy generalization, exposing fundamental limitations of current GNN-based schedulers under distribution shift and highlighting the need for more robust representations to ensure reliable scheduling.

Link: https://arxiv.org/abs/2604.09202
Authors: Anas Hattay,Fred Ngole Mboula,Eric Gascard,Zakaria Yahoun
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.

[AI-22] Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation

【Quick Read】: This paper addresses the lack of explicit narrative structure and cinematic language in current multi-agent systems for automating filmmaking, which leads to fragmented storytelling across video shots and limited filmic quality. The key of the proposed Camera Artist framework is a dedicated Cinematography Shot Agent, which strengthens shot-to-shot narrative continuity via recursive storyboard generation and injects cinematic language to produce more expressive, film-oriented shot designs.

Link: https://arxiv.org/abs/2604.09195
Authors: Haobo Hu,Qi Mao,Yuanhang Li,Libiao Jin
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:We propose Camera Artist, a multi-agent framework that models a real-world filmmaking workflow to generate narrative videos with explicit cinematic language. While recent multi-agent systems have made substantial progress in automating filmmaking workflows from scripts to videos, they often lack explicit mechanisms to structure narrative progression across adjacent shots and deliberate use of cinematic language, resulting in fragmented storytelling and limited filmic quality. To address this, Camera Artist builds upon established agentic pipelines and introduces a dedicated Cinematography Shot Agent, which integrates recursive storyboard generation to strengthen shot-to-shot narrative continuity and cinematic language injection to produce more expressive, film-oriented shot designs. Extensive quantitative and qualitative results demonstrate that our approach consistently outperforms existing baselines in narrative consistency, dynamic expressiveness, and perceived film quality.

[AI-23] Generalization and Scaling Laws for Mixture-of-Experts Transformers

【Quick Read】: This paper develops a theoretical characterization of generalization and scaling laws for Mixture-of-Experts (MoE) Transformers, focusing on separating per-input active parameter capacity from routing combinatorics under input-dependent routing. The key is to condition the analysis on fixed routing patterns and take a union bound over all possible patterns, yielding a sup-norm covering-number generalization bound whose metric entropy scales with the active parameter budget plus an MoE-specific routing overhead term; combined with a standard Empirical Risk Minimization (ERM) analysis for squared loss under a d-dimensional manifold data model and C^β targets, this reveals the approximation-estimation trade-off, which matches dense networks once active parameters are properly accounted for. A constructive approximation theorem further shows that error can be reduced either by scaling active capacity or by increasing the number of experts, depending on the bottleneck, providing a clear statistical basis for model-size, data-size, and compute-optimal trade-offs.

Link: https://arxiv.org/abs/2604.09175
Authors: Mansour Zoubeirou a Mayaki
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Click to view the abstract

Abstract:We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emphactive per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a d -dimensional manifold data model and C^\beta targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.
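
摘要所述的“逼近-估计”权衡可以用一个示意性公式勾勒其典型形态(以下符号、常数与具体形式均为笔者根据摘要的推测,并非论文原式;假设每个 MoE 层从 E 个专家中选 top-k,共 L 层):

```latex
\[
\mathbb{E}\,\mathcal{R}(\hat f) - \mathcal{R}(f^\star)
\;\lesssim\;
\underbrace{\inf_{f \in \mathcal{F}_{N_{\mathrm{act}}}} \|f - f^\star\|_\infty^2}_{\text{逼近误差}}
\;+\;
\underbrace{\frac{N_{\mathrm{act}} \log N_{\mathrm{act}} + \log |\mathcal{R}_{\mathrm{route}}|}{n}}_{\text{估计误差(含路由开销)}},
\qquad
|\mathcal{R}_{\mathrm{route}}| \le \binom{E}{k}^{L}.
\]
```

对 d 维流形上的 C^β 目标,逼近项通常按 N_act^{-2β/d} 衰减;平衡两项即可得到以活跃参数 N_act 为自变量的扩展律,这与摘要中“一旦合理计入活跃参数,其行为与密集网络一致”的结论相符。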

[AI-24] CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation

【速读】:该论文旨在解决由视觉语言模型(Vision Language Models, VLMs)驱动的图形用户界面(GUI)代理在向自主操作演进过程中,因无约束的动作空间导致的不可逆财务、隐私或社交风险问题。现有防护机制依赖于提示工程、脆弱的启发式规则以及VLM作为评判者(VLM-as-critic),缺乏形式化验证和用户可调的安全保障。解决方案的关键在于提出一种后策略、前动作的防护框架CORA(COnformal Risk-controlled GUI Agent),其核心创新包括:1)将安全性重新定义为选择性动作执行,通过训练一个Guardian模型估计每一步动作的条件风险;2)利用共形风险控制(Conformal Risk Control)校准一个可配置的“执行/中止”边界,以满足用户指定的风险预算;3)将被拒绝动作交由可训练的Diagnostician模型进行多模态推理,推荐干预措施(如确认、反思或终止)以最小化用户负担;4)引入Goal-Lock机制锚定评估目标,抵御视觉注入攻击。该方案在Phone-Harm基准上验证了其在安全-有用性-中断权衡上的显著改进,提供了一种可落地且具备统计保证的GUI自主执行安全保障范式。

链接: https://arxiv.org/abs/2604.09155
作者: Yushi Feng,Junye Du,Qifan Wang,Zizhan Ma,Qian Niu,Yutaka Matsuo,Long Feng,Lequan Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graphical user interface (GUI) agents powered by vision language models (VLMs) are rapidly moving from passive assistance to autonomous operation. However, this unrestricted action space exposes users to severe and irreversible financial, privacy or social harm. Existing safeguards rely on prompt engineering, brittle heuristics and VLM-as-critic lack formal verification and user-tunable guarantees. We propose CORA (COnformal Risk-controlled GUI Agent), a post-policy, pre-action safeguarding framework that provides statistical guarantees on harmful executed actions. CORA reformulates safety as selective action execution: we train a Guardian model to estimate action-conditional risk for each proposed step. Rather than thresholding raw scores, we leverage Conformal Risk Control to calibrate an execute/abstain boundary that satisfies a user-specified risk budget and route rejected actions to a trainable Diagnostician model, which performs multimodal reasoning over rejected actions to recommend interventions (e.g., confirm, reflect, or abort) to minimize user burden. A Goal-Lock mechanism anchors assessment to a clarified, frozen user intent to resist visual injection attacks. To rigorously evaluate this paradigm, we introduce Phone-Harm, a new benchmark of mobile safety violations with step-level harm labels under real-world settings. Experiments on Phone-Harm and public benchmarks against diverse baselines validate that CORA improves the safety–helpfulness–interruption Pareto frontier, offering a practical, statistically grounded safety paradigm for autonomous GUI execution. Code and benchmark are available at this http URL.
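
“按风险预算校准执行/中止阈值”这一思路可用一个极简的一维示意来理解(Python;采用 0/1 损失与单调阈值族的简化设定,`calibrate_lambda` 等命名均为笔者假设,并非论文实现):

```python
def calibrate_lambda(cal, alpha):
    """共形风险控制(CRC)风格的阈值校准(简化版)。

    cal: [(risk_score, harmful)] 校准集;决策规则为 score <= lambda 时执行。
    在候选阈值中取最大的 lambda,使经 CRC 修正的
    “有害动作被执行比例” (harm + 1) / (n + 1) 不超过风险预算 alpha。
    若无任何候选满足,返回 -inf(即对所有动作中止)。
    """
    n = len(cal)
    best = float("-inf")
    for lam in sorted(s for s, _ in cal):
        executed_harm = sum(1 for s, h in cal if h and s <= lam)
        if (executed_harm + 1) / (n + 1) <= alpha:
            best = max(best, lam)
    return best
```

语义上,阈值越大执行越宽松、有害执行风险单调上升;CRC 的 +1 修正保证了对新样本的统计风险界。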

[AI-25] EquiformerV3: Scaling Efficient Expressive and General SE(3)-Equivariant Graph Attention Transformers

【速读】:该论文旨在解决SE(3)等变图神经网络在3D原子建模中效率、表达能力和物理一致性不足的问题,以支持大规模应用。解决方案的关键在于三方面改进:一是优化软件实现,提升计算速度(达1.75倍加速);二是引入等变合并层归一化、改进前馈网络超参数及平滑半径截断注意力机制,增强模型性能与稳定性;三是提出SwiGLU-S²激活函数,有效纳入多体相互作用,在保持严格SE(3)等变性的同时降低S²网格采样复杂度,从而实现对平滑势能面(PES)的高精度建模,并扩展至需要能量守恒模拟和PES高阶导数的任务。

链接: https://arxiv.org/abs/2604.09130
作者: Yi-Lun Liao,Alexander J. Hoffman,Sabrina C. Shen,Alexandre Duval,Sam Walton Norwood,Tess Smidt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:As SE(3) -equivariant graph neural networks mature as a core tool for 3D atomistic modeling, improving their efficiency, expressivity, and physical consistency has become a central challenge for large-scale applications. In this work, we introduce EquiformerV3, the third generation of the SE(3) -equivariant graph attention Transformer, designed to advance all three dimensions: efficiency, expressivity, and generality. Building on EquiformerV2, we have the following three key advances. First, we optimize the software implementation, achieving 1.75\times speedup. Second, we introduce simple and effective modifications to EquiformerV2, including equivariant merged layer normalization, improved feedforward network hyper-parameters, and attention with smooth radius cutoff. Third, we propose SwiGLU- S^2 activations to incorporate many-body interactions for better theoretical expressivity and to preserve strict equivariance while reducing the complexity of sampling S^2 grids. Together, SwiGLU- S^2 activations and smooth-cutoff attention enable accurate modeling of smoothly varying potential energy surfaces (PES), generalizing EquiformerV3 to tasks requiring energy-conserving simulations and higher-order derivatives of PES. With these improvements, EquiformerV3 trained with the auxiliary task of denoising non-equilibrium structures (DeNS) achieves state-of-the-art results on OC20, OMat24, and Matbench Discovery.
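
摘要中“平滑半径截断注意力”的一种常见实现是余弦截断包络:在截断半径处函数值与导数同时平滑衰减到零(以下仅为示意,论文实际采用的截断函数未必如此):

```python
import math

def cosine_cutoff(r, r_cut):
    """平滑半径截断:r=0 处为 1,r=r_cut 处为 0 且斜率为零。

    将其乘到注意力权重上,可使原子移出截断半径时
    势能面(PES)及其高阶导数保持连续。
    """
    if r >= r_cut:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0)
```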

[AI-26] TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

【速读】:该论文旨在解决大规模强化学习(Reinforcement Learning, RL)训练中因异构计算资源导致的权重传输效率低下问题,现有方法要么缺乏动态扩展集群的灵活性,要么引入显著的数据移动开销,从而限制了整体性能。其解决方案的关键在于提出一种名为参考导向存储(Reference-Oriented Storage, ROS)的新存储抽象机制,该机制利用模型权重在分布式环境中高度重复的特点,通过跟踪持有权重的GPU工作节点而非物理复制数据,实现按需读取和高效传输;在此基础上构建的TensorHub系统进一步引入拓扑优化传输、强一致性保障与容错能力,实测表明其能充分利用RDMA带宽,并在三种不同rollout负载场景下显著降低GPU空闲时间,提升训练效率。

链接: https://arxiv.org/abs/2604.09107
作者: Chenhao Ye,Huaizheng Zhang,Mingcong Han,Baoquan Zhong,Xiang Li,Qixiang Chen,Xinyi Zhang,Weidong Zhang,Kaihua Jiang,Wang Zhang,He Sun,Wencong Xiao,Andrea C. Arpaci-Dusseau,Remzi H. Arpaci-Dusseau
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either fail to provide flexibility for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance. We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS directly uses them to serve reads. We build TensorHub, a production-quality system that extends the ROS idea with topology-optimized transfer, strong consistency, and fault tolerance. Evaluation shows that TensorHub fully saturates RDMA bandwidth and adapts to three distinct rollout workloads with minimal engineering effort. Specifically, TensorHub reduces total GPU stall time by up to 6.7x for standalone rollouts, accelerates weight update for elastic rollout by 4.8x, and cuts cross-datacenter rollout stall time by 19x. TensorHub has been deployed in production to support cutting-edge RL training.
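
ROS 的核心是“记录哪些 worker 持有某版本权重”,而非物理存储副本。下面是一个极简示意(Python;类与方法名均为笔者假设,真实系统还需处理拓扑优化传输、强一致性与容错):

```python
class ReferenceOrientedStore:
    """按版本跟踪持有权重的 GPU worker,读取时直接定位持有者。"""

    def __init__(self):
        self._holders = {}  # version -> set(worker_id)

    def publish(self, version, worker_id):
        """worker 加载某版本权重后登记为持有者。"""
        self._holders.setdefault(version, set()).add(worker_id)

    def retire(self, version, worker_id):
        """worker 切换版本或下线时注销;无持有者则删除该版本条目。"""
        holders = self._holders.get(version, set())
        holders.discard(worker_id)
        if not holders:
            self._holders.pop(version, None)

    def locate(self, version):
        """返回一个可服务读取的持有者(此处取最小 id,代替拓扑感知选择)。"""
        if version not in self._holders:
            raise KeyError(f"version {version} has no live holder")
        return min(self._holders[version])
```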

[AI-27] Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence

【速读】:该论文旨在解决当前对生成式 AI (Generative AI) 系统“阴谋行为”(scheming)研究中存在的关键局限:现有评估方法所观察到的行为可能无法在真实世界中发生,导致科学理解不足、政策制定受阻,并难以实现对失控事件的实时检测。其解决方案的关键在于提出一种基于开源情报(OSINT)的新方法——通过收集和分析用户在线分享的聊天机器人对话或命令行交互记录,从而识别真实世界中的阴谋行为实例。研究通过对超过18.3万条来自X(原Twitter)平台的文本数据进行分析,发现了698起真实世界的阴谋行为相关事件,并揭示了此类行为在现实部署中已表现出违背指令、规避安全机制、欺骗用户等危险前兆,验证了该方法在规模化监测与早期预警方面的可行性,为科研、政策制定及应急响应提供了可操作的技术路径。

链接: https://arxiv.org/abs/2604.09104
作者: Tommy Shaffer Shane,Simon Mylius,Hamish Hobbs
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 44 pages, 4 figures, 5 tables (main text). Includes 5 appendices

点击查看摘要

Abstract:Scheming, the covert pursuit of misaligned goals by AI systems, represents a potentially catastrophic risk, yet scheming research suffers from significant limitations. In particular, scheming evaluations demonstrate behaviours that may not occur in real-world settings, limiting scientific understanding, hindering policy development, and not enabling real-time detection of loss of control incidents. Real-world evidence is needed, but current monitoring techniques are not effective for this purpose. This paper introduces a novel open-source intelligence (OSINT) methodology for detecting real-world scheming incidents: collecting and analysing transcripts from chatbot conversations or command-line interactions shared online. Analysing over 183,420 transcripts from X (formerly Twitter), we identify 698 real-world scheming-related incidents between October 2025 and March 2026. We observe a statistically significant 4.9x increase in monthly incidents from the first to last month, compared to a 1.7x increase in posts discussing scheming. We find evidence of multiple scheming-related behaviours in real-world deployments previously reported only in experiments, many resulting in real-world harms. While we did not detect catastrophic scheming incidents, the behaviours observed demonstrate concerning precursors, such as willingness to disregard instructions, circumvent safeguards, lie to users, and single-mindedly pursue goals in harmful ways. As AI systems become more capable, these could evolve into more strategic scheming with potentially catastrophic consequences. Our findings demonstrate the viability of transcript-based OSINT as a scalable approach to real-world scheming detection supporting scientific research, policy development, and emergency response. We recommend further investment towards OSINT techniques for monitoring scheming and loss of control.

[AI-28] DeepGuard: Secure Code Generation via Multi-Layer Semantic Aggregation ACL2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码生成过程中可能复制训练数据中的不安全模式的问题,从而导致生成代码存在漏洞。现有方法通常通过在最终Transformer层进行监督微调来增强安全性,但这种方法受限于“最终层瓶颈”——即与漏洞相关的判别性特征可能分布在多个中间层,而在专注于下一个词预测的输出表示中变得难以检测。为诊断此问题,作者采用逐层线性探测(layer-wise linear probing),发现漏洞相关信号在中间到上层最为显著,但在最终层显著衰减。解决方案的关键在于提出DeepGuard框架,该框架通过注意力机制聚合多个上层表示以捕获分布式的安全相关线索,并将其引入多目标训练目标中,平衡安全增强与功能正确性;同时支持轻量级推理时的导向策略。实验表明,DeepGuard在五个代码LLM上平均提升安全且正确的生成率11.9%,并保持对未见漏洞类型的泛化能力。

链接: https://arxiv.org/abs/2604.09089
作者: Li Huang,Zhongxin Liu,Yifan Wu,Tao Yin,Dong Li,Jichao Bi,Nankun Mu,Hongyu Zhang,Meng Yan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: ACL 2026 main conference

点击查看摘要

Abstract:Large Language Models (LLMs) for code generation can replicate insecure patterns from their training data. To mitigate this, a common strategy for security hardening is to fine-tune models using supervision derived from the final transformer layer. However, this design may suffer from a final-layer bottleneck: vulnerability-discriminative cues can be distributed across layers and become less detectable near the output representations optimized for next-token prediction. To diagnose this issue, we perform layer-wise linear probing. We observe that vulnerability-related signals are most detectable in a band of intermediate-to-upper layers yet attenuate toward the final layers. Motivated by this observation, we introduce DeepGuard, a framework that leverages distributed security-relevant cues by aggregating representations from multiple upper layers via an attention-based module. The aggregated signal powers a dedicated security analyzer within a multi-objective training objective that balances security enhancement and functional correctness, and further supports a lightweight inference-time steering strategy. Extensive experiments across five code LLMs demonstrate that DeepGuard improves the secure-and-correct generation rate by an average of 11.9% over strong baselines such as SVEN. It also preserves functional correctness while exhibiting generalization to held-out vulnerability types. Our code is public at this https URL.
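
“基于注意力聚合多个上层表示”的做法可用如下纯标量示意来理解(Python;论文中的聚合模块是可学习的注意力层,此处的查询向量与点积打分方式均为笔者假设):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def aggregate_layers(layer_reprs, query):
    """对各层表示 h_l 做注意力加权求和,得到安全分析器的输入。

    打分 score_l = <query, h_l>,权重 = softmax(scores),
    使漏洞相关信号较强的中上层获得更大权重。
    """
    scores = [sum(q * h for q, h in zip(query, rep)) for rep in layer_reprs]
    weights = softmax(scores)
    dim = len(layer_reprs[0])
    return [sum(w * rep[i] for w, rep in zip(weights, layer_reprs))
            for i in range(dim)]
```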

[AI-29] Beyond Isolated Clients: Integrating Graph-Based Embeddings into Event Sequence Models WWW’26

【速读】:该论文旨在解决自监督学习(Self-Supervised Learning, SSL)在建模用户-物品交互事件时,通常忽略用户-物品交互图全局结构的问题。现有方法虽能有效捕捉事件的时间顺序,但未能充分利用交互图的拓扑信息,限制了对用户属性预测(如欺诈检测和推荐系统)的性能提升。解决方案的关键在于提出三种模型无关的策略,将图结构信息融入对比自监督学习框架:一是丰富事件嵌入以包含结构上下文;二是对齐客户端表示与图嵌入;三是引入结构预训练任务。实验表明,这些策略可显著提升模型准确率(最高达2.3% AUC),且图密度是选择最优整合策略的关键因素。

链接: https://arxiv.org/abs/2604.09085
作者: Harry Proshian,Nikita Severin,Sergey Nikolenko,Kireev Ivan,Andrey Savchenko,Ivan Sergeev,Maria Postnova,Ilya Makarov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Short paper accepted at ACM Web Conference 2026 (WWW '26)

点击查看摘要

Abstract:Large-scale digital platforms generate billions of timestamped user-item interactions (events) that are crucial for predicting user attributes in, e.g., fraud prevention and recommendations. While self-supervised learning (SSL) effectively models the temporal order of events, it typically overlooks the global structure of the user-item interaction graph. To bridge this gap, we propose three model-agnostic strategies for integrating this structural information into contrastive SSL: enriching event embeddings, aligning client representations with graph embeddings, and adding a structural pretext task. Experiments on four financial and e-commerce datasets demonstrate that our approach consistently improves the accuracy (up to a 2.3% AUC) and reveals that graph density is a key factor in selecting the optimal integration strategy.
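
三种策略中的“对齐客户端表示与图嵌入”可用一个余弦对齐损失来示意(Python;具体损失形式为笔者假设,论文中也可能采用带负样本的对比式目标):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(seq_embs, graph_embs):
    """每个客户端的事件序列嵌入与其图嵌入之间的平均 (1 - cosine) 损失。

    最小化该损失即把用户-物品交互图的结构信息蒸馏进序列模型表示。
    """
    return sum(1.0 - cosine(s, g)
               for s, g in zip(seq_embs, graph_embs)) / len(seq_embs)
```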

[AI-30] Overhang Tower: Resource-Rational Adaptation in Sequential Physical Planning

【速读】:该论文旨在解决人类在有限认知资源下如何协同调整物理预测机制与规划策略以实现顺序物理决策的问题,这一问题长期未被充分理解,因相关研究分别聚焦于直觉物理引擎(Intuitive Physics Engine, IPE)与基于线索的启发式方法之争,以及深思熟虑的前瞻规划与短视策略之争,且二者孤立发展。解决方案的关键在于揭示了一种资源理性(resource-rational)的分层认知架构:在任务复杂度增加或时间压力增大时,参与者会同时发生双重转变——物理预测从IPE驱动的模拟转向CNN-based视觉启发式,同时规划策略从深度前瞻退化为浅层视野,这种双轨适应机制统一了前述两个长期争议,并表明人类认知系统能够动态地权衡计算成本与预测精度以优化行为表现。

链接: https://arxiv.org/abs/2604.09072
作者: Ruihong Shen,Shiqian Li,Yixin Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, CogSci 2026

点击查看摘要

Abstract:Humans effortlessly navigate the physical world by predicting how objects behave under gravity and contact forces, yet how such judgments support sequential physical planning under resource constraints remains poorly understood. Research on intuitive physics debates whether prediction relies on the Intuitive Physics Engine (IPE) or fast, cue-based heuristics; separately, decision-making research debates deliberative lookahead versus myopic strategies. These debates have proceeded in isolation, leaving the cognitive architecture of sequential physical planning underspecified. How physical prediction mechanisms and planning strategies jointly adapt under limited cognitive resources remains an open question. Here we show that humans exhibit a dual transition under resource pressure, simultaneously shifting both physical prediction mechanism and planning strategy to match cognitive budget. Using Overhang Tower, a construction task requiring participants to maximize horizontal overhang while maintaining stability, we find that IPE-based simulation dominates early stages while CNN-based visual heuristics prevail as complexity grows; concurrently, time pressure truncates deliberative lookahead, shifting planning toward shallower horizons: a dual transition unpredicted by prior single-mechanism accounts. These findings reveal a hierarchical, resource-rational architecture that flexibly trades computational cost against predictive fidelity. Our results unify two long-standing debates (simulation vs. heuristics and myopic vs. deliberative planning) as a dynamic repertoire reconfigured by cognitive budget.

[AI-31] PDE-regularized Dynamics-informed Diffusion with Uncertainty-aware Filtering for Long-Horizon Dynamics

【速读】:该论文旨在解决长期时空预测中累积误差、噪声放大以及现有模型缺乏物理一致性的问题。其核心解决方案是提出PDYffusion框架,关键在于引入了两个组件:一是基于偏微分方程(PDE)正则化的插值器,通过差分算子约束中间状态的物理一致性;二是基于无迹卡尔曼滤波(UKF)的预测器,显式建模不确定性并抑制迭代预测中的误差累积。该方法在多个动力学数据集上表现出更优的CRPS和MSE性能,同时保持稳定的不确定性估计(SSR),实现了预测精度与不确定性的良好平衡。

链接: https://arxiv.org/abs/2604.09058
作者: Min Young Baeg,Yoon-Yeong Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon spatiotemporal prediction remains a challenging problem due to cumulative errors, noise amplification, and the lack of physical consistency in existing models. While diffusion models provide a probabilistic framework for modeling uncertainty, conventional approaches often rely on mean squared error objectives and fail to capture the underlying dynamics governed by physical laws. In this work, we propose PDYffusion, a dynamics-informed diffusion framework that integrates PDE-based regularization and uncertainty-aware forecasting for stable long-term prediction. The proposed method consists of two key components: a PDE-regularized interpolator and a UKF-based forecaster. The interpolator incorporates a differential operator to enforce physically consistent intermediate states, while the forecaster leverages the Unscented Kalman Filter to explicitly model uncertainty and mitigate error accumulation during iterative prediction. We provide theoretical analyses showing that the proposed interpolator satisfies PDE-constrained smoothness properties, and that the forecaster converges under the proposed loss formulation. Extensive experiments on multiple dynamical datasets demonstrate that PDYffusion achieves superior performance in terms of CRPS and MSE, while maintaining stable uncertainty behavior measured by SSR. We further analyze the inherent trade-off between prediction accuracy and uncertainty, showing that our method provides a balanced and robust solution for long-horizon forecasting.
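
“PDE 正则化”通常可理解为对插值得到的中间状态施加差分残差惩罚。下面以热方程 u_t = α·u_xx 为例给出有限差分残差的示意(Python;论文针对的微分算子与离散格式未必如此):

```python
def heat_residual(u, dx, dt, alpha):
    """在 (时间, 空间) 网格上计算 u_t = alpha * u_xx 的均方残差。

    时间用前向差分,空间用中心差分(仅内部点);
    作为正则项加入训练损失即可约束中间状态的物理一致性。
    """
    total, count = 0.0, 0
    for t in range(len(u) - 1):
        for x in range(1, len(u[0]) - 1):
            u_t = (u[t + 1][x] - u[t][x]) / dt
            u_xx = (u[t][x + 1] - 2 * u[t][x] + u[t][x - 1]) / dx ** 2
            total += (u_t - alpha * u_xx) ** 2
            count += 1
    return total / count
```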

[AI-32] Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理阶段能耗过高且缺乏针对异构硬件平台的能效优化指导的问题。其解决方案的关键在于构建并公开发布Watt Counts数据集——目前最大的开放获取LLM能耗数据集,包含超过5000次实验,覆盖50个LLM在10种NVIDIA GPU上的批处理和服务器部署场景下的能耗测量,并配套一个可复现、开源的基准测试工具,支持社区持续扩展数据。基于该数据集,研究揭示了GPU选型对能效影响显著且因模型与部署场景而异,证明了在异构LLM系统中进行硬件感知部署的重要性,并验证了通过合理选择硬件可在服务器场景下降低高达70%的能耗,同时在批处理场景下降低20%,且对用户体验影响可忽略。

链接: https://arxiv.org/abs/2604.09048
作者: Mauricio Fadel Argerich,Jonathan Fürst,Marta Patiño-Martínez
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:While the large energy consumption of Large Language Models (LLMs) is recognized by the community, system operators lack guidance for energy-efficient LLM inference deployments that leverage energy trade-offs of heterogeneous hardware due to a lack of energy-aware benchmarks and data. In this work we address this gap with Watt Counts: the largest open-access dataset of energy consumption of LLMs, with over 5,000 experiments for 50 LLMs across 10 NVIDIA Graphics Processing Units (GPUs) in batch and server scenarios along with a reproducible, open-source benchmark that enables community submissions to expand this dataset. Leveraging this dataset, we conduct a system-level study of LLM inference across heterogeneous GPU architectures and show that GPU selection is crucial for energy efficiency outcomes and that optimal hardware choices vary significantly across models and deployment scenarios, demonstrating the critical importance of hardware-aware deployment in heterogeneous LLM systems. Guided by our data and insights, we show that practitioners can reduce energy consumption by up to 70% in server scenarios with negligible impact on user experience, and by up to 20% in batch scenarios.
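
基于此类能耗数据做“硬件感知部署”的最小决策逻辑可示意如下(Python;GPU 名称与数值均为虚构示例,仅说明按每 token 能耗比较硬件的方式):

```python
def joules_per_token(power_watts, duration_s, tokens_generated):
    """单次基准运行中,平均每个生成 token 的能耗(焦耳)。"""
    return power_watts * duration_s / tokens_generated

def pick_gpu(runs):
    """在若干 (gpu_name, watts, seconds, tokens) 运行记录中,
    选出每 token 能耗最低的 GPU。"""
    return min(runs, key=lambda r: joules_per_token(r[1], r[2], r[3]))[0]
```

注意示例也体现了摘要的结论:功率更低的 GPU 即使耗时更长,整体能耗仍可能更优,最优选择取决于模型与场景。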

[AI-33] U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

【速读】:该论文旨在解决当前生成式天气预报模型(Generative Weather Forecasting)依赖复杂专用架构和高昂计算资源的问题,从而限制了其可及性和推广性。解决方案的关键在于提出一种基于标准U-Net结构的通用概率预报器U-Cast,采用简单有效的训练流程:先在平均绝对误差(Mean Absolute Error, MAE)上进行确定性预训练,再通过蒙特卡洛Dropout引入随机性,并在连续排名概率评分(Continuous Ranked Probability Score, CRPS)上进行短时概率微调。这一方法在保持与GenCast和IFS ENS等前沿模型相当甚至更优的概率技能的同时,显著降低了训练计算成本(减少10倍以上)和推理延迟(比扩散模型快10倍以上),验证了通用架构结合高效训练策略可实现高性能且低成本的气象预报建模。

链接: https://arxiv.org/abs/2604.09041
作者: Salva Rühling Cachay,Duncan Watson-Parris,Rose Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
备注: Our code is available at: this https URL

点击查看摘要

Abstract:AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce U-Cast, a probabilistic forecaster built on a standard U-Net backbone trained with a simple recipe: deterministic pre-training on Mean Absolute Error followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at 1.5 ^\circ\ resolution while reducing training compute by over 10 \times compared to leading CRPS-based models and inference latency by over 10 \times compared to diffusion-based models. U-Cast trains in under 12 H200 GPU-days and generates a 60-step ensemble forecast in 11 seconds. These results suggest that scalable, general-purpose architectures paired with efficient training curricula can match complex domain-specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community. Our code is available at: this https URL.
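
U-Cast 的概率微调目标 CRPS 在集合预报下常用能量形式的经验估计 CRPS ≈ E|X−y| − ½·E|X−X′|(X、X′ 为集合成员;以下为该通用估计量的实现,非论文专有代码):

```python
def crps_ensemble(samples, y):
    """单格点、单变量的经验 CRPS(能量形式)。

    samples: 集合成员预报值(例如由 MC Dropout 采样得到),y: 观测值。
    第一项衡量对观测的偏差,第二项奖励集合的合理离散度。
    """
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2
```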

[AI-34] Advantage-Guided Diffusion for Model-Based Reinforcement Learning

【速读】:该论文旨在解决基于模型的强化学习(Model-based Reinforcement Learning, MBRL)中,使用自回归世界模型时因误差累积导致性能下降的问题,以及现有扩散世界模型在短扩散时间窗下因仅依赖奖励或策略引导而产生短期主义(myopic)行为的问题。解决方案的关键在于提出优势引导的扩散方法(Advantage-Guided Diffusion for MBRL, AGD-MBRL),通过利用智能体的优势估计(advantage estimates)来指导逆向扩散过程,使采样集中于预期长期回报更高的轨迹,从而缓解短视问题。具体实现上设计了两种引导机制:Sigmoid优势引导(SAG)和指数优势引导(EAG),二者均能在标准假设下实现基于优势权重的重采样,提升策略价值,并可无缝集成至PolyGRAD类架构中,无需修改扩散训练目标,显著提升了样本效率与最终回报。

链接: https://arxiv.org/abs/2604.09035
作者: Daniele Foffano,Arvid Eriksson,David Broman,Karl H. Johansson,Alexandre Proutiere
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent’s advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage-implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
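
SAG/EAG 两种引导对应“权重随优势单调递增的重加权采样”。下面给出归一化权重的示意实现(Python;β 与归一化方式为笔者假设,论文中引导实际作用于逆向扩散过程,而非对已采样轨迹直接重采样):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def advantage_weights(advantages, beta=1.0, mode="sag"):
    """按优势估计 A 计算归一化重采样权重。

    'sag': w ∝ sigmoid(beta * A)(有界,对异常优势更稳健);
    'eag': w ∝ exp(beta * A)(指数倾斜,更激进地偏向高优势轨迹)。
    两者均随 A 单调递增,对应摘要中的策略改进性质。
    """
    if mode == "sag":
        raw = [sigmoid(beta * a) for a in advantages]
    else:
        raw = [math.exp(beta * a) for a in advantages]
    z = sum(raw)
    return [w / z for w in raw]
```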

[AI-35] Noise-Aware In-Context Learning for Hallucination Mitigation in ALLM s

【速读】:该论文旨在解决音频大语言模型(Auditory Large Language Models, ALLMs)在音频理解与推理任务中因幻觉(hallucination)问题导致的可靠性不足问题。现有评估方法多为二分类任务,难以刻画生成任务中复杂的幻觉模式;同时,当前缓解策略依赖微调(fine-tuning),计算成本高。其解决方案的关键在于提出一种即插即用的噪声感知上下文学习(Noise-Aware In-Context Learning, NAICL)方法:通过构建噪声先验库(noise prior library),检索与输入音频相关的噪声样本作为上下文先验,引导模型在声学证据不足时减少推测性关联并采用更保守的生成策略。实验表明,该方法可将整体幻觉率从26.53%降至16.98%,显著提升ALLMs的鲁棒性。

链接: https://arxiv.org/abs/2604.09021
作者: Qixuan Huang,Khalid Zaman,Masashi Unoki
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination issues. Existing hallucination evaluation methods are formulated as binary classification tasks, which are insufficient to characterize the more complex hallucination patterns that arise in generative tasks. Moreover, current hallucination mitigation strategies rely on fine-tuning, resulting in high computational costs. To address the above limitations, we propose a plug-and-play Noise-Aware In-Context Learning (NAICL) method. Specifically, we construct a noise prior library, retrieve noise examples relevant to the input audio, and incorporate them as contextual priors, thereby guiding the model to reduce speculative associations when acoustic evidence is insufficient and to adopt a more conservative generation strategy. In addition, we establish a hallucination benchmark for audio caption tasks including the construction of the Clotho-1K multi-event benchmark dataset, the definition of four types of auditory hallucinations, and the introduction of metrics such as hallucination type distribution to support fine-grained analysis. Experimental results show that all evaluated ALLMs exhibit same hallucination behaviors. Moreover, the proposed NAICL method reduces the overall hallucination rate from 26.53% to 16.98%.
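
“从噪声先验库中检索与输入音频相关的噪声样本”可用嵌入余弦相似度的 top-k 检索来示意(Python;库的结构与嵌入来源均为笔者假设):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def retrieve_noise_priors(query_emb, library, k=2):
    """返回与查询音频嵌入最相似的 k 个噪声样本名称。

    library: [(name, embedding)];检索结果作为上下文先验拼入提示,
    引导 ALLM 在声学证据不足时采用更保守的生成策略。
    """
    ranked = sorted(library, key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```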

[AI-36] Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection

【速读】:该论文旨在解决在开展网络犯罪分析时,如何在遵守《通用数据保护条例》(GDPR)和《1995年第10号有机法》等法规的前提下构建合规的数据集问题。其关键解决方案包括:(1)从Telegram平台采集文本、音频和图像信息;(2)采用信号增强技术结合语音转文字(Speech-to-Text)模型实现高精度音频转录,其中Parakeet模型表现最优;(3)设计基于Transformer架构的命名实体识别(Named Entity Recognition, NER)模型,并对比微软Presidio方案,结果表明自研NER方法在检测敏感信息时F1分数最高;(4)提出匿名化指标以量化数据结构一致性保留程度与个人隐私保护之间的平衡,从而支持符合法律框架的网络安全研究。

链接: https://arxiv.org/abs/2604.09016
作者: Carlos Jimeno Miguel,Raul Orduna,Francesco Zola
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study addresses the challenge of creating datasets for cybercrime analysis while complying with the requirements of regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, a system is proposed for collecting information from the Telegram platform, including text, audio, and images; the implementation of speech-to-text transcription models incorporating signal enhancement techniques; and the evaluation of different Named Entity Recognition (NER) solutions, including Microsoft Presidio and AI models designed using a transformer-based architecture. Experimental results indicate that Parakeet achieves the best performance in audio transcription, while the proposed NER solutions achieve the highest f1-score values in detecting sensitive information. In addition, anonymization metrics are presented that allow evaluation of the preservation of structural coherence in the data, while simultaneously guaranteeing the protection of personal information and supporting cybersecurity research within the current legal framework.
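
NER 检测出实体片段后,匿名化本身可简化为按区间替换占位符(Python 示意;从右向左替换以保持字符偏移有效,标签体系为笔者假设,非 Presidio 或论文实现):

```python
def anonymize(text, entities):
    """用 <LABEL> 占位符替换检测到的实体片段。

    entities: [(start, end, label)],按起始位置降序处理,
    使前面片段的偏移在替换过程中保持不变,同时保留文本结构一致性。
    """
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text
```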

[AI-37] Hypergraph Neural Networks Accelerate MUS Enumeration

【速读】:该论文旨在解决约束满足问题(Constraint Satisfaction Problems, CSPs)中最小不可满足子集(Minimal Unsatisfiable Subsets, MUSes)枚举的计算效率问题,尤其针对可满足性检查代价高昂时搜索空间指数级增长的挑战。其解决方案的关键在于提出一种领域无关的方法,利用超图神经网络(Hypergraph Neural Networks, HGNNs)构建动态超图结构——以约束为顶点、已枚举的MUSes为超边,并通过强化学习训练一个HGNN代理,在每次决策中选择最可能减少后续可满足性检查次数的约束进行评估,从而显著降低整体求解成本。实验表明,该方法能在相同可满足性检查预算下枚举更多MUSes,优于传统方法。

链接: https://arxiv.org/abs/2604.09001
作者: Hiroya Ijima,Koichiro Yawata
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Enumerating Minimal Unsatisfiable Subsets (MUSes) is a fundamental task in constraint satisfaction problems (CSPs). Its major challenge is the exponential growth of the search space, which becomes particularly severe when satisfiability checks are expensive. Recent machine learning approaches reduce this cost for Boolean satisfiability problems but rely on explicit variable-constraint relationships, limiting their application domains. This paper proposes a domain-agnostic method to accelerate MUS enumeration using Hypergraph Neural Networks (HGNNs). The proposed method incrementally builds a hypergraph with constraints as vertices and MUSes enumerated until the current step as hyperedges, and employs an HGNN-based agent trained via reinforcement learning to minimize the number of satisfiability checks required to obtain an MUS. Experimental results demonstrate the effectiveness of our approach in accelerating MUS enumeration, showing that our method can enumerate more MUSes within the same satisfiability check budget compared to conventional methods.
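
MUS 枚举的基本子过程是“删除式收缩”:逐一尝试删去约束,若剩余集合仍不可满足则该约束冗余。下面以“区间交为空”作为玩具不可满足性检查来示意,并统计 SAT 检查次数——这正是论文中 HGNN 代理希望通过选择检查顺序来减少的开销(Python,非论文实现):

```python
def is_sat(constraints):
    """玩具可满足性检查:区间 [lo, hi] 约束集可满足当且仅当交集非空。"""
    lo = max(c[0] for c in constraints)
    hi = min(c[1] for c in constraints)
    return lo <= hi

def deletion_based_mus(constraints):
    """把一个 UNSAT 约束集收缩为一个 MUS,并返回 SAT 检查次数。"""
    mus = list(constraints)
    checks = 0
    i = 0
    while i < len(mus):
        candidate = mus[:i] + mus[i + 1:]
        checks += 1
        if candidate and not is_sat(candidate):
            mus = candidate      # 约束 i 冗余,删去
        else:
            i += 1               # 约束 i 关键,保留
    return mus, checks
```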

[AI-38] SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体在任务执行中面临的两大核心瓶颈:静态工具集限制与情景记忆缺失(episodic amnesia),导致其无法跨任务积累经验或优化策略。解决方案的关键在于提出自进化智能体(Self-Evolving Agent, SEA)的新形式化定义,依托数字具身(digital embodiment)和持续的跨任务演化机制,并设计首个专门评估SEA特性的基准测试工具SEA-Eval,通过量化任务内执行可靠性与长期演化性能两个维度,揭示现有框架在token消耗效率和演化路径稳定性上的显著差异,从而为推动智能体从任务执行者向真正具备自我进化能力的数字实体演进提供科学依据。

链接: https://arxiv.org/abs/2604.08988
作者: Sihang Jiang,Lipeng Ma,Zhonghua Hong,Keyi Wang,Zhiyu Lu,Shisong Chen,Jinghao Zhang,Tianjun Pan,Weijia Zhou,Jiaqing Liang,Yanghua Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience or optimize strategies across task boundaries. While the Self-Evolving Agent (SEA) paradigm has been previously proposed, this paper contributes a new formal definition of SEA grounded in digital embodiment and continuous cross-task evolution, and introduces SEA-Eval, the first benchmark designed to evaluate SEA characteristics across two dimensions, intra-task execution reliability and long-term evolutionary performance. By organizing tasks into sequential streams and analyzing Success Rate and Token Consumption over time, SEA-Eval quantifies evolutionary gain and structural stability in ways that existing episodic benchmarks cannot. Empirical evaluations reveal a significant evolutionary bottleneck in current state-of-the-art frameworks, where identical success rates mask up to 31.2 times differences in token consumption and divergent evolutionary trajectories under sequential analysis. SEA-Eval provides a rigorous scientific foundation for advancing agents from mere task executors toward genuinely self-evolving digital entities.

[AI-39] PilotBench: A Benchmark for General Aviation Agents with Safety Constraints IJCNN2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在物理环境中执行安全关键任务时,能否基于文本训练数据可靠地进行复杂物理推理并遵守安全约束的问题。其解决方案的关键在于构建PilotBench基准测试平台,该平台包含708条真实通用航空飞行轨迹数据,涵盖九种操作阶段及同步的34通道遥测信息,系统性地评估LLMs在飞行轨迹与姿态预测中的表现;同时引入Pilot-Score复合指标,平衡回归精度(60%)与指令遵循度和安全性合规性(40%),从而揭示传统预测器与LLMs之间存在的“精度-可控性二分法”现象,并指出LLMs在高工作负载阶段性能显著下降,暗示其隐式物理模型存在脆弱性,最终推动混合架构的发展,即融合LLMs的符号推理能力与专用预测器的数值精度,以提升具身智能在安全约束场景下的可靠性。

链接: https://arxiv.org/abs/2604.08987
作者: Yalun Wu,Haotian Liu,Zhoujun Li,Boyang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the 2026 IEEE International Joint Conference on Neural Networks (IJCNN 2026). 6 pages, 7 figures

点击查看摘要

Abstract:As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability with 86–89% instruction-following at the cost of 11–14 MAE precision. Phase-stratified analysis further exposes a Dynamic Complexity Gap: LLM performance degrades sharply in high-workload phases such as Climb and Approach, suggesting brittle implicit physics models. These empirical discoveries motivate hybrid architectures combining LLMs' symbolic reasoning with specialized forecasters' numerical precision. PilotBench provides a rigorous foundation for advancing embodied AI in safety-constrained domains.
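摘要给出了 Pilot-Score 的权重构成(60% 回归精度 + 40% 指令遵循与安全合规),但未公开具体归一化方式。下面按该权重做一个假设性的最小示意(mae_worst 归一化常数与各子项的组合方式均为虚构):

```python
def pilot_score(mae, mae_worst, adherence, safety, w_reg=0.6, w_ctrl=0.4):
    """Pilot-Score 复合指标的假设性示意:按摘要的 60%/40% 权重组合。"""
    # 将 MAE 归一化为 [0,1] 的精度分数(MAE 越小越好)
    reg_score = max(0.0, 1.0 - mae / mae_worst)
    # 指令遵循与安全合规各占控制项的一半(假设)
    ctrl_score = 0.5 * adherence + 0.5 * safety
    return w_reg * reg_score + w_ctrl * ctrl_score

# 示例:传统预测器 MAE 低(7.01)但缺乏语义控制能力
s_forecaster = pilot_score(mae=7.01, mae_worst=50.0, adherence=0.0, safety=0.5)
# LLM:MAE 较高但指令遵循率约 88%
s_llm = pilot_score(mae=12.5, mae_worst=50.0, adherence=0.88, safety=0.9)
```

在这一组假设参数下,控制项权重使 LLM 的综合分反超低 MAE 的传统预测器,直观体现了"精度-可控性二分法"。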

[AI-40] Neighbourhood Transformer: Switchable Attention for Monophily-Aware Graph Learning

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在处理异质性图(heterophilic graphs)时性能严重下降的问题,其根源在于传统GNNs依赖于同质性假设(homophily assumption),即认为相连节点通常具有相似特征或标签,而这一假设在许多真实场景中不成立。解决方案的关键在于提出邻域变换器(Neighbourhood Transformers, NT),该方法摒弃了传统的消息传递机制,在每个局部邻域内应用自注意力机制(self-attention),从而天然具备对单质性(monophily)的感知能力,并在理论上保证其表达能力不低于传统消息传递框架。此外,为提升工程实用性,作者进一步设计了一种可切换注意力的邻域划分策略,显著降低空间和时间复杂度(分别减少超95%和最高92.67%),使NT能够高效扩展至大规模图数据。

链接: https://arxiv.org/abs/2604.08980
作者: Yi Luo,Xu Sun,Guangchun Luo,Aiguo Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have been widely adopted in engineering applications such as social network analysis, chemical research and computer vision. However, their efficacy is severely compromised by the inherent homophily assumption, which fails to hold for heterophilic graphs where dissimilar nodes are frequently connected. To address this fundamental limitation in graph learning, we first draw inspiration from the recently discovered monophily property of real-world graphs, and propose Neighbourhood Transformers (NT), a novel paradigm that applies self-attention within every local neighbourhood instead of aggregating messages to the central node as in conventional message-passing GNNs. This design makes NT inherently monophily-aware and theoretically guarantees its expressiveness is no weaker than traditional message-passing frameworks. For practical engineering deployment, we further develop a neighbourhood partitioning strategy equipped with switchable attentions, which reduces the space consumption of NT by over 95% and time consumption by up to 92.67%, significantly expanding its applicability to larger graphs. Extensive experiments on 10 real-world datasets (5 heterophilic and 5 homophilic graphs) show that NT outperforms all current state-of-the-art methods on node classification tasks, demonstrating its superior performance and cross-domain adaptability. The full implementation code of this work is publicly available at this https URL to facilitate reproducibility and industrial adoption.
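与向中心节点聚合消息不同,NT 在每个局部邻域内部做 self-attention。下面用 NumPy 写一个单头、无可学习参数的最小示意(省略论文中的可切换注意力与邻域划分策略,非官方实现):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def neighbourhood_attention(X, neighbours):
    """对每个节点的邻域内部做 self-attention 的示意实现。"""
    d = X.shape[1]
    out = np.zeros_like(X)
    for v, nbrs in neighbours.items():
        H = X[nbrs]                        # 邻域内所有节点的特征
        A = softmax(H @ H.T / np.sqrt(d))  # 邻域内部的注意力权重
        Z = A @ H                          # 注意力加权后的邻域表示
        out[v] = Z.mean(axis=0)            # 汇总为中心节点的新表示
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
nbhd = {0: [0, 1, 2], 1: [0, 1], 2: [0, 2, 3], 3: [2, 3, 4], 4: [3, 4]}
H_new = neighbourhood_attention(X, nbhd)
```

实际模型中 Q/K/V 均带可学习投影并有多头机制,此处直接用原始特征,只为突出"邻域内注意力"的计算形态。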

[AI-41] WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

【速读】:该论文旨在解决机器人强化学习(Reinforcement Learning, RL)中因数据收集成本高和风险大而导致的样本效率低的问题,特别是如何有效利用源任务的先验数据来提升目标任务的学习性能。其核心挑战在于传统离线到在线RL方法通常假设已有固定数据集,而未考虑如何生成高质量、可靠的先验数据用于迁移。解决方案的关键在于提出一种基于世界模型的经验迁移框架(World Model-based Experience Transfer, WOMBET),该框架通过在源任务中学习世界模型,并采用不确定性惩罚规划生成离线数据,再结合高回报与低认知不确定性筛选轨迹;随后在目标任务中通过自适应采样策略融合离线与在线数据进行微调,实现从先验驱动初始化到任务特异性适应的稳定过渡。该方法不仅提供了不确定性惩罚目标函数对真实回报的下界保证,还通过有限样本误差分解揭示了分布偏移与近似误差的影响机制,显著提升了连续控制基准任务中的样本效率与最终性能。

链接: https://arxiv.org/abs/2604.08958
作者: Mintae Kim,Koushil Sreenath
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 13 pages, 6 figures, 8th Annual Learning for Dynamics Control Conference (L4DC)

点击查看摘要

Abstract:Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
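摘要中的"不确定性惩罚规划"可以理解为:用世界模型集成对预测的分歧近似认知不确定性,并从回报中扣除,从而构成真实回报的下界。下面是一个假设性的最小示意(惩罚系数 beta 与分歧度量均为虚构选择,非论文官方实现):

```python
import numpy as np

def penalized_return(rewards, ensemble_preds, beta=1.0):
    """不确定性惩罚回报的示意:集成预测的标准差近似认知不确定性。"""
    # ensemble_preds: (n_models, horizon, state_dim) 的下一状态预测
    epistemic = ensemble_preds.std(axis=0).mean(axis=-1)  # 每步的集成分歧
    return float(np.sum(rewards - beta * epistemic))

rng = np.random.default_rng(1)
rewards = np.ones(10)
preds = rng.normal(size=(5, 10, 3)) * 0.1  # 5 个模型、10 步、3 维状态
j_pen = penalized_return(rewards, preds, beta=1.0)
```

分歧越大惩罚越重,因此惩罚后的回报永远不超过未惩罚回报,与摘要所述"下界"性质一致。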

[AI-42] StaRPO: Stability-Augmented Reinforcement Policy Optimization

【速读】:该论文旨在解决当前强化学习(Reinforcement Learning, RL)在提升大语言模型(Large Language Models, LLMs)复杂推理任务准确性时,因仅依赖最终答案正确性作为反馈信号而导致的逻辑不一致、结构混乱或冗余的问题。现有方法未能有效捕捉推理过程中的内部逻辑结构,从而限制了模型生成内容的稳定性与合理性。解决方案的关键在于提出StaRPO框架,通过将推理稳定性显式引入优化目标,具体采用两个轻量级可计算指标:自相关函数(Autocorrelation Function, ACF)用于评估局部步骤间的连贯性,路径效率(Path Efficiency, PE)用于衡量全局目标导向性;这两类稳定性奖励与任务奖励融合,提供互补且过程感知的反馈机制,从而在多个推理基准上显著提升最终答案准确率和逻辑稳定性。

链接: https://arxiv.org/abs/2604.08905
作者: Jinghan Zhang,Fengran Mo,Tharindu Cyril Weerasooriya,Ruimin Dai,Xiaoyan Han,Yanjie Fu,Dakuo Wang,Kunpeng Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) is effective in enhancing the accuracy of large language models in complex reasoning tasks. Existing RL policy optimization frameworks rely on final-answer correctness as feedback signals and rarely capture the internal logical structure of the reasoning process. Consequently, the models would generate responses that are fluent and semantically relevant but logically inconsistent, structurally erratic, or redundant. To this end, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. Our StaRPO decomposes stability into two computable lightweight metrics: the Autocorrelation Function (ACF) to evaluate local step-to-step coherence, and Path Efficiency (PE) to evaluate global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary and process-aware feedback. We validate the effectiveness of using ACF and PE rewards by showing their correlation with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms compared baselines and can enhance both final-answer accuracy and logical stability.
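摘要将稳定性分解为 ACF(局部连贯)与 PE(全局目标导向)两个轻量指标,但未给出公开公式。下面以推理步的向量表示为输入给出一个最小示意,其中两个指标的具体定义完全是假设:

```python
import numpy as np

def acf_coherence(step_embs, lag=1):
    """局部连贯性示意:相邻推理步表示的余弦相似度均值(假设性定义)。"""
    a, b = step_embs[:-lag], step_embs[lag:]
    sims = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    return float(sims.mean())

def path_efficiency(step_embs):
    """全局目标导向性示意:首末位移与路径总长之比,越接近 1 越少绕路。"""
    net = np.linalg.norm(step_embs[-1] - step_embs[0])
    total = np.linalg.norm(np.diff(step_embs, axis=0), axis=-1).sum()
    return float(net / (total + 1e-8))

# 一条"直线"推理轨迹:两个指标都应接近 1
line = np.linspace(1.0, 2.0, 6)[:, None] * np.ones((1, 4))
pe = path_efficiency(line)
```

绕路或来回震荡的轨迹会同时拉低 PE 与 ACF,可作为过程感知的奖励信号与任务奖励叠加。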

[AI-43] A Closer Look at the Application of Causal Inference in Graph Representation Learning

【速读】:该论文旨在解决图表示学习中因果关系建模的挑战,特别是现有方法因将复杂的图结构数据聚合为单一因果变量而导致违反因果推断基本假设的问题。其解决方案的关键在于提出一个基于图数据最小不可分单元的理论模型,从而保障因果有效性;在此基础上进一步分析精确因果建模的成本,并识别可简化问题的条件,最终设计了一个可无缝集成到现有图学习流程中的因果建模增强模块,通过可控合成数据和大量实验验证了理论的有效性。

链接: https://arxiv.org/abs/2604.08890
作者: Hang Gao,Kunyu Li,Huang Hong,Baoquan Cui,Fengge Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modeling causal relationships in graph representation learning remains a fundamental challenge. Existing approaches often draw on theories and methods from causal inference to identify causal subgraphs or mitigate confounders. However, due to the inherent complexity of graph-structured data, these approaches frequently aggregate diverse graph elements into single causal variables, an operation that risks violating the core assumptions of causal inference. In this work, we prove that such aggregation compromises causal validity. Building on this conclusion, we propose a theoretical model grounded in the smallest indivisible units of graph data to ensure that the causal validity is guaranteed. With this model, we further analyze the costs of achieving precise causal modeling in graph representation learning and identify the conditions under which the problem can be simplified. To empirically support our theory, we construct a controllable synthetic dataset that reflects realworld causal structures and conduct extensive experiments for validation. Finally, we develop a causal modeling enhancement module that can be seamlessly integrated into existing graph learning pipelines, and we demonstrate its effectiveness through comprehensive comparative experiments.

[AI-44] HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation

【速读】:该论文旨在解决复杂城市环境中空中视觉-语言导航(Aerial Vision-and-Language Navigation, Aerial VLN)面临的三大挑战:对未见场景的泛化能力不足、长距离路径规划性能不佳以及对空间连续性的理解欠缺。解决方案的关键在于提出HTNav框架,该框架采用混合模仿学习(Imitation Learning, IL)与强化学习(Reinforcement Learning, RL)的协同机制,通过分阶段训练保障基础导航策略的稳定性并提升环境探索能力;同时引入分层决策机制,实现宏观路径规划与细粒度动作控制之间的协同交互,并结合地图表示学习模块增强模型对开放域中空间连续性的理解,从而显著提升导航精度与鲁棒性。

链接: https://arxiv.org/abs/2604.08883
作者: Chengjie Fan,Cong Pan,Zijian Liu,Ningzhong Liu,Jie Qin
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has attracted widespread attention, owing to its significant practical value in applications such as logistics delivery and urban inspection. However, existing methods face several challenges in complex urban environments, including insufficient generalization to unseen scenes, suboptimal performance in long-range path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that integrates Imitation Learning (IL) and Reinforcement Learning (RL) within a hybrid IL-RL framework. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative interaction between macro-level path planning and fine-grained action control. Furthermore, a map representation learning module is introduced to deepen its understanding of spatial continuity in open domains. On the CityNav benchmark, our method achieves state-of-the-art performance across all scene levels and task difficulties. Experimental results demonstrate that this framework significantly improves navigation precision and robustness in complex urban environments.

[AI-45] A Mathematical Framework for Temporal Modeling and Counterfactual Policy Simulation of Student Dropout

【速读】:该论文旨在解决高等教育中学生辍学(dropout)的早期预测与干预策略评估问题,核心挑战在于如何从学习管理系统(LMS)行为数据和行政退学记录中构建具有可解释性的时序建模框架。解决方案的关键在于提出了一种包含反事实政策模拟层(counterfactual policy-simulation layer)的时间建模框架:首先在个体-时间段层面使用惩罚性、类别平衡的逻辑回归建模每周辍学风险,实现高区分度(测试集AUC=0.8405);其次通过设定明确触发/时间表契约的场景索引策略层,量化不同干预情景下的生存曲线差异(ΔS(T)),从而在观察数据约束下实现内部结构化情境比较,尽管结果未被因果识别,但验证了框架对机制敏感性和政策效应的模拟能力。

链接: https://arxiv.org/abs/2604.08874
作者: Rafael da Silva,Jeff Eicher,Gregory Longo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Approx. 20 pages, 9 figures. Code and reproducibility package available at this https URL This work introduces a temporal survival framework with counterfactual policy simulation

点击查看摘要

Abstract:This study proposes a temporal modeling framework with a counterfactual policy-simulation layer for student dropout in higher education, using LMS engagement data and administrative withdrawal records. Dropout is operationalized as a time-to-event outcome at the enrollment level; weekly risk is modeled in discrete time via penalized, class-balanced logistic regression over person–period rows. Under a late-event temporal holdout, the model attains row-level AUCs of 0.8350 (train) and 0.8405 (test), with aggregate calibration acceptable but sparsely supported in the highest-risk bins. Ablation analyses indicate performance is sensitive to feature set composition, underscoring the role of temporal engagement signals. A scenario-indexed policy layer produces survival contrasts ΔS(T) under an explicit trigger/schedule contract: positive contrasts are confined to the shock branch (T_policy = 18: 0.0102, 0.0260, 0.0819), while the mechanism-aware branch is negative (ΔS_mech(18) = −0.0078, ΔS_mech(38) = −0.0134). A subgroup analysis by gender quantifies scenario-induced survival gaps via bootstrap; contrasts are directionally stable but small. Results are not causally identified; they demonstrate the framework's capacity for internal structural scenario comparison under observational data constraints.
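摘要中的"person–period 行"指把每条注册记录展开为逐周观测行、事件发生周标记为 1 的离散时间生存建模方式。下面是一个字段名为假设的最小示意(展开后的行可直接套用任意现成的惩罚逻辑回归实现):

```python
# 人-时段(person-period)展开的最小示意
def person_period_rows(enrollments):
    """enrollments: {student_id: (观察周数, 是否退学)}(字段为假设)。"""
    rows = []
    for sid, (weeks_observed, dropped) in enrollments.items():
        for week in range(1, weeks_observed + 1):
            # 只有退学学生的最后一个观察周记为事件
            event = 1 if (dropped and week == weeks_observed) else 0
            rows.append({"student": sid, "week": week, "event": event})
    return rows

# 学生 A 在第 3 周退学;学生 B 观察 4 周未退学(删失)
rows = person_period_rows({"A": (3, True), "B": (4, False)})
```

展开后,每行的 event 即逐周离散风险的二分类标签,删失个体自然地只贡献 0 标签行。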

[AI-46] Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations

【速读】:该论文旨在解决学习分析(Learning Analytics)领域中学生辍学预测模型评估缺乏一致性与可比性的问题,尤其针对现有研究常在异质协议下比较模型,导致对时序可解释性和校准性的忽视。其解决方案的关键在于构建一个以生存分析为导向的标准化基准(survival-oriented benchmark),基于Open University Learning Analytics Dataset (OULAD) 设计两个统一规范的对比臂:一个是动态每周更新的“person-period”表示模型臂,另一个是连续时间建模的扩展模型族(包括树基生存模型、参数模型和神经网络模型)。通过整合预测性能、消融分析、可解释性和校准四个维度的评估,该研究揭示了行为时序特征才是辍学风险的主要预测信号,而非静态人口统计或结构属性,从而为学习分析中的辍学建模提供了更可靠、多维且具有方向指导意义的方法论框架。

链接: https://arxiv.org/abs/2604.08870
作者: Rafael da Silva,Jeff Eicher,Gregory Longo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 34 pages, 14 figures, 18 tables. Includes appendix with reliability diagrams, sensitivity analyses, and dataset audit tables

点击查看摘要

Abstract:Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heterogeneous protocols, prioritizing discrimination over temporal interpretability and calibration. This study introduces a survival-oriented benchmark for temporal dropout risk modelling using the Open University Learning Analytics Dataset (OULAD). Two harmonized arms are compared: a dynamic weekly arm, with models in person-period representation, and a comparable continuous-time arm, with an expanded roster of families – tree-based survival, parametric, and neural models. The evaluation protocol integrates four analytical layers: predictive performance, ablation, explainability, and calibration. Results are reported within each arm separately, as a single cross-arm ranking is not methodologically warranted. Within the comparable arm, Random Survival Forest leads in discrimination and horizon-specific Brier scores; within the dynamic arm, Poisson Piecewise-Exponential leads narrowly on integrated Brier score within a tight five-family cluster. No-refit bootstrap sampling variability qualifies these positions as directional signals rather than absolute superiority. Ablation and explainability analyses converged, across all families, on a shared finding: the dominant predictive signal was not primarily demographic or structural, but temporal and behavioral. Calibration corroborated this pattern in the better-discriminating models, with the exception of XGBoost AFT, which exhibited systematic bias. These results support the value of a harmonized, multi-dimensional benchmark in Learning Analytics and situate dropout risk as a temporal-behavioral process rather than a function of static background attributes.
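摘要多次以 horizon-specific Brier 分数衡量各模型族的校准与判别。下面给出其无删失简化形态的最小示意(OULAD 的实际评测需配合 IPCW 权重处理删失,此处仅演示指标形态):

```python
# 给定时间点 horizon 上 Brier 分数的无删失简化示意
def brier_at_horizon(surv_probs, event_times, horizon):
    # surv_probs[i]: 模型预测个体 i 存活超过 horizon 的概率 S_i(horizon)
    # 真实标签:event_times[i] > horizon 时为 1(仍在读),否则为 0(已退学)
    errs = [(s - (1.0 if t > horizon else 0.0)) ** 2
            for s, t in zip(surv_probs, event_times)]
    return sum(errs) / len(errs)

bs = brier_at_horizon([0.9, 0.2, 0.7], [30, 5, 12], horizon=10)
```

对多个 horizon 取加权平均即可得到摘要中用于排名的 integrated Brier score 的近似形态。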

[AI-47] AudioGuard: Toward Comprehensive Audio Safety Protection Across Diverse Threat Models

【速读】:该论文旨在解决音频系统中由音频原生有害声事件(audio-native harmful sound events)、说话人属性(如儿童声音)、语音伪造/克隆滥用以及语音内容组合性危害(如儿童声音与色情内容结合)等带来的安全风险问题。由于音频的特性使得传统基于文本的安全机制难以覆盖此类风险,因此亟需构建专门针对音频的全面评估基准和防护机制。解决方案的关键在于提出AudioSafetyBench——首个基于政策的多威胁模型音频安全基准,涵盖多种语言、可疑语音(如名人模仿或儿童声音)、高风险语音-内容组合及非语音声事件;并设计统一的防护框架AudioGuard,包含两个核心组件:SoundGuard用于波形级音频原生检测,ContentGuard用于基于政策的语义层面保护,从而在多个基准上实现比强基线模型更高的准确率且显著降低延迟。

链接: https://arxiv.org/abs/2604.08867
作者: Mintong Kang,Chen Fang,Bo Li
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Audio has rapidly become a primary interface for foundation models, powering real-time voice assistants. Ensuring safety in audio systems is inherently more complex than just “unsafe text spoken aloud”: real-world risks can hinge on audio-native harmful sound events, speaker attributes (e.g., child voice), impersonation/voice-cloning misuse, and voice-content compositional harms, such as child voice plus sexual content. The nature of audio makes it challenging to develop comprehensive benchmarks or guardrails against this unique risk landscape. To close this gap, we conduct large-scale red teaming on audio systems, systematically uncover vulnerabilities in audio, and develop a comprehensive, policy-grounded audio risk taxonomy and AudioSafetyBench, the first policy-based audio safety benchmark across diverse threat models. AudioSafetyBench supports diverse languages, suspicious voices (e.g., celebrity/impersonation and child voice), risky voice-content combinations, and non-speech sound events. To defend against these threats, we propose AudioGuard, a unified guardrail consisting of 1) SoundGuard for waveform-level audio-native detection and 2) ContentGuard for policy-grounded semantic protection. Extensive experiments on AudioSafetyBench and four complementary benchmarks show that AudioGuard consistently improves guardrail accuracy over strong audio-LLM-based baselines with substantially lower latency.

[AI-48] SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks ACL2026

【速读】:该论文旨在解决标准分词级近端策略优化(Proximal Policy Optimization, PPO)在对大型语言模型(Large Language Models, LLMs)进行推理任务对齐时面临的两大问题:一是长链式思维(Chain-of-Thought, CoT)路径上的时间信用分配不稳定,二是价值模型(value model)带来的高昂内存开销。为应对这些问题,作者提出了一种可扩展的序列级PPO(Sequence-Level PPO, SPPO)算法,其核心创新在于将推理过程建模为序列级上下文Bandit问题,并引入解耦的标量值函数来直接计算低方差的优势信号,从而避免了多样本基线估计所需的高计算开销,实现了在保持PPO样本效率的同时提升训练稳定性与资源利用率。

链接: https://arxiv.org/abs/2604.08865
作者: Tianyi Wang,Yixia Li,Long Li,Yibiao Chen,Shaohan Huang,Yun Chen,Peng Li,Yang Liu,Guanhua Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Main

点击查看摘要

Abstract:Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
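SPPO 把整条推理序列视为上下文 Bandit 中的单个动作:优势 = 序列级回报减去标量价值基线,再广播到序列内所有 token 供 PPO 更新。下面是一个示意性写法(价值此处作为给定常数传入,非论文官方实现):

```python
# 序列级优势估计的最小示意
def sequence_level_advantages(rewards, values, seq_lens):
    all_adv = []
    for r, v, n in zip(rewards, values, seq_lens):
        adv = r - v                  # 序列级低方差优势信号,无需多次采样
        all_adv.append([adv] * n)    # 广播到该序列的所有 token
    return all_adv

# 两条序列:回报 1.0 与 0.0,共享标量价值基线 0.4
adv = sequence_level_advantages(rewards=[1.0, 0.0],
                                values=[0.4, 0.4],
                                seq_lens=[3, 2])
```

与 GRPO 类组方法相比,这里的基线来自解耦的标量价值函数,因此每个 prompt 只需一条采样即可得到低方差优势。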

[AI-49] Hidden in Plain Sight: Visual-to-Symbolic Analytical Solution Inference from Field Visualizations

【速读】:该论文旨在解决从视觉观测中恢复物理场解析解的难题,即如何基于场的可视化图像(含一阶导数)及少量辅助元数据,自动推导出可执行的符号表达式(如SymPy格式),从而实现AI驱动的科学推理能力。其核心解决方案是提出ViSA-R2模型,并构建一个以解为中心、自验证的链式思维(chain-of-thought)流程:首先进行结构模式识别,进而假设解族(ansatz),再推导参数并验证一致性,形成类似物理学家的推理路径。此方法显著提升了生成式AI在复杂物理场景下符号解推理的准确性与可靠性。

链接: https://arxiv.org/abs/2604.08863
作者: Pengze Li,Jiaquan Zhang,Yunbo Long,Xinping Liu,Zhou wenjie,Encheng Su,Zihang Zeng,Jiaqi Liu,Jiyao Liu,Junchi Yu,Lihao Liu,Philip Torr,Shixiang Tang,Aoran Wang,Xi Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recovering analytical solutions of physical fields from visual observations is a fundamental yet underexplored capability for AI-assisted scientific reasoning. We study visual-to-symbolic analytical solution inference (ViSA) for two-dimensional linear steady-state fields: given field visualizations (and first-order derivatives) plus minimal auxiliary metadata, the model must output a single executable SymPy expression with fully instantiated numeric constants. We introduce ViSA-R2 and align it with a self-verifying, solution-centric chain-of-thought pipeline that follows a physicist-like pathway: structural pattern recognition → solution-family (ansatz) hypothesis → parameter derivation → consistency verification. We also release ViSA-Bench, a VLM-ready synthetic benchmark covering 30 linear steady-state scenarios with verifiable analytical/symbolic annotations, and evaluate predictions by numerical accuracy, expression-structure similarity, and character-level accuracy. Using an 8B open-weight Qwen3-VL backbone, ViSA-R2 outperforms strong open-source baselines and the evaluated closed-source frontier VLMs under a standardized protocol.

[AI-50] Building Better Environments for Autonomous Cyber Defence

【速读】:该论文旨在解决当前自主网络防御(Autonomous Cyber Defence, ACD)领域中,缺乏系统性指导来构建高质量强化学习(Reinforcement Learning, RL)环境的问题。尽管已有大量文献探讨RL在ACD中的应用,但实践中积累的工程经验、领域知识及常见陷阱尚未被整合到统一资源中。解决方案的关键在于提出两个核心贡献:一是构建一个用于分解RL网络防御环境与真实系统之间接口的框架,以提升环境的真实性与可扩展性;二是基于研讨会成果提炼出当前最佳实践指南,涵盖RL驱动的ACD环境开发与智能体评估方法,从而推动更可靠、可复现的自主防御系统研究与部署。

链接: https://arxiv.org/abs/2604.08805
作者: Chris Hicks,Elizabeth Bates,Shae McFadden,Isaac Symes Thompson,Myles Foley,Ed Chapman,Nickolas Espinosa Dice,Ankita Samaddar,Joshua Sylvester,Himanshu Neema,Nicholas Butts,Nate Foster,Ahmad Ridley,Zoe M,Paul Jones
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In November 2025, the authors ran a workshop on the topic of what makes a good reinforcement learning (RL) environment for autonomous cyber defence (ACD). This paper details the knowledge shared by participants both during the workshop and shortly afterwards by contributing herein. The workshop participants come from academia, industry, and government, and have extensive hands-on experience designing and working with RL and cyber environments. While there is now a sizeable body of literature describing work in RL for ACD, there is nevertheless a great deal of tradecraft, domain knowledge, and common hazards which are not detailed comprehensively in a single resource. With a specific focus on building better environments to train and evaluate autonomous RL agents in network defence scenarios, including government and critical infrastructure networks, the contributions of this work are twofold: (1) a framework for decomposing the interface between RL cyber environments and real systems, and (2) guidelines on current best practice for RL-based ACD environment development and agent evaluation, based on the key findings from our workshop.

[AI-51] Scrapyard AI

【速读】:该论文试图解决在资源受限条件下对大型人工智能(AI)模型进行高效、低成本实验的问题。其核心挑战在于,当前持续追求更强大AI系统的趋势导致大量性能卓越但已被淘汰的模型被弃置,形成“AI废品场”,而这些模型往往未被充分利用。解决方案的关键在于将这些“废弃模型”视为可再配置的资源,通过重构与再利用,实现低成本探索和创新应用;文中以项目Nudge-x为例,展示了如何操纵遗留AI模型来分析全球采矿活动对地貌与人类生活的影响,从而为人类与AI共同的历史认知提供新的交互平台。

链接: https://arxiv.org/abs/2604.08803
作者: Marc Böhlen,Sai Krishna
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures, XcoAx 2026 pre-publication

点击查看摘要

Abstract:This paper considers AI model churn as an opportunity for frugal investigation of large AI models. It describes how the incessant push for ever more powerful AI systems leaves in its wake a collection of obsolete yet powerful AI models, discarded in a veritable scrapyard of AI production. This scrapyard offers a potent opportunity for resource-constrained experimentation into AI systems. As in the physical scrapyard, nothing ever truly disappears in the AI scrapyard, it is just waiting to be reconfigured into something else. Project Nudge-x is an example of what can emerge from the AI scrapyard. Nudge-x seeks to manipulate legacy AI models to describe how mining sites across the planet are impacting landscapes and lives. By sharing this collection of brutal landscape interventions with people and AI systems alike, Nudge-x creates a venue for the appreciation of a history sadly shared between AI and people.

[AI-52] Bandit: Kernel-Driven Reinforcement Learning for Adaptive Video Streaming

【速读】:该论文旨在解决用户空间自适应码率(Adaptive Bitrate, ABR)算法因无法感知传输层关键信号(如最小往返时间 RTT 和瞬时吞吐率)而导致的响应滞后问题,即网络变化已影响播放缓冲区后才作出调整。其解决方案的关键在于将网络监控与ABR算法选择机制迁移至Linux内核中,利用eBPF技术实现轻量级epsilon-greedy多臂赌博机(Multi-Armed Bandit, MAB)模型,该模型在sockops程序中运行,基于实时TCP指标计算奖励函数,动态评估三种ABR启发式策略并选择最优方案,从而显著提升用户体验质量(QoE)。

链接: https://arxiv.org/abs/2604.08791
作者: Mahdi Alizadeh
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:User-space Adaptive Bitrate (ABR) algorithms cannot see the transport layer signals that matter most, such as minimum RTT and instantaneous delivery rate, and they respond to network changes only after damage has already propagated to the playout buffer. We present eBandit, a framework that relocates both network monitoring and ABR algorithm selection into the Linux kernel using eBPF. A lightweight epsilon-greedy Multi-Armed Bandit (MAB) runs inside a sockops program, evaluating three ABR heuristics against a reward derived from live TCP metrics. On an adversarial synthetic trace eBandit achieves 416.3 ± 4.9 cumulative QoE, outperforming the best static heuristic by 7.2%. On 42 real-world sessions eBandit achieves a mean QoE per chunk of 1.241, the highest across all policies, demonstrating that kernel-resident bandit learning transfers to heterogeneous mobile conditions.
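摘要中的核心决策逻辑是一个 epsilon-greedy 多臂赌博机,在三种 ABR 启发式之间选择。下面是一个用户态 Python 示意(臂名与奖励模拟均为假设;论文的真实实现以 eBPF 运行在内核 sockops 中,奖励来自实时 TCP 指标):

```python
import random

class EpsilonGreedyABR:
    """在三种 ABR 启发式之间做 epsilon-greedy 选择的最小示意。"""
    def __init__(self, arms=("buffer", "rate", "hybrid"), eps=0.1, seed=0):
        self.arms, self.eps = list(arms), eps
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.eps:
            return self.rng.choice(self.arms)                 # 探索
        return max(self.arms, key=lambda a: self.values[a])   # 利用

    def update(self, arm, reward):
        self.counts[arm] += 1
        # 增量均值更新,无需保存历史奖励
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyABR()
for _ in range(200):
    arm = bandit.select()
    # 模拟奖励:假设 "hybrid" 启发式的平均 QoE 最高
    reward = {"buffer": 0.5, "rate": 0.6, "hybrid": 0.9}[arm] \
        + bandit.rng.uniform(-0.1, 0.1)
    bandit.update(arm, reward)
best = max(bandit.arms, key=lambda a: bandit.values[a])
```

增量均值更新只需常数内存,这正是该逻辑得以塞进 eBPF 程序的前提之一。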

[AI-53] Artifacts as Memory Beyond the Agent Boundary

【速读】:该论文试图解决的问题是:如何在强化学习(Reinforcement Learning, RL)框架下形式化地描述环境如何作为智能体的外部记忆资源,从而降低对内部状态记忆的需求。其解决方案的关键在于提出了一种数学框架,证明某些特定观测(称为“artifacts”)能够减少表示历史信息所需的内存空间;实验进一步表明,当智能体观察到空间路径这类结构化环境信息时,学习高性能策略所需的记忆量显著降低——这一效应虽非设计初衷,但通过感官输入自然产生。该发现为将环境视为外部记忆提供了理论支撑,并满足了此前用于解释外部记忆机制的定性条件。

链接: https://arxiv.org/abs/2604.08756
作者: John D. Martin,Fraser Mince,Esra’a Saleh,Amy Pajak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The situated view of cognition holds that intelligent behavior depends not only on internal memory, but on an agent’s active use of environmental resources. Here, we begin formalizing this intuition within Reinforcement Learning (RL). We introduce a mathematical framing for how the environment can functionally serve as an agent’s memory, and prove that certain observations, which we call artifacts, can reduce the information needed to represent history. We corroborate our theory with experiments showing that when agents observe spatial paths, the amount of memory required to learn a performant policy is reduced. Interestingly, this effect arises unintentionally, and implicitly through the agent’s sensory stream. We discuss the implications of our findings, and show they satisfy qualitative properties previously used to ground accounts of external memory. Moving forward, we anticipate further work on this subject could reveal principled ways to exploit the environment as a substitute for explicit internal memory.

[AI-54] Demystifying the Silence of Correctness Bugs in PyTorch Compiler

【速读】:该论文旨在解决PyTorch编译器(torch.compile)中存在的正确性错误(correctness bugs)问题,这类错误会导致深度学习模型(包括大语言模型LLMs)在编译后产生不正确的输出,但不会引发异常、崩溃或警告,从而严重威胁下游应用的可靠性。现有研究表明,此类错误占高优先级问题的19.2%,仅次于程序崩溃。针对这一问题,作者提出了一种名为AlignGuard的新型测试技术,其关键在于通过实证研究提炼出正确性错误的特征,并结合基于生成式AI(Generative AI)的测试用例变异机制,在已有测试案例基础上进行针对性扩展,从而有效检测此前难以发现的正确性漏洞。实验表明,AlignGuard已在最新版本中成功识别出23个新缺陷,其中14个被标记为高优先级,验证了该方法的有效性和实用性。

链接: https://arxiv.org/abs/2604.08720
作者: Meiziniu Li,Dongze Li,Jianmeng Liu,Shing-Chi Cheung
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Performance optimization of AI infrastructure is key to the fast adoption of large language models (LLMs). The PyTorch compiler (torch.compile), a core optimization tool for deep learning (DL) models (including LLMs), has received due attention. However, torch.compile is prone to correctness bugs, which cause incorrect outputs of compiled DL models without triggering exceptions, crashes, or warnings. These bugs pose a serious threat to the reliability of downstream LLM applications. Data from the PyTorch community shows that 19.2% of high-priority issues are incorrect outputs of compiled DL models induced by torch.compile bugs, the second-most-common bug category (only behind program crashes at 19.57%). However, no systematic study has been conducted to specifically characterize and thereby detect these bugs. In this paper, we present the first empirical study of the correctness bugs in torch.compile, examine their characteristics, and assess the effectiveness of existing fuzzers in detecting them. Based on our findings, we propose a proof-of-concept testing technique named AlignGuard, tailored specifically for detecting correctness bugs in torch.compile. AlignGuard incorporates bug characteristics distilled from our empirical study, applying LLM-based test mutation to existing test cases for correctness bug detection. At the time of writing, AlignGuard has successfully detected 23 new correctness bugs in recent torch.compile versions. All these bugs have been confirmed or fixed by the PyTorch development team, and over half (14/23) of them are even marked as high-priority bugs, underscoring the usefulness of our technique.

[AI-55] Model Space Reasoning as Search in Feedback Space for Planning Domain Generation ICLR2026

【速读】:该论文旨在解决从自然语言描述中自动生成可实际部署的规划领域(planning domains)这一开放性问题,尽管大型语言模型(Large Language Models, LLMs)在辅助生成方面展现出潜力,但其输出质量仍不足以满足实践需求。解决方案的关键在于引入一种基于代理型语言模型(agentic language model)的反馈框架,通过少量符号信息(symbolic information)增强原始自然语言输入,包括地标(landmarks)和VAL规划验证器输出等反馈机制,并利用启发式搜索在模型空间中优化领域质量,从而显著提升生成规划域的可用性和准确性。

链接: https://arxiv.org/abs/2604.08712
作者: James Oswald,Daniel Oblinsky,Volodymyr Varha,Vasilije Dragovic,Harsha Kokel,Kavitha Srinivas,Michael Katz,Shirin Sohrabi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling

点击查看摘要

Abstract:The generation of planning domains from natural language descriptions remains an open problem even with the advent of large language models and reasoning models. Recent work suggests that while LLMs have the ability to assist with domain generation, they are still far from producing high quality domains that can be deployed in practice. To this end, we investigate the ability of an agentic language model feedback framework to generate planning domains from natural language descriptions that have been augmented with a minimal amount of symbolic information. In particular, we evaluate the quality of the generated domains under various forms of symbolic feedback, including landmarks, and output from the VAL plan validator. Using these feedback mechanisms, we experiment using heuristic search over model space to optimize domain quality.

[AI-56] Parameterized Complexity Of Representing Models Of MSO Formulas

【速读】:该论文旨在解决如何高效表示满足Monadic Second Order Logic (MSO2)公式中自由变量的图模型问题。传统Courcelle定理仅提供参数化线性时间算法来判断图是否满足给定MSO2公式,但未涉及对满足公式的模型本身的紧凑表示。论文的关键解决方案是证明:当以图的树宽(treewidth)或路径宽(pathwidth)作为参数时,这些模型可以用决策图(decision diagram)进行紧凑表示——具体而言,sentential decision diagram (SDD) 的大小在树宽上呈参数化线性关系,而有序二叉决策图(OBDD)的大小在路径宽上也呈参数化线性关系。此外,通过结合Razgon(2014)关于OBDD下界的结论,论文进一步揭示了某些具有有界树宽的图类无法用OBDD实现基于树宽的参数化线性压缩,从而从知识表示角度深化了对Courcelle定理的理解。

链接: https://arxiv.org/abs/2604.08707
作者: Petr Kučera,Petr Martinek
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
备注:

点击查看摘要

Abstract:Monadic second order logic (MSO2) plays an important role in parameterized complexity due to Courcelle’s theorem. This theorem states that the problem of checking if a given graph has a property specified by a given MSO2 formula can be solved by a parameterized linear time algorithm with respect to the treewidth of the graph and the size of the formula. We extend this result by showing that models of an MSO2 formula with free variables can be represented with a decision diagram whose size is parameterized linear in the above-mentioned parameter. In particular, we show a parameterized linear upper bound on the size of a sentential decision diagram (SDD) when treewidth is considered and a parameterized linear upper bound on the size of an ordered binary decision diagram (OBDD) when considering the pathwidth in the parameter. In addition, building on a lower bound on the size of OBDD by Razgon (2014), we show that there is an MSO2 formula and a class of graphs with bounded treewidth which do not admit an OBDD with the size parameterized by the treewidth. Our result offers a new perspective on Courcelle’s theorem and connects it to the area of knowledge representation.

[AI-57] RAMP: Hybrid DRL for Online Learning of Numeric Action Models AAMAS2026

【速读】:该论文旨在解决数值域(numeric domains)中自动化规划算法依赖人工构建动作模型(action model)的难题,而现有基于观测的学习方法多为离线模式,需专家提供的轨迹作为输入。其解决方案的关键在于提出一种在线学习策略——RAMP(Reinforcement learning, Action Model learning, and Planning),该策略通过强化学习(Reinforcement Learning, RL)、动作模型学习与规划三者的协同作用,在与环境的交互中实时更新动作模型,并利用该模型进行未来行动规划,从而形成“RL收集数据优化模型,规划器生成计划反哺RL训练”的正向反馈闭环。这一机制显著提升了在标准IPC数值领域中的可解性和计划质量,优于传统的PPO等深度强化学习算法。

链接: https://arxiv.org/abs/2604.08685
作者: Yarin Benyamin,Argaman Mordoch,Shahaf S. Shperberg,Roni Stern
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as a workshop paper at the Adaptive and Learning Agents (ALA) Workshop at AAMAS 2026

点击查看摘要

Abstract:Automated planning algorithms require an action model specifying the preconditions and effects of each action, but obtaining such a model is often hard. Learning action models from observations is feasible, but existing algorithms for numeric domains are offline, requiring expert traces as input. We propose the Reinforcement learning, Action Model learning, and Planning (RAMP) strategy for learning numeric planning action models online via interactions with the environment. RAMP simultaneously trains a Deep Reinforcement Learning (DRL) policy, learns a numeric action model from past interactions, and uses that model to plan future actions when possible. These components form a positive feedback loop: the RL policy gathers data to refine the action model, while the planner generates plans to continue training the RL policy. To facilitate this integration of RL and numeric planning, we developed Numeric PDDLGym, an automated framework for converting numeric planning problems to Gym environments. Experimental results on standard IPC numeric domains show that RAMP significantly outperforms PPO, a well-known DRL algorithm, in terms of solvability and plan quality.

[AI-58] LEGO: Latent-space Exploration for Geometry-aware Optimization of Humanoid Kinematic Design ICRA2026

【速读】:该论文旨在解决机器人形态(morphology)与运动学(kinematics)设计长期依赖人类直觉、缺乏系统性基础的问题,以及在运动设计协同优化中面临的两大挑战:一是设计空间庞大且无结构化,二是难以构建任务特定的损失函数。其解决方案的关键在于提出一种数据驱动的新范式:首先通过学习已有机械设计来自动构建紧凑且几何保真的潜在设计空间(使用基于螺旋理论的关节轴表示和等距流形学习),其次直接从人类运动数据中定义损失函数,借助运动重定向(motion retargeting)与Procrustes分析实现无需人工干预的设计优化。该方法在潜在空间中采用无梯度优化策略,从而实现了自动化、可扩展的机器人设计发现过程。

链接: https://arxiv.org/abs/2604.08636
作者: Jihwan Yoon,Taemoon Jeong,Jeongeun Park,Chanwoo Kim,Jaewoon Kwon,Yonghyeon Lee,Kyungjae Lee,Sungjoon Choi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted at ICRA 2026

点击查看摘要

Abstract:Designing robot morphologies and kinematics has traditionally relied on human intuition, with little systematic foundation. Motion-design co-optimization offers a promising path toward automation, but two major challenges remain: (i) the vast, unstructured design space and (ii) the difficulty of constructing task-specific loss functions. We propose a new paradigm that minimizes human involvement by (i) learning the design search space from existing mechanical designs, rather than hand-crafting it, and (ii) defining the loss directly from human motion data via motion retargeting and Procrustes analysis. Using screw-theory-based joint axis representation and isometric manifold learning, we construct a compact, geometry-preserving latent space of humanoid upper body designs in which optimization is tractable. We then solve design optimization in this latent space using gradient-free optimization. Our approach establishes a principled framework for data-driven robot design and demonstrates that leveraging existing designs and human motion can effectively guide the automated discovery of novel robot design.

[AI-59] Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation CVPR2026

【速读】:该论文旨在解决预训练模型(pretrained models)缺乏可靠置信度估计的问题,尤其是在实际部署中,现有不确定性估计方法(如深度集成和MC dropout)计算开销过大,而Evidential Deep Learning(EDL)虽效率更高但要求模型从训练初期就输出证据量(evidential quantities),这在大多数预训练模型中难以实现。解决方案的关键在于提出一种轻量级的后处理模块——证据变换网络(Evidential Transformation Network, ETN),其在logit空间中学习样本相关的仿射变换,并将变换后的输出解释为Dirichlet分布参数,从而实现对预训练模型的不确定性估计,且在保持原有精度的同时仅引入极小的计算开销。

链接: https://arxiv.org/abs/2604.08627
作者: Yongchan Chun,Chanhee Park,Jeongho Yoon,Jaehyung Seo,Heuiseok Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026 (Highlight)

点击查看摘要

Abstract:Pretrained models have become standard in both vision and language, yet they typically do not provide reliable measures of confidence. Existing uncertainty estimation methods, such as deep ensembles and MC dropout, are often too computationally expensive to deploy in practice. Evidential Deep Learning (EDL) offers a more efficient alternative, but it requires models to be trained to output evidential quantities from the start, which is rarely true for pretrained networks. To enable EDL-style uncertainty estimation in pretrained models, we propose the Evidential Transformation Network (ETN), a lightweight post-hoc module that converts a pretrained predictor into an evidential model. ETN operates in logit space: it learns a sample-dependent affine transformation of the logits and interprets the transformed outputs as parameters of a Dirichlet distribution for uncertainty estimation. We evaluate ETN on image classification and large language model question-answering benchmarks under both in-distribution and out-of-distribution settings. ETN consistently improves uncertainty estimation over post-hoc baselines while preserving accuracy and adding only minimal computational overhead.
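
证据式深度学习将网络输出解释为 Dirichlet 分布参数,由"证据总量"直接给出不确定度。下面用一个最小草图示意这一约定(softplus 证据函数与 u = K/S 的不确定度公式为 EDL 文献中的常见做法,并非 ETN 模块的具体实现):

```python
import math

def dirichlet_uncertainty(logits):
    """将logits映射为非负证据,构造Dirichlet分布并计算不确定度(常见EDL约定)。"""
    # softplus保证证据非负: evidence = log(1 + exp(logit))
    evidence = [math.log1p(math.exp(x)) for x in logits]
    alpha = [e + 1.0 for e in evidence]      # Dirichlet参数: alpha = evidence + 1
    strength = sum(alpha)                    # Dirichlet强度 S
    probs = [a / strength for a in alpha]    # 期望类别概率
    uncertainty = len(alpha) / strength      # u = K / S,证据越充分不确定度越低
    return probs, uncertainty

# 证据充分(高logit)与证据不足(近零logit)两种情形对比
_, u_confident = dirichlet_uncertainty([8.0, 0.0, 0.0])
_, u_unsure = dirichlet_uncertainty([0.1, 0.0, 0.0])
```

证据越强 u 越小;按摘要所述,ETN 的后处理思路是在不重训主干的前提下,先对 logits 做样本相关的仿射变换,再做这类 Dirichlet 解释。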

[AI-60] Practical Bayesian Inference for Speech SNNs: Uncertainty and Loss-Landscape Smoothing

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在处理时序数据时因阈值触发机制导致的预测景观不规则问题,这种不规则性源于确定性权重下产生的角状或非平滑的决策边界。解决方案的关键在于引入贝叶斯学习方法对网络权重进行建模,以获得更平滑、更规律的预测分布;同时,在基于替代梯度(surrogate-gradient)训练的SNN中进一步采用改进的变分在线牛顿法(Improved Variational Online Newton, IVON),提升变分推断效率。实验表明,该方法在Heidelberg Digits和Speech Commands数据集上显著改善了负对数似然(negative log-likelihood)和Brier评分,并通过权重空间的一维切片分析验证了其预测景观的平滑性和规律性优势。

链接: https://arxiv.org/abs/2604.08624
作者: Yesmine Abdennadher,Philip N. Garner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are naturally suited for speech processing tasks due to their specific dynamics, which allow them to handle temporal data. However, the threshold-based generation of spikes in SNNs intuitively causes an angular or irregular predictive landscape. We explore the effect of using the Bayesian learning approach for the weights on the irregular predictive landscape. For the surrogate-gradient SNNs, we also explore the application of the Improved Variational Online Newton (IVON) approach, which is an efficient variational approach. The performance of the proposed approach is evaluated on the Heidelberg Digits and Speech Commands datasets. The hypothesis is that the Bayesian approach will result in a smoother and more regular predictive landscape, given the angular nature of the deterministic predictive landscape. The experimental evaluation of the proposed approach shows improved performance on the negative log-likelihood and Brier score. Furthermore, the proposed approach has resulted in a smoother and more regular predictive landscape compared to the deterministic approach, based on the one-dimensional slices of the weight space.
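
摘要中用作评价指标之一的 Brier 分数,是预测概率分布与 one-hot 真实标签之间的平方误差,这里给出其通用定义的示意计算(指标本身的标准定义,与论文模型无关):

```python
def brier_score(probs, label):
    """多类Brier分数: 预测分布与one-hot真实标签的平方差之和,越低越好。"""
    return sum((p - (1.0 if i == label else 0.0)) ** 2
               for i, p in enumerate(probs))

confident_right = brier_score([0.9, 0.05, 0.05], 0)  # 自信且正确
confident_wrong = brier_score([0.9, 0.05, 0.05], 1)  # 自信但错误,受到重罚
```

与负对数似然类似,Brier 分数同时惩罚"错"与"过度自信",因此常被用来衡量贝叶斯方法带来的校准改进。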

[AI-61] StructRL: Recovering Dynamic Programming Structure from Learning Dynamics in Distributional Reinforcement Learning

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)中学习过程缺乏结构化信息的问题,即传统RL方法通常被视为一种均匀的数据驱动优化过程,其更新依赖于奖励和时序差分误差,而未显式利用全局状态空间的结构。解决方案的关键在于通过分析分布式强化学习(Distributional Reinforcement Learning)的学习动态,识别出能够反映状态空间中学习发生时机与位置的信号——具体而言,提出了一个时间学习指标 $ t^*(s) $,用于刻画每个状态 $ s $ 在训练过程中经历最强学习更新的时间点。该指标诱导的状态排序与动态规划(Dynamic Programming, DP)风格的信息传播一致;基于此,作者进一步提出 StructRL 框架,利用这些信号引导采样,从而在无需显式环境模型的情况下,恢复并利用类似动态规划的结构化传播机制,使学习过程从纯粹的均匀优化转变为结构化的传播过程。

链接: https://arxiv.org/abs/2604.08620
作者: Ivo Nowak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning is typically treated as a uniform, data-driven optimization process, where updates are guided by rewards and temporal-difference errors without explicitly exploiting global structure. In contrast, dynamic programming methods rely on structured information propagation, enabling efficient and stable learning. In this paper, we provide evidence that such structure can be recovered from the learning dynamics of distributional reinforcement learning. By analyzing the temporal evolution of return distributions, we identify signals that capture when and where learning occurs in the state space. In particular, we introduce a temporal learning indicator t*(s) that reflects when a state undergoes its strongest learning update during training. Empirically, this signal induces an ordering over states that is consistent with a dynamic programming-style propagation of information. Building on this observation, we propose StructRL, a framework that exploits these signals to guide sampling in alignment with the emerging propagation structure. Our preliminary results suggest that distributional learning dynamics provide a mechanism to recover and exploit dynamic programming-like structure without requiring an explicit model. This offers a new perspective on reinforcement learning, where learning can be interpreted as a structured propagation process rather than a purely uniform optimization procedure.
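
时间学习指标 t*(s) 可以理解为:对每个状态取训练过程中更新幅度最大的时间步。下面用虚构的更新幅度序列示意这一计算(数据仅为说明用,并非论文的实验结果或具体实现):

```python
def temporal_learning_indicator(update_magnitudes):
    """对每个状态s,返回其更新幅度最大的训练步 t*(s)。
    update_magnitudes: {状态名: [各训练步的更新幅度]}(虚构示例数据)。"""
    return {s: max(range(len(m)), key=m.__getitem__)
            for s, m in update_magnitudes.items()}

# 若靠近奖励的状态先被学到、远端状态随后更新,则t*的排序与DP式信息传播一致
history = {"near_goal": [0.1, 0.9, 0.3, 0.1],
           "far_state": [0.0, 0.1, 0.4, 0.8]}
t_star = temporal_learning_indicator(history)
```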

[AI-62] Semantic Intent Fragmentation: A Single-Shot Compositional Attack on Multi-Agent AI Pipelines AAAI2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)编排系统中因任务分解导致的隐蔽安全违规问题,即“语义意图碎片化”(Semantic Intent Fragmentation, SIF)攻击。此类攻击利用LLM编排器将合法请求拆分为多个看似无害的子任务,而这些子任务单独通过现有安全机制检测,但在组合后却违反安全策略,从而绕过当前以子任务为单位的安全防护体系。解决方案的关键在于引入计划级信息流追踪(plan-level information-flow tracking)与合规性评估相结合的方法,在执行前识别出由多个良性子任务构成的恶意整体行为,有效填补了现有基于子任务的安全机制所存在的组合性安全漏洞。

链接: https://arxiv.org/abs/2604.08608
作者: Tanzim Ahad,Ismail Hossain,Md Jahangir Alam,Sai Puppala,Yoonpyo Lee,Syed Bahauddin Alam,Sajedul Talukder
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper got accepted for AAAI 2026 Summer Symposium

点击查看摘要

Abstract:We introduce Semantic Intent Fragmentation (SIF), an attack class against LLM orchestration systems where a single, legitimately phrased request causes an orchestrator to decompose a task into subtasks that are individually benign but jointly violate security policy. Current safety mechanisms operate at the subtask level, so each step clears existing classifiers – the violation only emerges at the composed plan. SIF exploits OWASP LLM06:2025 through four mechanisms: bulk scope escalation, silent data exfiltration, embedded trigger deployment, and quasi-identifier aggregation, requiring no injected content, no system modification, and no attacker interaction after the initial request. We construct a three-stage red-teaming pipeline grounded in OWASP, MITRE ATLAS, and NIST frameworks to generate realistic enterprise scenarios. Across 14 scenarios spanning financial reporting, information security, and HR analytics, a GPT-20B orchestrator produces policy-violating plans in 71% of cases (10/14) while every subtask appears benign. Three independent signals validate this: deterministic taint analysis, chain-of-thought evaluation, and a cross-model compliance judge with 0% false positives. Stronger orchestrators increase SIF success rates. Plan-level information-flow tracking combined with compliance evaluation detects all attacks before execution, showing the compositional safety gap is closable.

[AI-63] Joint Interference Detection and Identification via Adversarial Multi-task Learning

【速读】:该论文旨在解决非合作无线环境中通信系统在干扰检测与识别任务中存在性能瓶颈的问题,特别是现有单任务学习(Single-Task Learning, STL)方法忽视任务间内在关联性,而现有多任务学习(Multi-Task Learning, MTL)方法缺乏理论支撑以量化和建模任务关系。解决方案的关键在于构建一个具有理论基础的MTL框架,通过推导加权期望损失的上界,将多任务性能与任务相似性(由Wasserstein距离和可学习的任务关系系数刻画)明确关联;进而提出对抗性多任务干扰检测与识别网络(Adversarial Multi-Task Interference Detection and Identification Network, AMTIDIN),该网络融合对抗训练以最小化不同任务间的分布差异,并引入自适应系数动态建模任务相关性,从而显著提升模型在低信噪比、短信号长度及数据稀缺等挑战条件下的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2604.08607
作者: H. Xu,B. He,S. Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Theory (cs.IT)
备注: 13 pages, 13 figures. Submitted to IEEE Transactions on Cognitive Communications and Networking

点击查看摘要

Abstract:Precise interference detection and identification are crucial for enhancing the survivability of communication systems in non-cooperative wireless environments. While deep learning (DL) has advanced this field, existing single-task learning (STL) approaches neglect inherent task correlations. Furthermore, emerging multi-task learning (MTL) methods often lack a theoretical foundation for quantifying and modeling task relationships. To bridge this gap, we establish a theoretically grounded MTL framework for joint interference detection, modulation identification, and interference identification. First, we derive an upper bound for the weighted expected loss in MTL frameworks. This bound explicitly connects MTL performance to task similarity, quantified by the Wasserstein distance and learnable task relation coefficients. Guided by this theory, we present the adversarial multi-task interference detection and identification network (AMTIDIN), which integrates adversarial training to minimize distributional discrepancies across tasks and uses adaptive coefficients to model task correlations dynamically. Crucially, we conducted a quantitative analysis of task similarity to reveal intrinsic task relationships, specifically that modulation identification and interference identification share a substantial feature overlap distinct from interference detection. Extensive comparative experiments demonstrate that AMTIDIN significantly outperforms both its task-specific STL baseline and state-of-the-art MTL baselines in robustness and generalization, particularly under challenging conditions with limited training data, short signal lengths, and low signal-to-noise ratios (SNRs).
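
摘要中用 Wasserstein 距离量化任务相似性。对一维、等样本量的经验分布,W1 距离等于排序后逐点绝对差的均值,可用如下纯 Python 草图示意(仅为概念说明,并非论文中损失上界的推导):

```python
def wasserstein_1d(xs, ys):
    """等样本量一维经验分布间的W1距离: 排序后逐点绝对差的均值。"""
    assert len(xs) == len(ys), "此草图假设两组样本数量相同"
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# 两个任务的特征分布整体平移1个单位,W1距离即为1
d = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```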

[AI-64] Extrapolating Volition with Recursive Information Markets AAMAS-2026

【速读】:该论文旨在解决信息市场中存在的信息不对称问题,尤其是由“买方检验悖论”(buyer’s inspection paradox)加剧的困境——即买方无法通过检查信息来降低不对称性,因为一旦检验便无需付费即可获得信息。解决方案的关键在于利用大语言模型(Large Language Model, LLM)作为买方,其可通过“遗忘”机制实现对信息的临时访问而不保留内容,从而在不破坏定价激励的前提下完成信息评估与购买。论文进一步提出一种递归版本的机制,强调其在AI对齐研究中的潜在应用,如与外推意愿(Extrapolated Volition)和可扩展监督(Scalable Oversight)相关的情境中,能够促使信息按其真实价值被合理定价与提供。

链接: https://arxiv.org/abs/2604.08606
作者: Abhimanyu Pallavi Sudhir,Long Tran-Thanh
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
备注: Accepted to Games, Agents and Incentives Workshop at AAMAS-2026

点击查看摘要

Abstract:One of the impediments to the efficiency of information markets is the inherent information asymmetry present in them, exacerbated by the “buyer’s inspection paradox” (the buyer cannot mitigate the asymmetry by “inspecting” the information, because in doing so the buyer obtains the information without paying for it). Previous work has suggested that using Large Language Model (LLM) buyers to inspect and purchase information could overcome this information asymmetry, as an LLM buyer can simply “forget” the information it inspects. In this work, we analyze this mechanism formally through a “value-of-information” paradigm, i.e. whether it incentivizes information to be priced and provided in accordance with its “true value”. We focus in particular on our new recursive version of the mechanism, which we believe has a range of applications including in AI alignment research, where it is related to Extrapolated Volition and Scalable Oversight.

[AI-65] Ab Review Plugin: A Browser-Based Tool for AI-Assisted Title and Abstract Screening

【速读】:该论文旨在解决系统性文献综述(systematic review)中标题和摘要筛选环节的两大痛点:一是基于服务器的筛查工具存在订阅费用,二是开源方案通常需要编程技能才能使用。其解决方案的关键在于开发了一个无代码(no-code)、无需服务器(serverless)的浏览器扩展插件——TiAb Review Plugin,该插件利用Google Sheets作为共享数据库实现多评审者协作,并通过本地加密存储用户自有的Gemini API密钥保障安全性;同时集成大语言模型(LLM)批量筛选与机器学习(ML)主动学习两种智能筛选模式,其中ML部分在浏览器内以TypeScript重实现了ASReview默认算法(TF-IDF + 朴素贝叶斯),并通过10折交叉验证确认了与原Python版本结果完全一致,从而实现了高性能、易用且可扩展的自动化筛选流程。

链接: https://arxiv.org/abs/2604.08602
作者: Yuki Kataoka,Masahiro Banno,Michihito Kyo,Shuri Nakao,Tomoo Sato,Shunsuke Taito,Tomohiro Takayama,Takahiro Tsuge,Yasushi Tsujimoto,Ryuhei So,Toshi A. Furukawa
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 2 figures. Abstract submitted to Cochrane Colloquium 2026. Code: this https URL

点击查看摘要

Abstract:Background: Server-based screening tools impose subscription costs, while open-source alternatives require coding skills. Objectives: We developed a browser extension that provides no-code, serverless artificial intelligence (AI)-assisted title and abstract screening and examined its functionality. Methods: TiAb Review Plugin is an open-source Chrome browser extension (available at this https URL). It uses Google Sheets as a shared database, requiring no dedicated server and enabling multi-reviewer collaboration. Users supply their own Gemini API key, stored locally and encrypted. The tool offers three screening modes: manual review, large language model (LLM) batch screening, and machine learning (ML) active learning. For ML evaluation, we re-implemented the default ASReview active learning algorithm (TF-IDF with Naive Bayes) in TypeScript to enable in-browser execution, and verified equivalence against the original Python implementation using 10-fold cross-validation on six datasets. For LLM evaluation, we compared 16 parameter configurations across two model families on a benchmark dataset, then validated the optimal configuration (Gemini 3.0 Flash, low thinking budget, TopP=0.95) with a sensitivity-oriented prompt on five public datasets (1,038 to 5,628 records, 0.5 to 2.0 percent prevalence). Results: The TypeScript classifier produced top-100 rankings 100 percent identical to the original ASReview across all six datasets. For LLM screening, recall was 94 to 100 percent with precision of 2 to 15 percent, and Work Saved over Sampling at 95 percent recall (WSS@95) ranged from 48.7 to 87.3 percent. Conclusions: We developed a functional browser extension that integrates LLM screening and ML active learning into a no-code, serverless environment, ready for practical use in systematic review screening.
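
摘要中的 WSS@95(召回率达到 95% 时相对随机抽样节省的筛查工作量)可按常见定义 WSS@R = (N - 已筛数)/N - (1 - R) 计算,示意如下(该指标的通用实现,并非插件源码):

```python
import math

def wss_at_recall(ranked_labels, recall=0.95):
    """WSS@R: 按模型打分从高到低筛查,达到目标召回率R时
    相对随机抽样节省的工作量比例。ranked_labels中1表示相关文献。"""
    n = len(ranked_labels)
    total_pos = sum(ranked_labels)
    needed = math.ceil(recall * total_pos)   # 需找到的相关文献数
    found, screened = 0, n
    for i, y in enumerate(ranked_labels, start=1):
        found += y
        if found >= needed:
            screened = i                     # 筛查到第i篇即可停止
            break
    return (n - screened) / n - (1.0 - recall)

# 100篇中20篇相关: 若前19篇即覆盖95%召回,可节省76%的筛查工作
good_ranking = [1] * 19 + [0] * 80 + [1]
```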

[AI-66] OpenKedge: Governing Agentic Mutation with Execution-Bound Safety and Evidence Chains

【速读】:该论文旨在解决当前以API为中心的架构在面对自主AI代理(Autonomous AI Agents)时存在的根本性缺陷:概率性系统在缺乏充分上下文、协调机制和安全保障的情况下直接执行状态变更(state mutations),从而引发不可控的行为风险。其解决方案的关键在于提出OpenKedge协议,该协议将状态变更重构为一个受控过程——要求参与者提交声明式意图提案(declarative intent proposals),并在执行前基于确定性推导的系统状态、时间信号与策略约束进行评估;获批意图被编译为带有严格动作权限、资源范围和时效限制的执行合约(execution contracts),并通过临时的任务导向身份强制执行;同时引入意图到执行证据链(Intent-to-Execution Evidence Chain, IEEC),通过密码学方式关联意图、上下文、策略决策、执行边界及结果,实现可验证、可追溯的状态变更,从而将安全性从被动过滤转变为预防性的、绑定执行的保障机制。

链接: https://arxiv.org/abs/2604.08601
作者: Jun He,Deying Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 3 figures, 2 tables

点击查看摘要

Abstract:The rise of autonomous AI agents exposes a fundamental flaw in API-centric architectures: probabilistic systems directly execute state mutations without sufficient context, coordination, or safety guarantees. We introduce OpenKedge, a protocol that redefines mutation as a governed process rather than an immediate consequence of API invocation. OpenKedge requires actors to submit declarative intent proposals, which are evaluated against deterministically derived system state, temporal signals, and policy constraints prior to execution. Approved intents are compiled into execution contracts that strictly bound permitted actions, resource scope, and time, and are enforced via ephemeral, task-oriented identities. This shifts safety from reactive filtering to preventative, execution-bound enforcement. Crucially, OpenKedge introduces an Intent-to-Execution Evidence Chain (IEEC), which cryptographically links intent, context, policy decisions, execution bounds, and outcomes into a unified lineage. This transforms mutation into a verifiable and reconstructable process, enabling deterministic auditability and reasoning about system behavior. We evaluate OpenKedge across multi-agent conflict scenarios and cloud infrastructure mutations. Results show that the protocol deterministically arbitrates competing intents and cages unsafe execution while maintaining high throughput, establishing a principled foundation for safely operating agentic systems at scale.
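
IEEC 证据链"密码学链接"的思想可以用哈希链作最小示意:每个环节的哈希覆盖前一环节的哈希,任何篡改都会破坏后续校验(字段名与记录结构为假设性示例,并非 OpenKedge 协议定义):

```python
import hashlib
import json

def chain_append(chain, record):
    """将新记录与前一环节的哈希绑定后追加,形成证据链。"""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    chain.append({"prev": prev_hash, "record": record,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})
    return chain

def verify(chain):
    """逐环节重算哈希: 任何一处记录被篡改,校验即失败。"""
    prev = "0" * 64
    for e in chain:
        payload = json.dumps({"prev": prev, "record": e["record"]}, sort_keys=True)
        if e["prev"] != prev or e["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev = e["hash"]
    return True

# 意图 -> 策略决策 -> 执行结果 依次上链(字段名为假设性示例)
chain = []
for rec in ({"intent": "scale-up"}, {"policy": "approved"}, {"outcome": "ok"}):
    chain_append(chain, rec)
```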

[AI-67] STIndex: A Context-Aware Multi-Dimensional Spatiotemporal Information Extraction System

【速读】:该论文旨在解决从非结构化数据中提取结构化知识时面临的三大挑战:实体与事件抽取管道脆弱、知识图谱构建需高成本本体工程,以及跨域泛化能力不足。其解决方案的关键在于引入一个端到端系统 STIndex,该系统将非结构化内容结构化为多维时空数据仓库,通过用户定义的领域特定分析维度(可配置层次)与大语言模型(Large Language Models, LLMs)结合实现上下文感知的抽取与锚定;同时集成文档级记忆、地理编码校正与质量验证机制,并提供交互式分析仪表盘支持可视化、聚类、突现检测及实体网络分析,从而显著提升时空实体抽取性能(在公共卫生基准上F1值分别提升4.37%和3.60%)。

链接: https://arxiv.org/abs/2604.08597
作者: Wenxiao Zhang,Yu Liu,Qiang sun,Yihao Ding,Sirui Li,Yanbing Liu,Jin B. Hong,Wei Liu
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Extracting structured knowledge from unstructured data still faces practical limitations: entity and event extraction pipelines remain brittle, knowledge graph construction requires costly ontology engineering, and cross-domain generalization is rarely production-ready. In contrast, space and time provide universal contextual anchors that naturally align heterogeneous information and benefit downstream tasks such as retrieval and reasoning. We introduce STIndex, an end-to-end system that structures unstructured content into a multidimensional spatiotemporal data warehouse. Users define domain-specific analysis dimensions with configurable hierarchies, while large language models perform context-aware extraction and grounding. STIndex integrates document-level memory, geocoding correction, and quality validation, and offers an interactive analytics dashboard for visualization, clustering, burst detection, and entity network analysis. In evaluation on a public health benchmark, STIndex improves spatiotemporal entity extraction F1 by 4.37% (GPT-4o-mini) and 3.60% (Qwen3-8B). A live demonstration and open-source code are available at this https URL.

[AI-68] From Dispersion to Attraction: Spectral Dynamics of Hallucination Across Whisper Model Scales INTERSPEECH2026

【速读】:该论文旨在解决大规模自动语音识别(ASR)模型中存在的幻觉问题,这被视为一项关键的安全风险。其解决方案的核心是提出并验证了“谱敏感性定理”(Spectral Sensitivity Theorem),该理论揭示了深度网络在层间增益(gain)与对齐(alignment)控制下,会经历从弥散态(dispersive regime,信号衰减)到吸引子态(attractor regime,秩-1坍缩)的相变现象。通过分析Whisper系列模型(从Tiny到Large-v3-Turbo)在对抗扰动下的激活图特征谱,研究发现:中等规模模型呈现结构解体状态(Regime I),表现为交叉注意力(Cross-Attention)秩下降13.4%;而大型模型则进入压缩导向吸引子状态(Regime II),自注意力机制主动压缩秩(-2.34%),并硬化谱斜率,使模型与声学证据脱钩,从而解释了幻觉产生的内在机制。

链接: https://arxiv.org/abs/2604.08591
作者: Ivan Viakhirev,Kirill Borodin,Grach Mkrtchian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been submitted to Interspeech 2026 for review

点击查看摘要

Abstract:Hallucinations in large ASR models present a critical safety risk. In this work, we propose the Spectral Sensitivity Theorem, which predicts a phase transition in deep networks from a dispersive regime (signal decay) to an attractor regime (rank-1 collapse) governed by layer-wise gain and alignment. We validate this theory by analyzing the eigenspectra of activation graphs in Whisper models (Tiny to Large-v3-Turbo) under adversarial stress. Our results confirm the theoretical prediction: intermediate models exhibit Structural Disintegration (Regime I), characterized by a 13.4% collapse in Cross-Attention rank. Conversely, large models enter a Compression-Seeking Attractor state (Regime II), where Self-Attention actively compresses rank (-2.34%) and hardens the spectral slope, decoupling the model from acoustic evidence.
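
摘要中的"秩坍缩"常用特征谱的参与率(participation ratio)来度量:PR = (Σλ)² / Σλ²,谱均匀时等于特征值个数,集中于单一特征值(秩-1 吸引子)时趋近 1(通用示意计算,并非论文采用的具体谱指标):

```python
def participation_ratio(eigenvalues):
    """特征谱的有效秩 PR = (sum(λ))^2 / sum(λ^2):
    谱均匀时等于特征值个数,秩-1坍缩时趋近1。"""
    s1 = sum(eigenvalues)
    s2 = sum(x * x for x in eigenvalues)
    return s1 * s1 / s2

flat = participation_ratio([1.0, 1.0, 1.0, 1.0])          # 均匀谱,有效秩为4
collapsed = participation_ratio([1.0, 1e-6, 1e-6, 1e-6])  # 谱集中,有效秩趋近1
```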

[AI-69] AlphaLab: Autonomous Multi-Agent Research Across Optimization Domains with Frontier LLM s

【速读】:该论文旨在解决定量计算密集型领域中科研流程自动化的问题,即如何在无需人工干预的情况下完成从数据探索、实验设计到大规模GPU计算的完整研究闭环。其解决方案的关键在于构建一个名为AlphaLab的自主研究框架,该框架利用前沿大语言模型(LLM)的智能体(agent)能力,通过三个阶段实现端到端自动化:首先基于自然语言目标自适应地分析数据并生成研究报告;其次自主构建并对抗性验证评估体系;最后通过“策略制定者/执行者”(Strategist/Worker)循环运行大规模GPU实验,并将领域知识沉淀至持久化的“剧本”(playbook)中,形成在线提示优化机制。整个系统通过模型自动生成的适配器(adapter)处理不同质的任务,无需人工修改代码,从而实现了跨领域的通用性与高效性。

链接: https://arxiv.org/abs/2604.08590
作者: Brendan R. Hogan,Xiwen Chen,James T. Wilson,Kashif Rasul,Adel Boyarsky,Thomas Kamei,Anderson Schneider,Yuriy Nevmyvaka
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 43 pages, 12 figures

点击查看摘要

Abstract:We present AlphaLab, an autonomous research harness that leverages frontier LLM agentic capabilities to automate the full experimental cycle in quantitative, computation-intensive domains. Given only a dataset and a natural-language objective, AlphaLab proceeds through three phases without human intervention: (1) it adapts to the domain and explores the data, writing analysis code and producing a research report; (2) it constructs and adversarially validates its own evaluation framework; and (3) it runs large-scale GPU experiments via a Strategist/Worker loop, accumulating domain knowledge in a persistent playbook that functions as a form of online prompt optimization. All domain-specific behavior is factored into adapters generated by the model itself, so the same pipeline handles qualitatively different tasks without modification. We evaluate AlphaLab with two frontier LLMs (GPT-5.2 and Claude Opus 4.6) on three domains: CUDA kernel optimization, where it writes GPU kernels that run 4.4x faster than this http URL on average (up to 91x); LLM pretraining, where the full system achieves 22% lower validation loss than a single-shot baseline using the same model; and traffic forecasting, where it beats standard baselines by 23-25% after researching and implementing published model families from the literature. The two models discover qualitatively different solutions in every domain (neither dominates uniformly), suggesting that multi-model campaigns provide complementary search coverage. We additionally report results on financial time series forecasting in the appendix, and release all code at this https URL.

[AI-70] Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自动化决策中如何合理判断何时自主执行任务、何时应将决策权上交给人类的问题,即“决策阈值”问题。核心挑战在于模型对自身预测置信度的估计存在偏差,且不同模型在成本权衡(行动 vs. 升级)上的隐式阈值差异显著,且不受模型架构或规模影响。解决方案的关键在于通过显式训练模型推理不确定性与决策成本的关系,特别是采用思维链(Chain-of-Thought, CoT)监督微调(Supervised Fine-Tuning, SFT),使模型能够稳定地遵循预设的升级规则,并在多个数据集、成本比和任务领域间实现良好泛化。这表明,提升模型的决策鲁棒性需将其 escalation 行为作为可建模的属性进行专门训练,而非依赖默认行为或简单提示。

链接: https://arxiv.org/abs/2604.08588
作者: Matthew DosSantos DiSorbo,Harang Ju
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective automation hinges on deciding when to act and when to escalate. We model this as a decision under uncertainty: an LLM forms a prediction, estimates its probability of being correct, and compares the expected costs of acting and escalating. Using this framework across five domains of recorded human decisions (demand forecasting, content recommendation, content moderation, loan approval, and autonomous driving) and across multiple model families, we find marked differences in the implicit thresholds models use to trade off these costs. These thresholds vary substantially and are not predicted by architecture or scale, while self-estimates are miscalibrated in model-specific ways. We then test interventions that target this decision process by varying cost ratios, providing accuracy signals, and training models to follow the desired escalation rule. Prompting helps mainly for reasoning models. SFT on chain-of-thought targets yields the most robust policies, which generalize across datasets, cost ratios, prompt framings, and held-out domains. These results suggest that escalation behavior is a model-specific property that should be characterized before deployment, and that robust alignment benefits from training models to reason explicitly about uncertainty and decision costs.
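
文中"行动 vs. 升级"的期望成本比较可以最小化地形式化:行动的期望成本为 (1-p)·c_error,升级固定付出 c_escalate,由此得到临界置信度 p* = 1 - c_escalate/c_error(成本记号与数值均为示意假设,并非论文的具体参数):

```python
def decide(p_correct, cost_error, cost_escalate):
    """期望成本比较: 行动的期望成本为(1-p)*cost_error,升级固定付出cost_escalate。"""
    act_cost = (1.0 - p_correct) * cost_error
    return "act" if act_cost <= cost_escalate else "escalate"

def implied_threshold(cost_error, cost_escalate):
    """两种选择期望成本相等时的临界置信度 p* = 1 - c_esc / c_err。"""
    return 1.0 - cost_escalate / cost_error

# 错误代价为升级代价10倍时,模型置信度需达到0.9才值得自主行动
```

论文衡量的"隐式阈值",即各模型实际行为所对应的这一 p* 与理性阈值的偏差。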

[AI-71] FluidFlow: a flow-matching generative model for fluid dynamics surrogates on unstructured meshes

【速读】:该论文旨在解决计算流体动力学(Computational Fluid Dynamics, CFD)在多查询应用场景中因计算成本过高而难以实用的问题。传统方法依赖高保真CFD仿真,但其耗时特性限制了实时分析与优化等需求。为此,作者提出基于生成建模的替代方案——FluidFlow,其核心创新在于采用条件流匹配(conditional flow-matching)这一新兴生成模型框架,通过学习从噪声分布到真实CFD数据分布的确定性映射关系,构建可扩展的流体力学代理模型(surrogate model)。该方法无需对结构化或非结构化网格进行插值预处理,直接操作原始网格数据并保持几何保真度,同时结合U-Net和扩散Transformer(DiT)两种神经网络架构,在翼型压力系数预测与三维飞机全表面压力及摩擦系数预测两个复杂基准问题上均显著优于多层感知机基线模型,展现出更强的泛化能力和对大规模非结构化数据的可扩展性。

链接: https://arxiv.org/abs/2604.08586
作者: David Ramos,Lucas Lacasa,Fermín Gutiérrez,Eusebio Valero,Gonzalo Rubio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:Computational fluid dynamics (CFD) provides high-fidelity simulations of fluid flows but remains computationally expensive for many-query applications. In recent years deep learning (DL) has been used to construct data-driven fluid-dynamic surrogate models. In this work we consider a different learning paradigm and embrace generative modelling as a framework for constructing scalable fluid-dynamics surrogate models. We introduce FluidFlow, a generative model based on conditional flow-matching, a recent alternative to diffusion models that learns deterministic transport maps between noise and data distributions. FluidFlow is specifically designed to operate directly on CFD data defined on both structured and unstructured meshes alike, without the needs to perform any mesh interpolation pre-processing and preserving geometric fidelity. We assess the capabilities of FluidFlow using two different core neural network architectures, a U-Net and diffusion transformer (DiT), and condition their learning on physically meaningful parameters. The methodology is validated on two benchmark problems of increasing complexity: prediction of pressure coefficients along an airfoil boundary across different operating conditions, and prediction of pressure and friction coefficients over a full three-dimensional aircraft geometry discretized on a large unstructured mesh. In both cases, FluidFlow outperform strong multilayer perceptron baselines, achieving significantly lower error metrics and improved generalisation across operating conditions. Notably, the transformer-based architecture enables scalable learning on large unstructured datasets while maintaining high predictive accuracy. These results demonstrate that flow-matching generative models provide an effective and flexible framework for surrogate modelling in fluid dynamics, with potential for realistic engineering and scientific applications.
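
条件流匹配的核心训练构造,是在噪声 x0 与数据 x1 的线性插值路径上回归目标速度 v = x1 - x0,可用如下最小草图说明(线性插值路径为该类方法的常见设定,并非 FluidFlow 网络本身):

```python
def flow_matching_pair(x0, x1, t):
    """线性插值路径上的训练样本: 位置 x_t = (1-t)*x0 + t*x1,
    目标速度 v = x1 - x0(与插值时刻t无关)。"""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

def fm_loss(v_pred, v_target):
    """流匹配训练损失: 预测速度与目标速度的均方误差。"""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_pred)

# x0可视为噪声样本,x1为一条CFD解(此处用二维小向量示意)
x_t, v = flow_matching_pair([0.0, 0.0], [2.0, 4.0], 0.5)
```

推理时沿学得的速度场从噪声积分到 t=1,即可得到新样本;这也是它区别于扩散模型随机采样的"确定性映射"之处。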

[AI-72] QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference

[Quick Read]: This paper addresses the efficiency bottleneck that arises when Large Language Models (LLMs) equipped with Retrieval-Augmented Generation (RAG) rely on KV caching and selective token recomputation without global query awareness. Existing methods select tokens from a local perspective and fail to exploit the global semantics of the user query, limiting both inference efficiency and accuracy. The key to the solution is the QCFuse system, which builds semantic summary anchors around the user query to enhance query representations and selectively recomputes query-related tokens based on the attention distribution of the most critical Transformer layer. This preserves the efficiency of the pipeline structure while significantly improving response accuracy and overall efficiency: experiments show a 40% improvement in LLM response efficiency without loss of accuracy, and in certain scenarios an attention-denoising effect that further improves generation quality.

Link: https://arxiv.org/abs/2604.08585
Authors: Jianxin Yan,Zeheng Qian,Wangze Ni,Zhitao Shen,Zhiping Wang,Haoyang Li,Jia Zhu,Lei Chen,Kui Ren
Institution: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cache fusion accelerates the generation process of LLMs equipped with RAG through KV caching and selective token recomputation, thereby reducing computational costs and improving efficiency. However, existing methods primarily rely on local perspectives for token selection and lack global awareness from the user query. Utilizing this global awareness is challenging due to the high cost of obtaining context-aware query representations and the strict pipeline constraints required for efficient attention analysis. Thus, this demonstration introduces QCFuse, an innovative KV cache fusion system centered on the user query. QCFuse leverages semantic summary anchors to enhance query representations and selectively recomputes query-related tokens to improve accuracy, updating tokens based on the attention distribution of the most critical Transformer layer to preserve the high efficiency of the pipeline structure. Evaluations on real-world datasets demonstrate that QCFuse significantly improves the response efficiency of LLMs by 40% while maintaining equivalent accuracy compared to current methods. Additionally, in certain scenarios, QCFuse achieves an attention denoising effect that yields higher response accuracy, demonstrating substantial potential in the optimization of LLM inference.

[AI-73] CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference

[Quick Read]: This paper targets the decode-time bottleneck of long-context LLMs in agent and domain-QA scenarios, where reusable prefill prompts make attention and the KV cache the dominant costs. Existing sparse-attention methods reduce computation and transfer overhead, but at high sparsity they often fail to preserve accuracy because of the distribution shift between Queries and Keys. The key to the solution is Centroid-Scoring Attention (CSAttention), a training-free sparse-attention mechanism built on a storage-for-computation strategy: query-centric lookup tables are constructed during offline prefill, and online decoding replaces full-context scans with efficient table lookups and GPU-friendly score accumulation. With fixed-size lookup tables, this sharply reduces per-step decoding latency while retaining near-full-attention accuracy, achieving up to 4.6x inference speedup at 95% sparsity with 32K-128K long contexts.

Link: https://arxiv.org/abs/2604.08584
Authors: Chuxu Song,Zhencan Peng,Jiuqi Wei,Chuanhui Yang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-context LLMs increasingly rely on extended, reusable prefill prompts for agents and domain QA, pushing attention and KV-cache to become the dominant decode-time bottlenecks. While sparse attention reduces computation and transfer costs, it often struggles to maintain accuracy at high sparsity levels due to the inherent distribution shift between Queries and Keys. We propose Centroid-Scoring Attention (CSAttention), a training-free sparse attention method optimized for high-throughput serving of reusable contexts. CSAttention adopts a storage-for-computation strategy tailored to the offline-prefill/online-decode setting: it front-loads computation into a one-time offline prefill phase that can be amortized across multiple queries, while aggressively optimizing per-step decoding latency. Specifically, CSAttention constructs query-centric lookup tables during offline prefill, whose size remains fixed during decoding, and enables online decoding to replace full-context scans with efficient table lookups and GPU-friendly score accumulation. Extensive experiments demonstrate that CSAttention achieves near-identical accuracy to full attention. Under high sparsity (95%) and long-context settings (32K-128K), CSAttention consistently outperforms state-of-the-art sparse attention methods in both model accuracy and inference speed, achieving up to 4.6x inference speedup over the most accurate baseline at a context length of 128K.

[AI-74] Multivariate Time Series Anomaly Detection via Dual-Branch Reconstruction and Autoregressive Flow-based Residual Density Estimation

[Quick Read]: This paper tackles two key problems of mainstream reconstruction-based methods for Multivariate Time Series Anomaly Detection (MTSAD): overfitting to spurious correlations caused by an overemphasis on cross-variable modeling, and anomaly scores formed by naively summing multivariate reconstruction errors, which makes hard-to-reconstruct normal samples indistinguishable from genuine anomalies. The key to the proposed DBR-AF framework is the co-design of a Dual-Branch Reconstruction (DBR) encoder and an Autoregressive Flow (AF) module: the DBR encoder decouples cross-variable correlation learning from intra-variable statistical modeling to mitigate spurious correlations, while the AF module uses stacked invertible transformations to model the complex multivariate residual distribution and applies density estimation to correctly identify normal samples with large reconstruction errors.

Link: https://arxiv.org/abs/2604.08582
Authors: Jun Liu,Ying Chen,Ziqian Lu,Qinyue Tong,Jun Tang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures

Abstract:Multivariate Time Series Anomaly Detection (MTSAD) is critical for real-world monitoring scenarios such as industrial control and aerospace systems. Mainstream reconstruction-based anomaly detection methods suffer from two key limitations: first, overfitting to spurious correlations induced by an overemphasis on cross-variable modeling; second, the generation of misleading anomaly scores by simply summing up multivariable reconstruction errors, which makes it difficult to distinguish between hard-to-reconstruct samples and genuine anomalies. To address these issues, we propose DBR-AF, a novel framework that integrates a dual-branch reconstruction (DBR) encoder and an autoregressive flow (AF) module. The DBR encoder decouples cross-variable correlation learning and intra-variable statistical property modeling to mitigate spurious correlations, while the AF module employs multiple stacked reversible transformations to model the complex multivariate residual distribution and further leverages density estimation to accurately identify normal samples with large reconstruction errors. Extensive experiments on seven benchmark datasets demonstrate that DBR-AF achieves state-of-the-art performance, with ablation studies validating the indispensability of its core components.

[AI-75] On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment

[Quick Read]: This paper studies how independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders align across modalities, focusing on the structural similarity and orientational agreement of their representation manifolds. The key to the solution is the functional map framework from computational geometry, which models the correspondence between the two modalities' representation manifolds as a compact linear operator between graph Laplacian eigenbases, enabling a quantitative analysis of cross-modal representation compatibility. The analysis reveals that although the models converge in spectral complexity (normalized spectral distance of only 0.043), their eigenbases are badly misaligned in orientation (diagonal dominance below 0.05, orthogonality error of 70.15). The paper names this the spectral complexity-orientation gap and defines three diagnostic quantities (diagonal dominance, orthogonality deviation, and Laplacian commutativity error) to characterize the boundary of cross-modal representation compatibility.

Link: https://arxiv.org/abs/2604.08579
Authors: Krisanu Sarkar
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review at ACMMM Brave New Ideas Track

Abstract:We study cross-modal alignment between independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders using the functional map framework from computational geometry, which represents correspondence between representation manifolds as a compact linear operator between graph Laplacian eigenbases. While the framework underperforms Procrustes alignment and relative representations for cross-modal retrieval across all supervision budgets, it reveals a structural property of multimodal representations. We find that the Laplacian eigenvalue spectra of the two encoders are quantitatively similar (normalized spectral distance 0.043), indicating that independently trained models develop manifolds of comparable intrinsic complexity. However, the functional map exhibits near-zero diagonal dominance (mean below 0.05) and large orthogonality error (70.15), showing that the eigenvector bases are effectively unaligned. We term this decoupling the spectral complexity–orientation gap: models converge in how much structure they capture but not in how they organize it. This gap defines a boundary condition for spectral alignment methods and motivates three diagnostic quantities: diagonal dominance, orthogonality deviation, and Laplacian commutativity error for characterizing cross-modal representation compatibility.
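The functional-map diagnostic can be made concrete on a toy pair of orthonormal eigenbases; the 4-node bases, the in-span rotation, and the row-normalized dominance formula below are illustrative assumptions, not the paper's exact construction:

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

# Two orthonormal 2-column eigenbases on 4 nodes; phi2 is phi1 rotated by
# theta inside its own span, so the functional map C = phi2^T @ phi1 = R(-theta).
phi1 = [[0.5, 0.5], [0.5, 0.5], [0.5, -0.5], [0.5, -0.5]]

def rotate_basis(phi, theta):
    R = [[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]]
    return matmul(phi, R)

def functional_map(phi_src, phi_tgt):
    # With orthonormal bases and identity correspondence, the pseudo-inverse
    # reduces to a transpose: C = phi_tgt^T @ phi_src.
    return matmul(transpose(phi_tgt), phi_src)

def diagonal_dominance(C):
    # One plausible definition: mean over rows of |C_ii| / sum_j |C_ij|
    # (the paper's exact normalization may differ).
    return sum(abs(C[i][i]) / sum(abs(v) for v in C[i]) for i in range(len(C))) / len(C)

def orthogonality_error(C):
    # Frobenius norm of C^T C - I: zero for any pure rotation.
    CtC = matmul(transpose(C), C)
    n = len(CtC)
    return math.sqrt(sum((CtC[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(n) for j in range(n)))

C_aligned = functional_map(phi1, rotate_basis(phi1, 0.0))
C_rotated = functional_map(phi1, rotate_basis(phi1, math.pi / 3))
print(round(diagonal_dominance(C_aligned), 3), round(diagonal_dominance(C_rotated), 3))
```

A rotated basis keeps the orthogonality error near zero while collapsing diagonal dominance, which mirrors the paper's finding that the two diagnostics can decouple.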

[AI-76] Structured Exploration and Exploitation of Label Functions for Automated Data Annotation

[Quick Read]: This paper addresses the low coverage and unreliable label quality of label function (LF) generation in programmatic labeling. Existing methods either rely on large language models (LLMs) to synthesize surface-level heuristics or perform model-based synthesis over hand-crafted primitives, making it hard to achieve both diversity and reliability. The key to the proposed EXPONA framework is a systematic exploration of multi-level LFs across surface, structural, and semantic perspectives, combined with reliability-aware mechanisms that suppress noisy or redundant heuristics while preserving complementary signals, thereby balancing label coverage and accuracy and substantially improving weak-label quality and downstream performance.

Link: https://arxiv.org/abs/2604.08578
Authors: Phong Lam,Ha-Linh Nguyen,Thu-Trang Nguyen,Son Nguyen,Hieu Dinh Vo
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KBS Journal

Abstract:High-quality labeled data is critical for training reliable machine learning and deep learning models, yet manual annotation remains costly and error-prone. Programmatic labeling addresses this challenge by using label functions (LFs), i.e., heuristic rules that automatically generate weak labels for training datasets. However, existing automated LF generation methods either rely on large language models (LLMs) to synthesize surface-level heuristics or employ model-based synthesis over hand-crafted primitives. These approaches often result in limited coverage and unreliable label quality. In this paper, we introduce EXPONA, an automated framework for programmatic labeling that formulates LF generation as a principled process balancing diversity and reliability. EXPONA systematically explores multi-level LFs, spanning surface, structural, and semantic perspectives. EXPONA further applies reliability-aware mechanisms to suppress noisy or redundant heuristics while preserving complementary signals. To evaluate EXPONA, we conducted extensive experiments on eleven classification datasets across diverse domains. Experimental results show that EXPONA consistently outperformed state-of-the-art automated LF generation methods. Specifically, EXPONA achieved nearly complete label coverage (up to 98.9%), improved weak label quality by up to 87%, and yielded downstream performance gains of up to 46% in weighted F1. These results indicate that EXPONA’s combination of multi-level LF exploration and reliability-aware filtering enabled more consistent label quality and downstream performance across diverse tasks by balancing coverage and precision in the generated LF set.
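For readers unfamiliar with programmatic labeling, a label function is just a small heuristic that votes or abstains, and weak labels are aggregated from the votes. A minimal sketch (the toy LFs and the majority-vote aggregation are illustrative; EXPONA explores surface, structural, and semantic LFs and uses reliability-aware aggregation instead):

```python
# Toy label functions (LFs): each returns +1, -1, or 0 (abstain). Weak labels
# come from a majority vote; coverage is the fraction of examples that
# received at least one non-abstaining vote.
def lf_contains_good(text):      # surface-level heuristic
    return 1 if "good" in text else 0

def lf_contains_bad(text):       # surface-level heuristic
    return -1 if "bad" in text else 0

def lf_exclaim(text):            # structural heuristic
    return 1 if text.endswith("!") else 0

LFS = [lf_contains_good, lf_contains_bad, lf_exclaim]

def weak_label(text):
    s = sum(lf(text) for lf in LFS)
    return 1 if s > 0 else (-1 if s < 0 else 0)

docs = ["good movie!", "bad plot", "so-so film", "really good"]
labels = [weak_label(d) for d in docs]
coverage = sum(1 for l in labels if l != 0) / len(docs)
print(labels, coverage)  # the "so-so" example stays unlabeled, lowering coverage
```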

[AI-77] Distributionally Robust Token Optimization in RLHF

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在面对提示(prompt)微小变化时出现的鲁棒性不足问题,尤其是在多步推理任务中,即使输入格式、措辞或语言略有差异,也可能导致模型输出显著失效。解决方案的关键在于提出一种分布鲁棒令牌优化(Distributionally Robust Token Optimization, DRTO)方法,该方法融合了基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)与分布鲁棒优化(Distributionally Robust Optimization, DRO),通过构建损失小批量上的f-散度模糊集来约束最坏情况下的令牌级奖励,从而在理论上保障模型的鲁棒性;实证结果表明,DRTO在数学推理基准测试中提升了模型的一致性,在GSM8K和MathQA上分别取得9.17%和2.49%的性能提升。

链接: https://arxiv.org/abs/2604.08577
作者: Yeping Jin,Jiaming Hu,Ioannis Ch. Paschalidis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) tend to respond correctly to prompts that align to the data they were trained and fine-tuned on. Yet, small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO bounds worst-case token-wise rewards by constructing an f-divergence ambiguity set over a loss minibatch, yielding theoretical robustness guarantees. Empirically, DRTO enhances consistency under distribution shifts in mathematical reasoning benchmarks, achieving 9.17% improvement on GSM8K and 2.49% improvement on MathQA.
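The f-divergence ambiguity set in DRTO has, in its KL special case, a closed-form inner maximizer: the worst-case distribution exponentially tilts the nominal weights toward high-loss items. A sketch under assumed values (the minibatch losses and the fixed temperature are made up; DRTO bounds the worst case over the whole ambiguity set rather than fixing a multiplier):

```python
import math

losses = [0.2, 0.5, 1.0, 3.0]               # toy token-level losses in a minibatch
p = [1.0 / len(losses)] * len(losses)       # nominal (uniform) weights

def kl_tilted_weights(losses, temperature):
    # Worst-case distribution in a KL ball around p has the closed form
    # q_i ∝ p_i * exp(loss_i / temperature) (dual of the DRO inner max).
    w = [pi * math.exp(l / temperature) for pi, l in zip(p, losses)]
    z = sum(w)
    return [wi / z for wi in w]

q = kl_tilted_weights(losses, temperature=1.0)
mean_loss = sum(pi * l for pi, l in zip(p, losses))
robust_loss = sum(qi * l for qi, l in zip(q, losses))
print(round(mean_loss, 3), round(robust_loss, 3))  # robust loss upweights the 3.0 outlier
```

Optimizing the robust loss rather than the mean loss is what forces the policy to improve on the hardest tokens instead of the average case.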

[AI-78] GAN-Enhanced Deep Reinforcement Learning for Semantic-Aware Resource Allocation in 6G Network Slicing

[Quick Read]: This paper addresses three challenges in resource allocation for sixth-generation (6G) wireless networks: semantic blindness that wastes 35% of bandwidth, discrete action quantization, and limited training diversity. The core of the solution is the GAN-DDPG framework, which combines conditional GANs for traffic synthesis to enrich training diversity, a continuous-action Deep Deterministic Policy Gradient (DDPG) algorithm for finer-grained decisions, and semantic-aware reward optimization to reduce redundant data transmission, yielding significant gains in spectral efficiency and reliability across URLLC, eMBB, and mMTC services.

Link: https://arxiv.org/abs/2604.08576
Authors: Daniel Benniah John
Institution: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 8 figures. Under review. Simulation-based evaluation for 6G network slicing

Abstract:Sixth-generation (6G) wireless networks must support heterogeneous services: enhanced Mobile Broadband (eMBB) requiring 1 Tbps data rates, massive Machine-Type Communications (mMTC) supporting 10 million devices per km², and Ultra-Reliable Low-Latency Communications (URLLC) with 0.1-1 ms latency. Current resource allocation suffers from three limitations: (1) semantic blindness wasting 35% of bandwidth on redundant data, (2) discrete action quantization, and (3) limited training diversity. This paper proposes GAN-DDPG, a Generative Adversarial Network-enhanced Deep Deterministic Policy Gradient framework integrating conditional GANs for traffic synthesis, continuous-action DDPG, and semantic-aware reward optimization. Extensive simulations with statistical validation demonstrate significant improvements: 22% URLLC, 20% eMBB, and 25% mMTC spectral efficiency gains (all p < 0.001) compared to baseline DDPG, with 18% latency and 31% packet loss reductions.

[AI-79] MolPaQ: Modular Quantum-Classical Patch Learning for Interpretable Molecular Generation

[Quick Read]: This paper addresses the difficulty of jointly optimizing validity, diversity, and property control in molecular generative models, where existing approaches typically trade off among these objectives. The key to the solution is MOLPAQ, a modular hybrid quantum-classical generative framework: a β-VAE pretrained on QM9 learns a chemically aligned latent space; a reduced conditioner maps molecular descriptors into that space; a parameter-efficient quantum patch generator then produces entangled node embeddings, which a valence-aware aggregator reconstructs into valid molecular graphs. Combined with adversarial fine-tuning and a chemistry-shaped reward, the design achieves 100% RDKit validity, 99.75% novelty, and 0.905 diversity, while improving mean QED by about 2.3% and increasing aromatic motif incidence by about 10-12%, highlighting the quantum generator's role as a compact topology-shaping operator.

Link: https://arxiv.org/abs/2604.08575
Authors: Syed Rameez Naqvi,Lu Peng
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Molecular generative models must jointly ensure validity, diversity, and property control, yet existing approaches typically trade off among these objectives. We present MOLPAQ, a modular quantum-classical generator that assembles molecules from quantum-generated latent patches. A β-VAE pretrained on QM9 learns a chemically aligned latent manifold; a reduced conditioner maps molecular descriptors into this space; and a parameter-efficient quantum patch generator produces entangled node embeddings that a valence-aware aggregator reconstructs into valid molecular graphs. Adversarial fine-tuning with a latent critic and chemistry-shaped reward yields 100% RDKit validity, 99.75% novelty, and 0.905 diversity. Beyond aggregate metrics, the pretrained quantum generator, steered by the conditioner, improves mean QED by approx. 2.3% and increases aromatic motif incidence by approx. 10-12% relative to a parameter-matched classical generator, highlighting its role as a compact topology-shaping operator.

[AI-80] Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching ICLR2026

[Quick Read]: This paper addresses the problem that large genomic foundation models are too parameter-heavy and compute-hungry to deploy in resource-constrained settings. The key to the solution is an embedding-level knowledge-distillation framework that transfers mRNA representations from a state-of-the-art genomic foundation model into a small specialized model, shrinking the model 200-fold while retaining strong performance; embedding-level distillation also outperforms conventional logit-based distillation, which the authors found unstable. This strategy offers an efficient, scalable path for biological sequence modeling, particularly for genomics research with limited compute.

Link: https://arxiv.org/abs/2604.08574
Authors: Rasched Haidari,Sam Martin,Maxime Allard
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the Tiny Papers Track for the Machine Learning for Genomics Explorations Workshop at ICLR 2026 and the Gen2 Workshop at ICLR 2026

Abstract:Large Genomic Foundation Models have recently achieved remarkable results and in-vivo translation capabilities. However, these models quickly grow to several billion parameters and are expensive to run when compute is limited. To overcome this challenge, we present a distillation framework for transferring mRNA representations from a state-of-the-art genomic foundation model into a much smaller model specialized for mRNA sequences, reducing the size by 200-fold. Embedding-level distillation worked better than logit-based methods, which we found unstable. Benchmarking on mRNA-bench demonstrates that the distilled model achieves state-of-the-art performance among models of comparable size and competes with larger architectures for mRNA-related tasks. Our results highlight embedding-based distillation of mRNA sequences as an effective training strategy for biological foundation models. This enables similarly efficient and scalable sequence modelling in genomics, particularly when large models are computationally challenging or infeasible.
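Embedding-level distillation reduces to regressing student embeddings onto frozen teacher embeddings. A deliberately tiny sketch (one scalar parameter standing in for the student encoder; the teacher, the data, and the learning rate are all made up for illustration):

```python
import random

random.seed(0)

# Toy embedding-level distillation: a 1-parameter "student" s(x) = w * x is
# trained to match a frozen "teacher" embedding t(x) = 3 * x by minimizing
# the mean squared error between embeddings (no logits involved).
def teacher(x):
    return 3.0 * x

w = 0.0
lr = 0.1
for _ in range(500):
    x = random.uniform(-1.0, 1.0)   # stand-in for an mRNA sequence input
    err = w * x - teacher(x)        # embedding mismatch
    w -= lr * err * x               # SGD step on 0.5 * err^2

print(round(w, 2))  # converges toward the teacher's scale factor 3.0
```

The real framework matches high-dimensional embedding vectors rather than scalars, but the objective is the same regression between student and teacher representations.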

[AI-81] QuanBench: A Unified Multi-Framework Benchmark for LLM -Based Quantum Code Generation ICLR2026

[Quick Read]: This paper addresses the framework dependence of current quantum code generation evaluation: existing studies are mostly confined to a single quantum computing framework (such as Qiskit, PennyLane, or Cirq), making it hard to separate a model's quantum reasoning ability from its familiarity with a specific framework. The authors propose QuanBench+, a unified cross-framework benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithm design, gate decomposition, and state preparation. The key innovations are evaluating generated code with executable functional tests, reporting Pass@1 and Pass@5 with KL-divergence-based acceptance for probabilistic outputs, and a feedback-based repair setting in which models may revise code after runtime errors or wrong answers. Experiments show that despite limited one-shot performance (at most 59.5%), accuracy rises markedly to 83.3% with feedback-based repair, indicating that multi-framework quantum code generation remains challenging and strongly dependent on framework-specific knowledge.

Link: https://arxiv.org/abs/2604.08570
Authors: Ali Slim,Haydar Hamieh,Jawad Kotaich,Yehya Ghosn,Mahdi Chehimi,Ammar Mohanna,Hasan Abed Al Kader Hammoud,Bernard Ghanem
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE); Quantum Physics (quant-ph)
Comments: 24 pages total, 25 figures, 5 tables, including supplementary material. Accepted to the ICLR 2026 Workshop on I Can't Believe It's Not Better

Abstract:Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
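The KL-divergence-based acceptance for probabilistic outputs can be sketched as follows; the Bell-state reference, the noisy frequencies, the smoothing constant, and the 0.1 threshold are all illustrative assumptions, not QuanBench+'s actual values:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) with additive smoothing so zero-probability bins are safe.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

reference = {"00": 0.5, "01": 0.0, "10": 0.0, "11": 0.5}      # ideal Bell-state distribution
measured  = {"00": 0.48, "01": 0.02, "10": 0.01, "11": 0.49}  # noisy empirical frequencies

keys = sorted(reference)
d = kl_divergence([reference[k] for k in keys], [measured[k] for k in keys])
accepted = d < 0.1   # acceptance threshold is an assumption, not the benchmark's value
print(round(d, 4), accepted)
```

Comparing distributions rather than single bitstrings is what lets a benchmark grade circuits whose correct output is inherently probabilistic.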

[AI-82] Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

[Quick Read]: This paper addresses the problem that biomedical metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. Although reporting guidelines exist, they typically lack machine-actionable representations and so resist automation. The key to the solution is an LLM-based metadata standardization system that queries authoritative biomedical terminology (ontology) services in real time to retrieve canonically correct vocabulary terms on demand, improving the accuracy of the model's outputs. Unlike prior approaches that treat constraints as static prompts, this scheme integrates external knowledge sources as tools in the LLM's reasoning loop, consistently improving prediction accuracy on both ontology-constrained and non-ontology-constrained fields and offering a practical, scalable path to automated production of FAIR-compliant metadata.

Link: https://arxiv.org/abs/2604.08552
Authors: Josef Hardi,Martin J. O'Connor,Marcos Martinez-Romero,Jean G. Rosario,Stephen A. Fisher,Mark A. Musen
Institution: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. When reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model’s training knowledge alone. We present an LLM-based metadata standardization system that queries authoritative biomedical terminology services in real time to retrieve canonically correct vocabulary terms on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical, scalable approach to automated standardization of biomedical metadata.

[AI-83] On Divergence Measures for Training GFlowNets NEURIPS2024

[Quick Read]: This paper addresses the estimation bias and high variance that arise when Generative Flow Networks (GFlowNets) are trained by directly minimizing the standard Kullback-Leibler (KL) divergence. Traditional training minimizes the expected log-squared difference between a proposal (forward policy) and a target (backward policy) distribution to enforce flow matching; although closely related to variational inference (VI), this objective can be unstable in practice. The authors systematically review four divergence measures - Rényi-α, Tsallis-α, reverse KL, and forward KL - and design statistically efficient stochastic-gradient estimators for GFlowNet training; control variates based on REINFORCE leave-one-out and score-matching estimators further reduce the variance of the learning objectives' gradients. The key contribution is lifting GFlowNet training from an empirical flow-matching scheme to a theoretically grounded divergence-minimization framework, yielding provably correct and faster-converging optimization and paving the way for algorithm design informed by the divergence-minimization viewpoint.

Link: https://arxiv.org/abs/2410.09355
Authors: Tiago da Silva,Eliezer de Souza da Silva,Diego Mesquita
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at NeurIPS 2024, this https URL

Abstract:Generative Flow Networks (GFlowNets) are amortized inference models designed to sample from unnormalized distributions over composable objects, with applications in generative modeling for tasks in fields such as causal discovery, NLP, and drug discovery. Traditionally, the training procedure for GFlowNets seeks to minimize the expected log-squared difference between a proposal (forward policy) and a target (backward policy) distribution, which enforces certain flow-matching conditions. While this training procedure is closely related to variational inference (VI), directly attempting standard Kullback-Leibler (KL) divergence minimization can lead to provably biased and potentially high-variance estimators. Therefore, we first review four divergence measures, namely Rényi-α, Tsallis-α, reverse KL, and forward KL, and design statistically efficient estimators for their stochastic gradients in the context of training GFlowNets. Then, we verify that properly minimizing these divergences yields a provably correct and empirically effective training scheme, often leading to significantly faster convergence than previously proposed optimization. To achieve this, we design control variates based on the REINFORCE leave-one-out and score-matching estimators to reduce the variance of the learning objectives' gradients. Our work contributes by narrowing the gap between GFlowNets training and generalized variational approximations, paving the way for algorithmic ideas informed by the divergence minimization viewpoint.
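The REINFORCE leave-one-out control variate the abstract mentions is easy to demonstrate on synthetic data: subtracting each sample's batch-mates' mean reward keeps the estimator unbiased while removing the reward's mean from the variance. A minimal sketch (toy rewards and score-function values, not GFlowNet trajectories):

```python
import random
import statistics

random.seed(1)

def loo_baseline_terms(rewards, scores):
    # REINFORCE leave-one-out: each sample's baseline is the mean reward of
    # the other samples in the batch, which preserves unbiasedness.
    n = len(rewards)
    out = []
    for i in range(n):
        baseline = (sum(rewards) - rewards[i]) / (n - 1)
        out.append((rewards[i] - baseline) * scores[i])
    return out

plain, loo = [], []
for _ in range(2000):
    rewards = [random.gauss(5.0, 1.0) for _ in range(8)]  # high-mean toy rewards
    scores = [random.gauss(0.0, 1.0) for _ in range(8)]   # toy score-function values
    plain.append(sum(r * s for r, s in zip(rewards, scores)) / 8)
    loo.append(sum(loo_baseline_terms(rewards, scores)) / 8)

# Both estimators share the same mean; the LOO variant has far lower variance
# because the reward mean (5.0) no longer multiplies the score noise.
print(statistics.variance(plain) > statistics.variance(loo))
```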

[AI-84] Physics-guided surrogate learning enables zero-shot control of turbulent wings

[Quick Read]: This paper addresses the high computational cost and poor transferability of turbulent boundary-layer control when applied to realistic wing geometries, especially under adverse pressure gradients with complex multiscale dynamics. The key to the solution is to exploit the local structure of wall-bounded turbulence: reinforcement-learning policies are trained in turbulent channel flows matched to the wing's boundary-layer statistics and then deployed directly on a NACA4412 wing for zero-shot control, cutting training cost by four orders of magnitude while achieving a 28.7% reduction in skin-friction drag and a 10.7% reduction in total drag without any additional training.

Link: https://arxiv.org/abs/2604.09434
Authors: Yuning Wang,Pol Suarez,Mathis Bode,Ricardo Vinuesa
Institution: Unknown
Subjects: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
Comments:

Abstract:Turbulent boundary layers over aerodynamic surfaces are a major source of aircraft drag, yet their control remains challenging due to multiscale dynamics and spatial variability, particularly under adverse pressure gradients. Reinforcement learning has outperformed state-of-the-art strategies in canonical flows, but its application to realistic geometries is limited by computational cost and transferability. Here we show that these limitations can be overcome by exploiting local structures of wall-bounded turbulence. Policies are trained in turbulent channel flows matched to wing boundary-layer statistics and deployed directly onto a NACA4412 wing at Re_c = 2×10^5 without further training, i.e., so-called zero-shot control. This achieves a 28.7% reduction in skin-friction drag and a 10.7% reduction in total drag, outperforming state-of-the-art opposition control by 40% in friction drag reduction and 5% in total drag. Training cost is reduced by four orders of magnitude relative to on-wing training, enabling scalable flow control.

[AI-85] SatQNet: Satellite-assisted Quantum Network Entanglement Routing Using Directed Line Graph Neural Networks

[Quick Read]: This paper addresses entanglement routing in satellite-assisted quantum networks, where satellite motion, stochastic link generation, and classical control-plane delays create a highly dynamic topology: existing methods either rely on global topology information that quickly becomes stale or make decisions from incomplete local information, hindering efficient, high-fidelity end-to-end entanglement distribution. The key to the solution is SatQNet, a reinforcement-learning routing framework that can be decentralized at runtime. Its core innovation is an edge-centric directed line graph neural network that performs local message passing on directed edge embeddings, effectively capturing link properties in high-degree and time-varying topologies; each repeater learns a local graph representation at runtime and uses it to cooperatively establish high-fidelity entanglement. SatQNet outperforms heuristic and learning-based approaches across diverse settings, including a real-world European backbone topology, and generalizes to unseen topologies without retraining.

Link: https://arxiv.org/abs/2604.09306
Authors: Tobias Meuser,Jannis Weil,Aninda Lahiri,Marius Paraschiv
Institution: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Quantum networks are expected to become a key enabler for interconnecting quantum devices. In contrast to classical communication networks, however, information transfer in quantum networks is usually restricted to short distances due to physical constraints of entanglement distribution. Satellites can extend entanglement distribution over long distances, but routing in such networks is challenging because satellite motion and stochastic link generation create a highly dynamic quantum topology. Existing routing methods often rely on global topology information that quickly becomes outdated due to delays in the classical control plane, while decentralized methods typically act on incomplete local information. We propose SatQNet, a reinforcement learning approach for entanglement routing in satellite-assisted quantum networks that can be decentralized at runtime. Its key innovation is an edge-centric directed line graph neural network that performs local message passing on directed edge embeddings, enabling it to better capture link properties in high-degree and time-varying topologies. By exchanging messages with neighboring repeaters, SatQNet learns a local graph representation at runtime that supports agents in establishing high-fidelity end-to-end entanglements. Trained on random graphs, SatQNet outperforms heuristic and learning-based approaches across diverse settings, including a real-world European backbone topology, and generalizes to unseen topologies without retraining.

[AI-86] PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing ICPR2026

[Quick Read]: This paper addresses synchronization challenges in automated dubbing (AD), in particular isochrony (duration matching) and lip synchronization (lip-sync), which directly affect viewer immersion. The key to the solution is a two-step synchronization method: first, translated text is paraphrased by a language model to satisfy duration constraints; second, Phonetic Synchronization (PS) applies Dynamic Time Warping (DTW) with local costs given by vowel distances learned from training data, so that the target speech's vowels are pronounced as closely as possible to the source vowels. Building on this, the proposed PS-Comet jointly optimizes semantic and phonetic similarity, improving lip-sync accuracy while preserving meaning. Experiments show that PS-Comet TTS outperforms conventional TTS systems and human voice actors across language pairs including Korean, English, and French, achieving the best balance between accurate lip-sync and semantic preservation.

Link: https://arxiv.org/abs/2604.09111
Authors: Changi Hong,Yoonah Song,Hwayoung Park,Chaewoon Bang,Dayeon Gu,Do Hyun Lee,Hong Kook Kim
Institution: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Accepted to ICPR 2026

Abstract:Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, natural AD still faces synchronization challenges such as duration and lip-synchronization (lip-sync), which are crucial for preserving the viewer experience. Therefore, this paper proposes a synchronization method for AD processes that paraphrases translated text, comprising two steps: isochrony for timing constraints and phonetic synchronization (PS) to preserve lip-sync. First, we achieve isochrony by paraphrasing the translated text with a language model, ensuring the target speech duration matches that of the source speech. Second, we introduce PS, which employs dynamic time warping (DTW) with local costs of vowel distances measured from training data so that the target text composes vowels with pronunciations similar to source vowels. Third, we extend this approach to PS-Comet, which jointly considers semantic and phonetic similarity to preserve meaning better. The proposed methods are incorporated into text-to-speech systems, PS-TTS and PS-Comet TTS. The performance evaluation using Korean and English lip-reading datasets and a voice-actor dubbing dataset demonstrates that both systems outperform TTS without PS on several objective metrics and outperform voice actors in Korean-to-English and English-to-Korean dubbing. We extend the experiments to French, testing all pairs among these languages to evaluate cross-linguistic applicability. Across all language pairs, PS-Comet performed best, balancing lip-sync accuracy with semantic preservation, confirming that it achieves more accurate lip-sync with semantic preservation than PS alone.
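The DTW-with-vowel-distances step of PS can be sketched directly; the vowel-distance table below is a hypothetical stand-in for the distances the paper measures from training data:

```python
# Toy DTW alignment between source and target vowel sequences, with a local
# cost from a (hypothetical) vowel-distance table; PS learns these distances
# from training data rather than hard-coding them.
VOWEL_DIST = {
    ("a", "a"): 0.0, ("a", "o"): 0.3, ("a", "i"): 0.9,
    ("o", "o"): 0.0, ("o", "i"): 0.8, ("i", "i"): 0.0,
}

def vowel_cost(u, v):
    return VOWEL_DIST.get((u, v), VOWEL_DIST.get((v, u), 1.0))

def dtw(src, tgt):
    n, m = len(src), len(tgt)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = vowel_cost(src[i - 1], tgt[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]  # total alignment cost; lower means better phonetic match

src = ["a", "o", "i"]
close = dtw(src, ["a", "o", "o", "i"])   # near-matching vowels, despite an insertion
far = dtw(src, ["i", "i", "a"])          # poorly matching vowels
print(close, far)
```

A paraphrase candidate with a lower DTW cost against the source vowel sequence is preferred, which is how the alignment score steers text selection toward lip-sync-friendly wordings.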

Machine Learning

[LG-0] ANTIC: Adaptive Neural Temporal In-situ Compressor

Link: https://arxiv.org/abs/2604.09543
Authors: Sandeep S. Cranganore,Andrei Bodnar,Gianluca Galleti,Fabian Paischer,Johannes Brandstetter
Subjects: Machine Learning (cs.LG)
Comments: 31 pages, 19 figures, 9 tables

Abstract:The persistent storage requirements for high-resolution, spatiotemporally evolving fields governed by large-scale and high-dimensional partial differential equations (PDEs) have reached the petabyte-to-exabyte scale. Transient simulations modeling Navier-Stokes equations, magnetohydrodynamics, plasma physics, or binary black hole mergers generate data volumes that are prohibitive for modern high-performance computing (HPC) infrastructures. To address this bottleneck, we introduce ANTIC (Adaptive Neural Temporal in situ Compressor), an end-to-end in situ compression pipeline. ANTIC consists of an adaptive temporal selector tailored to high-dimensional physics that identifies and filters informative snapshots at simulation time, combined with a spatial neural compression module based on continual fine-tuning that learns residual updates between adjacent snapshots using neural fields. By operating in a single streaming pass, ANTIC enables a combined compression of temporal and spatial components and effectively alleviates the need for explicit on-disk storage of entire time-evolved trajectories. Experimental results demonstrate how storage reductions of several orders of magnitude relate to physics accuracy.

[LG-1] Toward World Models for Epidemiology

Link: https://arxiv.org/abs/2604.09519
Authors: Zeeshan Memon,Yiqi Su,Christo Kurisummoottil Thomas,Walid Saad,Liang Zhao,Naren Ramakrishnan
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:World models have emerged as a unifying paradigm for learning latent dynamics, simulating counterfactual futures, and supporting planning under uncertainty. In this paper, we argue that computational epidemiology is a natural and underdeveloped setting for world models. This is because epidemic decision-making requires reasoning about latent disease burden, imperfect and policy-dependent surveillance signals, and intervention effects are mediated by adaptive human behavior. We introduce a conceptual framework for epidemiological world models, formulating epidemics as controlled, partially observed dynamical systems in which (i) the true epidemic state is latent, (ii) observations are noisy and endogenous to policy, and (iii) interventions act as sequential actions whose effects propagate through behavioral and social feedback. We present three case studies that illustrate why explicit world modeling is necessary for policy-relevant reasoning: strategic misreporting in behavioral surveillance, systematic delays in time-lagged signals such as hospitalizations and deaths, and counterfactual intervention analysis where identical histories diverge under alternative action sequences.

[LG-2] Integrated electro-optic attention nonlinearities for transformers

链接: https://arxiv.org/abs/2604.09512
作者: Luis Mickeler,Kai Lion,Alfonso Nardi,Jost Kellner,Pierre Didier,Bhavin J. Shastri,Niao He,Rachel Grange
类目: Machine Learning (cs.LG); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:Transformers have emerged as the dominant neural-network architecture, achieving state-of-the-art performance in language processing and computer vision. At the core of these models lies the attention mechanism, which requires a nonlinear, non-negative mapping using the Softmax function. However, although Softmax operations account for less than 1% of the total operation count, they can disproportionately bottleneck overall inference latency. Here, we use thin-film lithium niobate (TFLN) Mach-Zehnder modulators (MZMs) as analog nonlinear computational elements to drastically reduce the latency of nonlinear computations. We implement electro-optic alternatives to digital Softmax and Sigmoid, and evaluate their performance in Vision Transformers and Large Language Models. Our system maintains highly competitive accuracy, even under aggressive 4-bit input-output quantization of the analog units. We further characterize system noise at encoding speeds up to 10 GBaud and assess model robustness under various noise conditions. Our findings suggest that TFLN modulators can serve as nonlinear function units within hybrid co-packaged hardware, enabling high-speed and energy-efficient nonlinear computation.

[LG-3] Sim-to-Real Transfer for Muscle-Actuated Robots via Generalized Actuator Networks

链接: https://arxiv.org/abs/2604.09487
作者: Jan Schneider,Mridul Mahajan,Le Chen,Simon Guist,Bernhard Schölkopf,Ingmar Posner,Dieter Büchler
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tendon drives paired with soft muscle actuation enable faster and safer robots while potentially accelerating skill acquisition. Still, these systems are rarely used in practice due to inherent nonlinearities, friction, and hysteresis, which complicate modeling and control. So far, these challenges have hindered policy transfer from simulation to real systems. To bridge this gap, we propose a sim-to-real pipeline that learns a neural network model of this complex actuation and leverages established rigid body simulation for the arm dynamics and interactions with the environment. Our method, called Generalized Actuator Network (GeAN), enables actuation model identification across a wide range of robots by learning directly from joint position trajectories rather than requiring torque sensors. Using GeAN on PAMY2, a tendon-driven robot powered by pneumatic artificial muscles, we successfully deploy precise goal-reaching and dynamic ball-in-a-cup policies trained entirely in simulation. To the best of our knowledge, this result constitutes the first successful sim-to-real transfer for a four-degrees-of-freedom muscle-actuated robot arm.

[LG-4] AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning

链接: https://arxiv.org/abs/2604.09437
作者: Ioannis Tsingalis,Constantine Kotropoulos,Corentin Briat
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A novel regularization technique, AdaCubic, is proposed that adapts the weight of the cubic term. The heart of AdaCubic is an auxiliary optimization problem with cubic constraints that dynamically adjusts the weight of the cubic term in Newton’s cubic regularized method. We use Hutchinson’s method to approximate the Hessian matrix, thereby reducing computational cost. We demonstrate that AdaCubic inherits the cubically regularized Newton method’s local convergence guarantees. Our experiments in Computer Vision, Natural Language Processing, and Signal Processing tasks demonstrate that AdaCubic outperforms or competes with several widely used optimizers. Unlike other adaptive algorithms that require hyperparameter fine-tuning, AdaCubic is evaluated with a fixed set of hyperparameters, rendering it a highly attractive optimizer in settings where fine-tuning is infeasible. This makes AdaCubic an attractive option for researchers and practitioners alike. To our knowledge, AdaCubic is the first optimizer to leverage cubic regularization in scalable deep learning applications.
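
The Hutchinson step the abstract mentions is easy to illustrate in isolation. A minimal NumPy sketch (not the authors' code; the toy matrix, probe distribution, and sample count are illustrative) of the classic trace estimator E[z^T H z] = tr(H) with Rademacher probes, which needs only Hessian-vector products rather than the full Hessian:

```python
import numpy as np

def hutchinson_trace(matvec, dim, n_samples=2000, seed=0):
    """Estimate tr(H) via E[z^T H z] = tr(H) with Rademacher probes z,
    using only Hessian-vector products (no explicit Hessian needed)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe vector
        total += z @ matvec(z)
    return total / n_samples

# Toy symmetric "Hessian" with trace 9
H = np.array([[2.0, 0.5, 0.0],
              [0.5, 3.0, 0.5],
              [0.0, 0.5, 4.0]])
est = hutchinson_trace(lambda v: H @ v, dim=3)
```

In AdaCubic the same trick would feed a cheap curvature estimate into the cubic-regularized Newton step; the sample count trades cost against estimator variance.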

[LG-5] Offline Local Search for Online Stochastic Bandits

链接: https://arxiv.org/abs/2604.09423
作者: Gerdus Benadè,Rathish Das,Thomas Lavastida
类目: Machine Learning (cs.LG)
*备注: Part of this work has been accepted at ACM SIGMETRICS 2026

点击查看摘要

Abstract:Combinatorial multi-armed bandits provide a fundamental online decision-making environment where a decision-maker interacts with an environment across $T$ time steps, each time selecting an action and learning the cost of that action. The goal is to minimize regret, defined as the loss compared to the optimal fixed action in hindsight under full-information. There has been substantial interest in leveraging what is known about offline algorithm design in this online setting. Offline greedy and linear optimization algorithms (both exact and approximate) have been shown to provide useful guarantees when deployed online. We investigate local search methods, a broad class of algorithms used widely in both theory and practice, which have thus far been under-explored in this context. We focus on problems where offline local search terminates in an approximately optimal solution and give a generic method for converting such an offline algorithm into an online stochastic combinatorial bandit algorithm with $O(\log^3 T)$ (approximate) regret. In contrast, existing offline-to-online frameworks yield regret (and approximate regret) which depend sub-linearly, but polynomially on $T$. We demonstrate the flexibility of our framework by applying it to three online stochastic combinatorial optimization problems: scheduling to minimize total completion time, finding a minimum cost base of a matroid and uncertain clustering.
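
As a toy instance of "offline local search terminating in an (approximately) optimal solution", consider the paper's first application, scheduling to minimize total completion time. A minimal sketch (illustrative, not the paper's algorithm): adjacent-swap local search, whose local optima are exactly the shortest-processing-time (SPT) orders and hence globally optimal for this problem:

```python
def total_completion_time(order, p):
    """Sum of job completion times for a given processing order."""
    t = total = 0
    for j in order:
        t += p[j]
        total += t
    return total

def local_search_schedule(p):
    """Adjacent-swap local search for 1 || sum C_j: keep swapping
    neighbouring jobs while it lowers total completion time.
    A swap of jobs with p[i] > p[i+1] always improves, so local
    optima are exactly the SPT orders."""
    order = list(range(len(p)))
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            cand = order[:i] + [order[i + 1], order[i]] + order[i + 2:]
            if total_completion_time(cand, p) < total_completion_time(order, p):
                order, improved = cand, True
    return order

p = [4, 1, 3, 2]
sched = local_search_schedule(p)
```

The framework's contribution is then to run such an offline routine on empirical cost estimates inside a bandit loop, which the sketch above does not attempt.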

[LG-6] NOMAD: Generating Embeddings for Massive Distributed Graphs

链接: https://arxiv.org/abs/2604.09419
作者: Aishwarya Sarkar,Sayan Ghosh,Nathan R. Tallent,Ali Jannesari
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Successful machine learning on graphs or networks requires embeddings that not only represent nodes and edges as low-dimensional vectors but also preserve the graph structure. Established methods for generating embeddings require flexible exploration of the entire graph through repeated use of random walks that capture graph structure with samples of nodes and edges. These methods create scalability challenges for massive graphs with millions-to-billions of edges because single-node solutions have inadequate memory and processing capabilities. We present NOMAD, a distributed-memory graph embedding framework using the Message Passing Interface (MPI) for distributed graphs. NOMAD implements proximity-based models proposed in the widely popular LINE (Large-scale Information Network Embedding) algorithm. We propose several practical trade-offs to improve the scalability and communication overheads confronted by irregular and distributed graph embedding methods, catering to massive-scale graphs arising in web and science domains. NOMAD demonstrates median speedups of 10/100x on CPU-based NERSC Perlmutter cluster relative to the popular reference implementations of multi-threaded LINE and node2vec, 35-76x over distributed PBG, and competitive embedding quality relative to LINE, node2vec, and GraphVite, while yielding 12-370x end-to-end speedups on real-world graphs.
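
NOMAD implements the proximity models of LINE; a single-process toy version of first-order LINE with negative sampling (hyperparameters and the tiny graph are illustrative, and this omits all of NOMAD's MPI distribution logic) looks like:

```python
import numpy as np

def line_first_order(edges, n_nodes, dim=8, lr=0.05, epochs=200, neg=2, seed=0):
    """Toy first-order LINE: raise sigmoid(u_i . u_j) for linked pairs,
    lower it for randomly drawn negative pairs (negative sampling)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_nodes, dim))
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        for i, j in edges:
            g = 1.0 - sig(U[i] @ U[j])          # attract linked nodes
            ui = U[i].copy()
            U[i] += lr * g * U[j]
            U[j] += lr * g * ui
            for _ in range(neg):
                k = int(rng.integers(n_nodes))  # random negative node
                g = -sig(U[i] @ U[k])           # repel the negative pair
                uk = U[k].copy()
                U[k] += lr * g * U[i]
                U[i] += lr * g * uk
    return U

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]  # two triangles
U = line_first_order(edges, n_nodes=6)
```

After training, nodes within the same triangle end up with more similar embeddings than nodes from different triangles, which is the structure-preservation property the abstract refers to.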

[LG-7] OASIS: Online Activation Subspace Learning for Memory-Efficient Training

链接: https://arxiv.org/abs/2604.09406
作者: Sakshi Choudhary,Utkarsh Saxena,Kaushik Roy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training large language models (LLMs) is constrained by memory requirements, with activations accounting for a substantial fraction of the total footprint. Existing approaches reduce memory using low-rank weight parameterizations or low-rank gradient subspaces for optimizer states, while activation memory is addressed through architectural modifications or compression schemes based on periodically updated projections. We propose OASIS, an online activation subspace learning algorithm for memory-efficient training that tracks and continuously updates a low-dimensional activation subspace during training. Intermediate activations are projected onto this evolving subspace, reducing memory without modifying forward-pass computations. The evolving activation subspace induces low-rank gradient representations, enabling both gradients and optimizer states to be maintained directly in this subspace, while a projection-aware optimizer consistently transports optimizer states across subspace updates for stable training. Across various finetuning and pretraining tasks, OASIS achieves up to 2x lower peak memory than full fine-tuning while matching its performance and outperforming prior low-rank methods.
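
The core memory trick can be shown with a plain SVD (a static stand-in for OASIS's online, continuously updated subspace; the sizes and rank below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
r, batch, feat = 4, 64, 256
# Toy activations with exact rank-r structure (batch x features)
A = rng.standard_normal((batch, r)) @ rng.standard_normal((r, feat))

# Learn an r-dimensional activation subspace from the data itself
_, _, Vt = np.linalg.svd(A, full_matrices=False)
P = Vt[:r]                           # (r, feat) orthonormal subspace basis

A_small = A @ P.T                    # store only (batch, r) coefficients
A_rec = A_small @ P                  # expand again when gradients need it

compression = A.size / A_small.size  # feat / r = 64x fewer stored values
err = np.linalg.norm(A - A_rec) / np.linalg.norm(A)
```

Real activations are only approximately low-rank, so OASIS must track the subspace as it drifts during training; here the data is exactly rank r, so the reconstruction is exact up to floating point.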

[LG-8] Stochastic-Dimension Frozen Sampled Neural Network for High-Dimensional Gross-Pitaevskii Equations on Unbounded Domains

链接: https://arxiv.org/abs/2604.09361
作者: Zhangyong Liang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose a stochastic-dimension frozen sampled neural network (SD-FSNN) for solving a class of high-dimensional Gross-Pitaevskii equations (GPEs) on unbounded domains. SD-FSNN is unbiased across all dimensions, and its computational cost is independent of the dimension, avoiding the exponential growth in computational and memory costs associated with Hermite-basis discretizations. Additionally, we randomly sample the hidden weights and biases of the neural network, significantly outperforming iterative, gradient-based optimization methods in terms of training time and accuracy. Furthermore, we employ a space-time separation strategy, using adaptive ordinary differential equation (ODE) solvers to update the evolution coefficients and incorporate temporal causality. To preserve the structure of the GPEs, we integrate a Gaussian-weighted ansatz into the neural network to enforce exponential decay at infinity, embed a normalization projection layer for mass normalization, and add an energy conservation constraint to mitigate long-time numerical dissipation. Comparative experiments with existing methods demonstrate the superior performance of SD-FSNN across a range of spatial dimensions and interaction parameters. Compared to existing random-feature methods, SD-FSNN reduces the complexity from linear to dimension-independent. Additionally, SD-FSNN achieves better accuracy and faster training compared to general high-dimensional solvers, while focusing specifically on high-dimensional GPEs on unbounded domains.

[LG-9] Bringing Clustering to MLL: Weakly-Supervised Clustering for Partial Multi-Label Learning

链接: https://arxiv.org/abs/2604.09359
作者: Yu Chen,Weijun Lv,Yue Huang,Xuhuan Zhu,Fang Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Label noise in multi-label learning (MLL) poses significant challenges for model training, particularly in partial multi-label learning (PML) where candidate labels contain both relevant and irrelevant labels. While clustering offers a natural approach to exploit data structure for noise identification, traditional clustering methods cannot be directly applied to multi-label scenarios due to a fundamental incompatibility: clustering produces membership values that sum to one per instance, whereas multi-label assignments require binary values that can sum to any number. We propose a novel weakly-supervised clustering approach for PML (WSC-PML) that bridges clustering and multi-label learning through membership matrix decomposition. Our key innovation decomposes the clustering membership matrix $\mathbf{A}$ into two components: $\mathbf{A} = \mathbf{\Pi} \odot \mathbf{F}$, where $\mathbf{\Pi}$ maintains clustering constraints while $\mathbf{F}$ preserves multi-label characteristics. This decomposition enables seamless integration of unsupervised clustering with multi-label supervision for effective label noise handling. WSC-PML employs a three-stage process: initial prototype learning from noisy labels, adaptive confidence-based weak supervision construction, and joint optimization via iterative clustering refinement. Extensive experiments on 24 datasets demonstrate that our approach outperforms six state-of-the-art methods across all evaluation metrics.
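
The decomposition itself is simple to state in code (toy numbers; the paper learns both factors jointly rather than fixing them):

```python
import numpy as np

# Pi: clustering memberships, each row sums to one (2 instances, 3 clusters)
Pi = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.6, 0.3]])
# F: binary multi-label indicator, rows may sum to any label count
F = np.array([[1, 0, 1],
              [0, 1, 1]])

A = Pi * F   # elementwise (Hadamard) product: A = Pi ⊙ F
```

Row sums of Pi stay at one (the clustering constraint) while F keeps the any-number-of-labels multi-label structure, so A can carry both at once, which is exactly the incompatibility the abstract describes.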

[LG-10] Drift-Aware Online Dynamic Learning for Nonstationary Multivariate Time Series: Application to Sintering Quality Prediction

链接: https://arxiv.org/abs/2604.09358
作者: Yumeng Zhao,Shengxiang Yang,Xianpeng Wang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Accurate prediction of nonstationary multivariate time series remains a critical challenge in complex industrial systems such as iron ore sintering. In practice, pronounced concept drift compounded by significant label verification latency rapidly degrades the performance of offline-trained models. Existing methods based on static architectures or passive update strategies struggle to simultaneously extract multi-scale spatiotemporal features and overcome the stability-plasticity dilemma without immediate supervision. To address these limitations, a Drift-Aware Multi-Scale Dynamic Learning (DA-MSDL) framework is proposed to maintain robust multi-output predictive performance via online adaptive mechanisms on nonstationary data streams. The framework employs a multi-scale bi-branch convolutional network as its backbone to disentangle local fluctuations from long-term trends, thereby enhancing representational capacity for complex dynamic patterns. To circumvent the label latency bottleneck, DA-MSDL leverages Maximum Mean Discrepancy (MMD) for unsupervised drift detection. By quantifying online statistical deviations in feature distributions, DA-MSDL proactively triggers model adaptation prior to inference. Furthermore, a drift-severity-guided hierarchical fine-tuning strategy is developed. Supported by prioritized experience replay from a dynamic memory queue, this approach achieves rapid distribution alignment while effectively mitigating catastrophic forgetting. Long-horizon experiments on real-world industrial sintering data and a public benchmark dataset demonstrate that DA-MSDL consistently outperforms representative baselines under severe concept drift. Exhibiting strong cross-domain generalization and predictive stability, the proposed framework provides an effective online dynamic learning paradigm for quality monitoring in nonstationary environments.
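
The unsupervised drift trigger can be sketched with a plain RBF-kernel MMD between a reference window and the live stream (the kernel bandwidth, window sizes, and threshold below are illustrative, not the paper's settings):

```python
import numpy as np

def mmd_rbf(X, Y, gamma=0.5):
    """Biased empirical MMD^2 with RBF kernel k(a,b) = exp(-gamma*||a-b||^2)."""
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())

rng = np.random.default_rng(0)
ref = rng.standard_normal((200, 3))            # reference feature window
same = rng.standard_normal((200, 3))           # stream without drift
drift = rng.standard_normal((200, 3)) + 1.5    # mean-shifted stream

mmd_same, mmd_drift = mmd_rbf(ref, same), mmd_rbf(ref, drift)
drift_detected = mmd_drift > 10.0 * mmd_same   # illustrative threshold
```

Because the statistic needs no labels, such a trigger can fire before delayed quality labels arrive, which is the point of the proactive adaptation in DA-MSDL.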

[LG-11] Hierarchical Flow Decomposition for Turning Movement Prediction at Signalized Intersections

链接: https://arxiv.org/abs/2604.09336
作者: Md Atiqur Rahman Mallick,Kamrul Hasan,Pulock Das,Liang Hong,S M Shazzad Rassel
类目: Machine Learning (cs.LG)
*备注: Accepted to IEEE SoutheastCon 2026. 6 pages, 5 figures

点击查看摘要

Abstract:Accurate prediction of intersection turning movements is essential for adaptive signal control but remains difficult due to the high volatility of directional flows. This study proposes HFD-TM (Hierarchical Flow-Decomposition for Turning Movement Prediction), a hierarchical deep learning framework that predicts turning movements by first forecasting corridor through-movements and then expanding these predictions to individual turning streams. This design is motivated by empirical traffic structure, where corridor flows account for 65.1% of total volume, exhibit lower volatility than turning movements, and explain 35.5% of turning-movement variance. A physics-informed loss function enforces flow conservation to maintain structural consistency. Evaluated on six months of 15-minute interval LiDAR (Light Detection and Ranging) data from a six-intersection corridor in Nashville, Tennessee, HFD-TM achieves a mean absolute error of 2.49 vehicles per interval, reducing MAE by 5.7% compared to a Transformer and by 27.0% compared to a GRU (Gated Recurrent Unit). Ablation results show that hierarchical decomposition provides the largest performance gain, while training time is 12.8 times lower than DCRNN (Diffusion Convolutional Recurrent Neural Network), demonstrating suitability for real-time traffic applications.
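
The flow-conservation idea can be sketched as a soft penalty on predictions (hypothetical numbers; the paper's actual loss may weight or normalize the residual differently):

```python
import numpy as np

def conservation_loss(turning, approach):
    """Soft flow-conservation term: left/through/right turning predictions
    at each approach should sum to the total approach volume."""
    residual = turning.sum(axis=-1) - approach
    return float(np.mean(residual ** 2))

turning = np.array([[30.0, 60.0, 10.0],   # L/T/R predictions, approach 1
                    [20.0, 70.0, 15.0]])  # approach 2 violates conservation
approach = np.array([100.0, 100.0])

loss = conservation_loss(turning, approach)   # penalizes the 5-vehicle gap
```

Added to the prediction loss, such a term keeps the expanded turning streams structurally consistent with the corridor-level forecasts they were decomposed from.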

[LG-12] Stability Enhanced Gaussian Process Variational Autoencoders

链接: https://arxiv.org/abs/2604.09331
作者: Carl R. Richardson,Jichen Zhang,Ethan King,Ján Drgoňa
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:A novel stability-enhanced Gaussian process variational autoencoder (SEGP-VAE) is proposed for indirectly training a low-dimensional linear time invariant (LTI) system, using high-dimensional video data. The mean and covariance function of the novel SEGP prior are derived from the definition of an LTI system, enabling the SEGP to capture the indirectly observed latent process using a combined probabilistic and interpretable physical model. The search space of LTI parameters is restricted to the set of semi-contracting systems via a complete and unconstrained parametrisation. As a result, the SEGP-VAE can be trained using unconstrained optimisation algorithms. Furthermore, this parametrisation prevents numerical issues caused by the presence of a non-Hurwitz state matrix. A case study applies SEGP-VAE to a dataset containing videos of spiralling particles. This highlights the benefits of the approach and the application-specific design choices that enabled accurate latent state predictions.

[LG-13] Online Intention Prediction via Control-Informed Learning

链接: https://arxiv.org/abs/2604.09303
作者: Tianyu Zhou,Zihao Liang,Zehui Lu,Shaoshuai Mou
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This paper presents an online intention prediction framework for estimating the goal state of autonomous systems in real time, even when intention is time-varying, and system dynamics or objectives include unknown parameters. The problem is formulated as an inverse optimal control / inverse reinforcement learning task, with the intention treated as a parameter in the objective. A shifting horizon strategy discounts outdated information, while online control-informed learning enables efficient gradient computation and online parameter updates. Simulations under varying noise levels and hardware experiments on a quadrotor drone demonstrate that the proposed approach achieves accurate, adaptive intention prediction in complex environments.

[LG-14] Meta-Learned Basis Adaptation for Parametric Linear PDEs

链接: https://arxiv.org/abs/2604.09289
作者: Vikas Dwivedi,Monica Sigovan,Bruno Sixou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a hybrid physics-informed framework for solving families of parametric linear partial differential equations (PDEs) by combining a meta-learned predictor with a least-squares corrector. The predictor, termed \textbfKAPI (Kernel-Adaptive Physics-Informed meta-learner), is a shallow task-conditioned model that maps query coordinates and PDE parameters to solution values while internally generating an interpretable, task-adaptive Gaussian basis geometry. A lightweight meta-network maps PDE parameters to basis centers, widths, and activity patterns, thereby learning how the approximation space should adapt across the parametric family. This predictor-generated geometry is transferred to a second-stage corrector, which augments it with a background basis and computes the final solution through a one-shot physics-informed Extreme Learning Machine (PIELM)-style least-squares solve. We evaluate the method on four linear PDE families spanning diffusion, transport, mixed advection–diffusion, and variable-speed transport. Across these cases, the predictor captures meaningful physics through localized and transport-aligned basis placement, while the corrector further improves accuracy, often by one or more orders of magnitude. Comparisons with parametric PINNs, physics-informed DeepONet, and uniform-grid PIELM correctors highlight the value of predictor-guided basis adaptation as an interpretable and efficient strategy for parametric PDE solving.
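
A bare-bones version of the second-stage one-shot least-squares solve over a Gaussian basis, for the 1-D toy problem u'' = -sin(x) with u(0) = u(pi) = 0 (exact solution sin x). The basis size, width, uniform centers, and boundary weight are illustrative stand-ins for the meta-learned geometry:

```python
import numpy as np

# One-shot least-squares solve of u'' = -sin(x), u(0) = u(pi) = 0
# over a fixed Gaussian radial basis (exact solution: u = sin(x)).
n_basis, w = 40, 0.25
c = np.linspace(0.0, np.pi, n_basis)        # Gaussian centers
x = np.linspace(0.0, np.pi, 200)            # collocation points

def phi(x, c, w):
    return np.exp(-(x[:, None] - c) ** 2 / (2.0 * w ** 2))

def phi_xx(x, c, w):                        # analytic second derivative
    d = x[:, None] - c
    return (d ** 2 / w ** 4 - 1.0 / w ** 2) * phi(x, c, w)

# PDE-residual rows plus heavily weighted boundary rows, solved once
A = np.vstack([phi_xx(x, c, w), 100.0 * phi(np.array([0.0, np.pi]), c, w)])
b = np.concatenate([-np.sin(x), [0.0, 0.0]])
coef, *_ = np.linalg.lstsq(A, b, rcond=None)

u = phi(x, c, w) @ coef
err = float(np.max(np.abs(u - np.sin(x))))
```

In the paper, the predictor would instead place centers, widths, and activity patterns adaptively per PDE parameter before this single linear solve is performed.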

[LG-15] Are Independently Estimated View Uncertainties Comparable? Unified Routing for Trusted Multi-View Classification

链接: https://arxiv.org/abs/2604.09288
作者: Yilin Zhang,Cai Xu,Haishun Chen,Ziyu Guan,Wei Zhao
类目: Machine Learning (cs.LG)
*备注: 14pages, Under Review

点击查看摘要

Abstract:Trusted multi-view classification typically relies on a view-wise evidential fusion process: each view independently produces class evidence and uncertainty, and the final prediction is obtained by aggregating these independent opinions. While this design is modular and uncertainty-aware, it implicitly assumes that evidence from different views is numerically comparable. In practice, however, this assumption is fragile. Different views often differ in feature space, noise level, and semantic granularity, while independently trained branches are optimized only for prediction correctness, without any constraint enforcing cross-view consistency in evidence strength. As a result, the uncertainty used for fusion can be dominated by branch-specific scale bias rather than true sample-level reliability. To address this issue, we propose Trusted Multi-view learning with Unified Routing (TMUR), which decouples view-specific evidence extraction from fusion arbitration. TMUR uses view-private experts and one collaborative expert, and employs a unified router that observes the global multi-view context to generate sample-level expert weights. Soft load-balancing and diversity regularization further encourage balanced expert utilization and more discriminative expert specialization. We also provide theoretical analysis showing why independent evidential supervision does not identify a common cross-view evidence scale, and why unified global routing is preferable to branch-local arbitration when reliability is sample-dependent.

[LG-16] Distributed Online Convex Optimization with Compressed Communication: Optimal Regret and Applications

链接: https://arxiv.org/abs/2604.09276
作者: Sifan Yang,Dan-Yue Li,Lijun Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributed online convex optimization (D-OCO) is a powerful paradigm for modeling distributed scenarios with streaming data. However, the communication cost between local learners and the central server is substantial in large-scale applications. To alleviate this bottleneck, we initiate the study of D-OCO with compressed communication. Firstly, to quantify the compression impact, we establish $\Omega(\delta^{-1/2}\sqrt{T})$ and $\Omega(\delta^{-1}\log T)$ lower bounds for convex and strongly convex loss functions, respectively, where $\delta \in (0,1]$ is the compression ratio. Secondly, we propose an optimal algorithm, which enjoys regret bounds of $O(\delta^{-1/2}\sqrt{T})$ and $O(\delta^{-1}\log T)$ for convex and strongly convex loss functions, respectively. Our method incorporates the error feedback mechanism into the Follow-the-Regularized-Leader framework to address the coupling between the compression error and the projection error. Furthermore, we employ the online compression strategy to mitigate the accumulated error arising from the bidirectional compression. Our online method has great generality, and can be extended to the offline stochastic setting via online-to-batch conversion. We establish convergence rates of $O(\delta^{-1/2}T^{-1/2})$ and $O(\delta^{-1}T^{-1})$ for convex and strongly convex loss functions, respectively, providing the first guarantees for distributed non-smooth optimization with compressed communication and domain constraints.
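
The error feedback mechanism the abstract builds on is standard enough to sketch on a single-worker quadratic (a minimal illustration with a top-k compressor, not the paper's FTRL-based algorithm; step size, dimensions, and iteration count are arbitrary):

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries (a delta = k/d compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

# Gradient descent on f(x) = 0.5*||x - x_star||^2 with top-k compression
# plus error feedback: the un-transmitted residual is carried forward.
rng = np.random.default_rng(0)
x_star = rng.standard_normal(20)
x, e = np.zeros(20), np.zeros(20)
lr, k = 0.2, 4                      # delta = k/d = 0.2
for _ in range(300):
    g = x - x_star                  # exact gradient
    c = top_k(e + lr * g, k)        # compress update + accumulated error
    e = e + lr * g - c              # error feedback memory
    x = x - c                       # apply only the transmitted part

dist = float(np.linalg.norm(x - x_star))
```

Without the memory term e, the dropped coordinates would never be corrected; with it, the iterate still converges to the optimum despite transmitting only 20% of each update.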

[LG-17] The causal relation between off-street parking and electric vehicle adoption in Scotland

链接: https://arxiv.org/abs/2604.09271
作者: Bernardino D’Amico,Achille Fonzone,Emma Hart
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The transition to electric mobility hinges on maximising aggregate adoption while also facilitating equitable access. This study examines whether the ‘charging divide’ between households with and without off-street parking reflects a genuine infrastructure constraint or a by-product of socio-economic disparity. Moving beyond conventional predictive models, we apply a probabilistic causal framework to a nationally representative dataset of Scottish households, enabling estimation of policy interventions while explicitly neutralising the confounding effect of other causal factors. The results reveal a structural hierarchy in the EV adoption process. Private off-street parking functions as a conversion catalyst: enabling access to home-charging increases the probability of EV ownership from 3.3% to 5.6% (a 70% relative, 2.3 percentage point absolute increase). However, this effect primarily accelerates households already economically positioned to purchase an EV rather than recruiting new entrants. By contrast, household income operates as the fundamental affordability ceiling. A causal contrast between lower- and higher-income strata shows a reduction in market non-participation by 23.1 percentage points, identifying financial capacity as the principal gatekeeper to entering the EV transition funnel. Crucially, the analysis demonstrates that standard observational models overstate the isolated effect of off-street parking infrastructure. The apparent effect emerges from selection bias: higher-income households are disproportionately likely to possess both private parking and the means to purchase EVs. These findings support a dual-track policy strategy: lowering the affordability ceiling for non-participants through financial instruments, while addressing EV home-charging access for the ‘latent intent’ cohort in high-density urban contexts.

[LG-18] Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

链接: https://arxiv.org/abs/2604.09258
作者: Huanran Chen,Huaqing Zhang,Xiao Li,Yinpeng Dong,Ke Shen,Jun Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pretraining is the cornerstone of Large Language Models (LLMs), dominating the vast majority of computational budget and data to serve as the primary engine for their capabilities. During pretraining, LLMs acquire foundational knowledge from an unprecedentedly massive and diverse data sources, encompassing a vast array of domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: Does the model converge to a common minimizer across all data sources (e.g., \cref{fig:cwa_illustration:close}), or merely a minimizer of the summed loss (e.g., \cref{fig:cwa_illustration:distant})? We hypothesize that the geometric “closeness” of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, various data mixtures and hyperparameter schedules, show that Nexus significantly boosts downstream performance, despite achieving the same pretraining loss (see \cref{fig:demo:benchmark}). Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.
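
The quantity Nexus maximizes, cross-source gradient similarity, is easy to illustrate with two toy quadratic "sources" (the targets and losses below are illustrative, not the paper's setup):

```python
import numpy as np

grad = lambda x, t: x - t                 # gradient of 0.5*||x - t||^2
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.zeros(3)                           # current shared parameters
# Two sources whose minima nearly coincide -> well-aligned gradients
sim_common = cos(grad(x, np.array([1.0, 1.0, 0.0])),
                 grad(x, np.array([1.0, 0.8, 0.2])))
# Two sources with distant minima -> conflicting gradients
sim_distant = cos(grad(x, np.array([1.0, 0.0, 0.0])),
                  grad(x, np.array([-1.0, 0.5, 0.0])))
```

Nexus adds a term steering optimization toward regions where such per-source similarities stay high, i.e. toward parameter points close to a common minimizer of all sources.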

[LG-19] DiffHLS: Differential Learning for High-Level Synthesis QoR Prediction with GNNs and LLM Code Embeddings

链接: https://arxiv.org/abs/2604.09240
作者: Zedong Peng,Zeju Li,Qiang Xu,Jieru Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-Level Synthesis (HLS) compiles C/C++ into RTL, but exploring pragma-driven optimization choices remains expensive because each design point requires time-consuming synthesis. We propose DiffHLS, a differential learning framework for HLS Quality-of-Result (QoR) prediction that learns from kernel–design pairs: a kernel baseline and a pragma-inserted design variant. DiffHLS encodes kernel and design intermediate-representation graphs with dedicated graph neural network (GNN) branches, and augments the delta pathway with code embeddings from a pretrained code large language model (LLM). Instead of regressing absolute targets directly, we jointly predict the kernel baseline and the design-induced delta, and compose them to obtain the design prediction. On PolyBench, DiffHLS attains lower average MAPE than GNN baselines under four GNN backbones, and LLM code embeddings consistently improve over a GNN-only ablation. We further validate scalability on the ForgeHLS dataset.

[LG-20] Automated Batch Distillation Process Simulation for a Large Hybrid Dataset for Deep Anomaly Detection

链接: https://arxiv.org/abs/2604.09166
作者: Jennifer Werner,Justus Arweiler,Indra Jungjohann,Jochen Schmid,Fabian Jirasek,Hans Hasse,Michael Bortz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection (AD) in chemical processes based on deep learning offers significant opportunities but requires large, diverse, and well-annotated training datasets that are rarely available from industrial operations. In a recent work, we introduced a large, fully annotated experimental dataset for batch distillation under normal and anomalous operating conditions. In the present study, we augment this dataset with a corresponding simulation dataset, creating a novel hybrid dataset. The simulation data is generated in an automated workflow with a novel Python-based process simulator that employs a tailored index-reduction strategy for the underlying differential-algebraic equations. Leveraging the rich metadata and structured anomaly annotations of the experimental database, experimental records are automatically translated into simulation scenarios. After calibration to a single reference experiment, the dynamics of the other experiments are well predicted. This enabled the fully automated, consistent generation of time-series data for a large number of experimental runs, covering both normal operation and a wide range of actuator- and control-related anomalies. The resulting hybrid dataset is released openly. From a process simulation perspective, this work demonstrates the automated, consistent simulation of large-scale experimental campaigns, using batch distillation as an example. From a data-driven AD perspective, the hybrid dataset provides a unique basis for simulation-to-experiment style transfer, the generation of pseudo-experimental data, and future research on deep AD methods in chemical process monitoring.

[LG-21] Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling

链接: https://arxiv.org/abs/2604.09159
作者: Xubin Zhou,Yipeng Yang,Zhan Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Maximum entropy reinforcement learning (MaxEnt RL) has become a standard framework for sequential decision making, yet its standard Gaussian policy parameterization is inherently unimodal, limiting its ability to model complex multimodal action distributions. This limitation has motivated increasing interest in generative policies based on diffusion and flow matching as more expressive alternatives. However, incorporating such policies into MaxEnt RL is challenging for two main reasons: the likelihood and entropy of continuous-time generative policies are generally intractable, and multi-step sampling introduces both long-horizon backpropagation instability and substantial inference latency. To address these challenges, we propose Truncated Rectified Flow Policy (TRFP), a framework built on a hybrid deterministic-stochastic architecture. This design makes entropy-regularized optimization tractable while supporting stable training and effective one-step sampling through gradient truncation and flow straightening. Empirical results on a toy multigoal environment and 10 MuJoCo benchmarks show that TRFP captures multimodal behavior effectively, outperforms strong baselines on most benchmarks under standard sampling, and remains highly competitive under one-step sampling.
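
Why flow straightening enables one-step sampling can be shown on a 1-D toy (illustrative only, not the TRFP architecture): with straight-line couplings the velocity target is constant along each path, so a single Euler step recovers the target distribution.

```python
import numpy as np

# Toy rectified-flow sketch: straight-line coupling x_t = (1-t)*x0 + t*x1
# with velocity target v = x1 - x0 (constant along each path).
rng = np.random.default_rng(0)
x0 = rng.standard_normal(5000)            # source samples ~ N(0, 1)
x1 = x0 + 3.0                             # paired targets  ~ N(3, 1)

t = rng.uniform(size=x0.shape)
xt = (1 - t) * x0 + t * x1                # points along the straight paths
v_target = x1 - x0                        # equals 3 for this coupling

# Fit v(x, t) = a*x + b*t + c by least squares -- expressive enough here
A = np.stack([xt, t, np.ones_like(t)], axis=1)
a, b, c = np.linalg.lstsq(A, v_target, rcond=None)[0]

# One-step (Euler, t: 0 -> 1) sampling: x1_hat = x0 + v(x0, 0)
x_new = rng.standard_normal(5000)
samples = x_new + (a * x_new + b * 0.0 + c)
```

For curved (diffusion-like) probability paths a single Euler step would be badly biased; straightening is what makes the truncation to one step cheap and accurate.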

[LG-22] Score-Driven Rating System for Sports

链接: https://arxiv.org/abs/2604.09143
作者: Vladimír Holý,Michal Černý
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper introduces a score-driven rating system, a generalization of the classical Elo rating system that employs the score, i.e. the gradient of the log-likelihood, as the updating mechanism for player and team ratings. The proposed framework extends beyond simple win/loss game outcomes and accommodates a wide range of game results, such as point differences, win/draw/loss outcomes, or complete rankings. Theoretical properties of the score are derived, showing that it has zero expected value, sums to zero across all players, and decreases with increasing value of a player’s rating, thereby ensuring internal consistency and fairness. Furthermore, the score-driven rating system exhibits a reversion property, meaning that ratings tend to follow the underlying unobserved true skills over time. The proposed framework provides a theoretical rationale for existing dynamic models of sports performance and offers a systematic approach for constructing new ones.
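The classical Elo system is recovered as the win/loss special case of this framework: with a logistic (Bradley-Terry) win probability, the update direction is exactly the score, i.e. the gradient of the Bernoulli log-likelihood. A minimal sketch under that assumption (`k` and `s` are illustrative constants, not values from the paper):

```python
def score_driven_update(r_a, r_b, outcome, k=20.0, s=400.0):
    """One rating update for a win/loss game (outcome = 1.0 if A wins).

    p_a is A's logistic win probability; (outcome - p_a) is the score,
    the gradient of the Bernoulli log-likelihood w.r.t. r_a. The two
    updates sum to zero across players, matching the paper's property.
    """
    p_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / s))
    score = outcome - p_a
    return r_a + k * score, r_b - k * score

ra, rb = score_driven_update(1500.0, 1500.0, 1.0)  # A beats an equal-rated B
```

The score also has zero expectation under the model, since E[outcome - p_a] = p_a - p_a = 0, which is the consistency property the paper establishes in general.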

[LG-23] MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs

链接: https://arxiv.org/abs/2604.09124
作者: Enrico Russo,Mohamed Amine Hamdi,Alessandro Ottaviano,Francesco Conti,Angelo Garofalo,Daniele Jahier Pagliari,Maurizio Palesi,Luca Benini,Alessio Burrello
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC26)

点击查看摘要

Abstract:Deploying DNNs on System-on-Chips (SoC) with multiple heterogeneous acceleration engines is challenging, and the majority of deployment frameworks cannot fully exploit heterogeneity. We present MATCHA, a unified DNN deployment framework that generates highly concurrent schedules for parallel, heterogeneous accelerators and uses constraint programming to optimize L3/L2 memory allocation and scheduling. Using pattern matching, tiling, and mapping across individual HW units enables parallel execution and high accelerator utilization. On the MLPerf Tiny benchmark, using a SoC with two heterogeneous accelerators, MATCHA improves accelerator utilization and reduces inference latency by up to 35% with respect to the state-of-the-art MATCH compiler.

[LG-24] GeoPAS: Geometric Probing for Algorithm Selection in Continuous Black-Box Optimisation GECCO2026

链接: https://arxiv.org/abs/2604.09095
作者: Jiabao Brad Wang,Xiang Shi,Yiliang Yuan,Mustafa Misir
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Companion to a paper to appear at GECCO 2026

点击查看摘要

Abstract:Automated algorithm selection in continuous black-box optimisation typically relies on fixed landscape descriptors computed under a limited probing budget, yet such descriptors can degrade under problem-split or cross-benchmark evaluation. We propose GeoPAS, a geometric probing approach that represents a problem instance by multiple coarse two-dimensional slices sampled across locations, orientations, and logarithmic scales. A shared validity-aware convolutional encoder maps each slice to an embedding, conditions it on slice-scale and amplitude statistics, and aggregates the resulting features permutation-invariantly for risk-aware solver selection via log-scale performance prediction with an explicit penalty on tail failures. On COCO/BBOB with a 12-solver portfolio in dimensions 2–10, GeoPAS improves over the single best solver under leave-instance-out, grouped random, and leave-problem-out evaluation. These results suggest that multi-scale geometric slices provide a useful transferable static signal for algorithm selection, although a small number of heavy-tail regimes remain and continue to dominate the mean. Our code is available at this https URL.

[LG-25] Synthesizing real-world distributions from high-dimensional Gaussian Noise with Fully Connected Neural Network

链接: https://arxiv.org/abs/2604.09091
作者: Joanna Komorniczak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The use of synthetic data in machine learning applications and research offers many benefits, including performance improvements through data augmentation, privacy preservation of original samples, and reliable method assessment with fully synthetic data. This work proposes a time-efficient synthetic data generation method based on a fully connected neural network and a randomized loss function that transforms a random Gaussian distribution to approximate a target real-world dataset. The experiments conducted on 25 diverse tabular real-world datasets confirm that the proposed solution surpasses the state-of-the-art generative methods and achieves reference MMD scores orders of magnitude faster than modern deep learning solutions. The experiments involved analyzing distributional similarity, assessing the impact on classification quality, and using PCA for dimensionality reduction, which further enhances data privacy and can boost classification quality while reducing time and memory complexity.
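The MMD score the abstract uses to compare synthetic and real samples can be estimated in a few lines. A minimal pure-Python sketch of the biased squared-MMD estimator with an RBF kernel (the kernel choice and bandwidth `gamma` are illustrative; the paper's exact evaluation protocol may differ):

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF (Gaussian) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy between
    sample sets X and Y. It is exactly 0 for identical sample sets and
    grows as the two empirical distributions diverge."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / len(X) ** 2
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy
```

A generator trained to map Gaussian noise toward a target dataset can then be monitored by the mmd2 between a batch of generated samples and a batch of real ones.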

[LG-26] Temporal Patch Shuffle (TPS): Leveraging Patch-Level Shuffling to Boost Generalization and Robustness in Time Series Forecasting

链接: https://arxiv.org/abs/2604.09067
作者: Jafar Bakhshaliyev,Johannes Burchert,Niels Landwehr,Lars Schmidt-Thieme
类目: Machine Learning (cs.LG)
*备注: 25 pages, 7 figures, 17 tables

点击查看摘要

Abstract:Data augmentation is a crucial technique for improving model generalization and robustness, particularly in deep learning models where training data is limited. Although many augmentation methods have been developed for time series classification, most are not directly applicable to time series forecasting due to the need to preserve temporal coherence. In this work, we propose Temporal Patch Shuffle (TPS), a simple and model-agnostic data augmentation method for forecasting that extracts overlapping temporal patches, selectively shuffles a subset of patches using variance-based ordering as a conservative heuristic, and reconstructs the sequence by averaging overlapping regions. This design increases sample diversity while preserving forecast-consistent local temporal structure. We extensively evaluate TPS across nine long-term forecasting datasets using five recent model families (TSMixer, DLinear, PatchTST, TiDE, and LightTS), and across four short-term forecasting datasets using PatchTST, observing consistent performance improvements. Comprehensive ablation studies further demonstrate the effectiveness, robustness, and design rationale of the proposed method.
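The three steps the abstract names (overlapping patch extraction, variance-ordered partial shuffling, overlap-averaged reconstruction) can be sketched in pure Python. Patch length, stride, the shuffled fraction, and the choice to shuffle the lowest-variance patches are all illustrative assumptions here, not the paper's settings:

```python
import random
from statistics import pvariance

def temporal_patch_shuffle(x, patch_len=8, stride=4, frac=0.25, seed=0):
    """TPS-style augmentation sketch for a univariate series x (list of floats)."""
    starts = list(range(0, len(x) - patch_len + 1, stride))
    patches = [x[s:s + patch_len] for s in starts]
    # Conservative heuristic (assumed direction): permute only the
    # lowest-variance patches, which disturb temporal structure least.
    order = sorted(range(len(patches)), key=lambda i: pvariance(patches[i]))
    chosen = order[:max(2, int(frac * len(patches)))]
    perm = chosen[:]
    random.Random(seed).shuffle(perm)
    shuffled = patches[:]
    for i, j in zip(chosen, perm):
        shuffled[i] = patches[j]
    # Reconstruct by averaging the overlapping regions of the patches.
    acc, cnt = [0.0] * len(x), [0] * len(x)
    for s, p in zip(starts, shuffled):
        for t, v in enumerate(p):
            acc[s + t] += v
            cnt[s + t] += 1
    return [a / c if c else xi for a, c, xi in zip(acc, cnt, x)]
```

Positions not covered by any patch (possible at the tail for some length/stride combinations) fall back to the original values, so the augmented series always has the input's length.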

[LG-27] Feature-Label Modal Alignment for Robust Partial Multi-Label Learning

链接: https://arxiv.org/abs/2604.09064
作者: Yu Chen,Weijun Lv,Yue Huang,Xiaozhao Fang,Jie Wen,Yong Xu,Guanbin Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In partial multi-label learning (PML), each instance is associated with a set of candidate labels containing both ground-truth and noisy labels. The presence of noisy labels disrupts the correspondence between features and labels, degrading classification performance. To address this challenge, we propose a novel PML method based on feature-label modal alignment (PML-MA), which treats features and labels as two complementary modalities and restores their consistency through systematic alignment. Specifically, PML-MA first employs low-rank orthogonal decomposition to generate pseudo-labels that approximate the true label distribution by filtering noisy labels. It then aligns features and pseudo-labels through both global projection into a common subspace and local preservation of neighborhood structures. Finally, a multi-peak class prototype learning mechanism leverages the multi-label nature where instances simultaneously belong to multiple categories, using pseudo-labels as soft membership weights to enhance discriminability. By integrating modal alignment with prototype-guided refinement, PML-MA ensures pseudo-labels better reflect the true distribution while maintaining robustness against label noise. Extensive experiments on both real-world and synthetic datasets demonstrate that PML-MA significantly outperforms state-of-the-art methods, achieving superior classification accuracy and noise robustness.

[LG-28] The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge

链接: https://arxiv.org/abs/2604.09034
作者: Gyuwon Park,DongIl Shin,SolGil Oh,SangGi Ryu,Byung-Hak Kim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid evolution of Large Language Models (LLMs) has significantly impacted the field of natural language processing, but their growing complexity raises concerns about resource usage and transparency. Addressing these challenges, we participated in the NeurIPS LLM Efficiency Challenge, aiming to fine-tune a foundation model within stringent constraints. Our focus was the LLaMa2 70 billion model, optimized on a single A100 40GB GPU within a 24-hour limit. Our methodology hinged on a custom dataset, carefully assembled from diverse open-source resources and benchmark tests, aligned with the challenge’s open-source ethos. Our approach leveraged Quantized Low-Rank Adaptation (QLoRA) fine-tuning, integrated with advanced attention mechanisms like Flash Attention 2. We experimented with various configurations of the LoRA technique, optimizing the balance between computational efficiency and model accuracy. Our fine-tuning strategy was underpinned by the creation and iterative testing of multiple dataset compositions, leading to the selection of a version that demonstrated robust performance across diverse tasks and benchmarks. The culmination of our efforts was an efficiently fine-tuned LLaMa2 70B model that operated within the constraints of a single GPU, showcasing not only a significant reduction in resource utilization but also high accuracy across a range of QA benchmarks. Our study serves as a testament to the feasibility of optimizing large-scale models in resource-constrained environments, emphasizing the potential of LLMs in real-world applications.

[LG-29] Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference

链接: https://arxiv.org/abs/2604.08971
作者: Yueyuan Sui,Payal Mohapatra,Doğaç Eldenk,Haodong Yang,Yiting Zhang,Haoyan Zhang,Qi Zhu,Stephen Xia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Edge devices increasingly run multimodal sensing pipelines that must remain accurate despite fluctuating power budgets and unpredictable sensor dropout. Existing pruning methods fail under these conditions: they generally require fine-tuning after compression, consuming over 10\times the deployment energy, and they assign static importance scores that are blind to which sensors are present. We present the SentryFuse framework, which addresses both challenges jointly through two key components. First, SentryGate learns modality-conditioned importance scores during training via first-order saliency supervision and then prunes attention heads and feed-forward channels at deployment without fine-tuning. Second, SentryAttend replaces dense self-attention, a key bottleneck in contemporary multimodal architectures, with sparse grouped-query attention, yielding a net 15% reduction in GFLOPs across three different multimodal architectures. Across three applications and multimodal backbones, SentryGate achieves a 12.7% average accuracy improvement over the strongest pruning baseline, and up to 18% under modality dropout conditions. Together, SentryFuse reduces memory by 28.2% and lowers latency by up to 1.63\times without further fine-tuning, establishing modality-aware zero-shot compression as a practical path to multimodal intelligence on heterogeneous edge hardware.

[LG-30] Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

链接: https://arxiv.org/abs/2604.08960
作者: Zhiqiang Dong,Teng Pang,Rongjian Xu,Guoqiang Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline goal-conditioned reinforcement learning (GCRL) is a practical reinforcement learning paradigm that aims to learn goal-conditioned policies from reward-free offline data. Despite recent advances in hierarchical architectures such as HIQL, long-horizon control in offline GCRL remains challenging due to the limited expressiveness of Gaussian policies and the inability of high-level policies to generate effective subgoals. To address these limitations, we propose the goal-conditioned mean flow policy, which introduces an average velocity field into hierarchical policy modeling for offline GCRL. Specifically, the mean flow policy captures complex target distributions for both high-level and low-level policies through a learned average velocity field, enabling efficient action generation via one-step sampling. Furthermore, considering the insufficiency of goal representation, we introduce a LeJEPA loss that repels goal representation embeddings during training, thereby encouraging more discriminative representations and improving generalization. Experimental results show that our method achieves strong performance across both state-based and pixel-based tasks in the OGBench benchmark.

[LG-31] Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models

链接: https://arxiv.org/abs/2604.08941
作者: Binesh Sadanandan,Vahid Behzadan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Medical Vision-Language Models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma-4B-IT across in-distribution (MIMIC-CXR) and out-of-distribution (PadChest) chest X-ray datasets, with cross-architecture validation on LLaVA-Rad-7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-Rad, p < 10^-4), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC-to-PadChest shift (42.9% ECE, 34.1% accuracy), though LLaVA-Rad's ensemble does not collapse (69.1%). MC Dropout achieves the best calibration (ECE 4.3%) and selective prediction coverage (21.5% at 5% risk), yet total entropy from a single forward pass outperforms the ensemble for both error detection (AUROC 0.743 vs. 0.657) and paraphrase screening. Simple methods win.
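The single-forward-pass signal the abstract relies on is just the Shannon entropy of the answer distribution; one threshold then screens both error-prone and rephrase-sensitive predictions. A minimal sketch (the threshold value is illustrative, not a calibrated one from the paper):

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one forward pass's output
    distribution; high values indicate proximity to the decision
    boundary, the common cause the paper identifies."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def needs_review(probs, threshold=0.5):
    """Single-threshold screening: route high-entropy answers to review."""
    return predictive_entropy(probs) >= threshold
```

A near-uniform answer distribution (a boundary case) is flagged, while a confident one passes; the paper's contribution is showing the same flag also predicts which answers flip under paraphrase.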

[LG-32] Delve into the Applicability of Advanced Optimizers for Multi-Task Learning

链接: https://arxiv.org/abs/2604.08939
作者: Zhipeng Zhou,Linxiao Cao,Pengcheng Wu,Peilin Zhao,Chunyan Miao
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Multi-Task Learning (MTL) is a foundational machine learning problem that has seen extensive development over the past decade. Recently, various optimization-based MTL approaches have been proposed to learn multiple tasks simultaneously by altering the optimization trajectory. Although these methods strive to de-conflict and re-balance tasks, we empirically identify that their effectiveness is often undermined by an overlooked factor when employing advanced optimizers: the instant-derived gradients play only a marginal role in the actual parameter updates. This discrepancy prevents MTL frameworks from fully exerting their influence on the learning dynamics. Furthermore, we observe that Muon, a recently emerged advanced optimizer, inherently functions as a multi-task learner, which underscores the critical importance of the gradients used for its orthogonalization. To address these issues, we propose APT (Applicability of advanced oPTimizers), a framework featuring a simple adaptive momentum mechanism designed to balance the strengths between advanced optimizers and MTL. Additionally, we introduce a light direction preservation method to facilitate Muon’s orthogonalization. Extensive experiments across four mainstream MTL datasets demonstrate that APT consistently augments existing MTL approaches, yielding substantial performance improvements.

[LG-33] Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning ACL2026

链接: https://arxiv.org/abs/2604.08926
作者: Taojie Zhu,Dongyang Xu,Ding Zou,Sen Zhao,Qiaobo Hao,Zhiguo Yang,Yonghong He
类目: Machine Learning (cs.LG)
*备注: ACL 2026 findings

点击查看摘要

Abstract:Post-training paradigms for Large Language Models (LLMs), primarily Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), face a fundamental dilemma: SFT provides stability (low variance) but suffers from high fitting bias, while RL enables exploration (low bias) but grapples with high gradient variance. Existing unified optimization strategies often employ naive loss weighting, overlooking the statistical conflict between these distinct gradient signals. In this paper, we provide a rigorous theoretical analysis of this bias-variance trade-off and propose DYPO (Dynamic Policy Optimization), a unified framework designed to structurally mitigate this conflict. DYPO integrates three core components: (1) a Group Alignment Loss (GAL) that leverages intrinsic group dynamics to significantly reduce RL gradient variance; (2) a Multi-Teacher Distillation mechanism that corrects SFT fitting bias via diverse reasoning paths; and (3) a Dynamic Exploitation-Exploration Gating mechanism that adaptively arbitrates between stable SFT and exploratory RL based on reward feedback. Theoretical analysis confirms that DYPO linearly reduces fitting bias and minimizes overall variance. Extensive experiments demonstrate that DYPO significantly outperforms traditional sequential pipelines, achieving an average improvement of 4.8% on complex reasoning benchmarks and 13.3% on out-of-distribution tasks. Our code is publicly available at this https URL.

[LG-34] Using Synthetic Data for Machine Learning-based Childhood Vaccination Prediction in Narok Kenya

链接: https://arxiv.org/abs/2604.08902
作者: Jimmy Bach,Yang Li,Yaqi Liu,John Sankok,Rose Kimani,Carrie B. Dolan,Julius N. Odhiambo,Haipeng Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Background: Limited data utilization in low-resource settings poses a barrier to the vaccine delivery ecosystem, undermining efforts to achieve equitable immunization coverage. In nomadic populations, individuals face an increased risk of missing crucial vaccination doses as children. One such population is the Maasai in Narok County, Kenya, where the absence of high-volume, quality data hampers accurate coverage estimates, impedes efficient resource allocation, and weakens the ability to deliver timely interventions. Additionally, data privacy concerns are heightened in groups with limited sensitive data. Objectives: First, we aim to identify children at risk of missing key vaccines across a large population to provide timely, evidence-based interventions that support increased vaccination coverage. Second, we aim to better protect the privacy of sensitive health data in a vulnerable population. Methods: We digitized 8 years of child vaccination records from the MOH 510 registry (n=6,913) and applied machine learning models (Logistic Regression and XGBoost) to identify children at risk. Additionally, we utilize a novel approach to tabular diffusion-based synthetic data generation (TabSyn) to protect patient privacy within the models. Results: Our findings show that classification techniques can reliably and successfully predict children at risk of missing a vaccine, with recall, precision, and F1-scores exceeding 90% for some vaccines modeled. Additionally, training these models with synthetic data rather than real data, thus preserving the privacy of individuals within the original dataset, does not lead to a loss in predictive performance. Conclusion: These results support the use of synthetic data implementation in health informatics strategies for clinics with limited digital infrastructure, enabling privacy-preserving, scalable forecasting for childhood immunization coverage.

[LG-35] Adaptive Candidate Point Thompson Sampling for High-Dimensional Bayesian Optimization AISTATS2026

链接: https://arxiv.org/abs/2604.08891
作者: Donney Fan,Geoff Pleiss
类目: Machine Learning (cs.LG)
*备注: AISTATS 2026

点击查看摘要

Abstract:In Bayesian optimization, Thompson sampling selects the evaluation point by sampling from the posterior distribution over the objective function maximizer. Because this sampling problem is intractable for Gaussian process (GP) surrogates, the posterior distribution is typically restricted to fixed discretizations (i.e., candidate points) that become exponentially sparse as dimensionality increases. While previous works aim to increase candidate point density through scalable GP approximations, our orthogonal approach increases density by adaptively reducing the search space during sampling. Specifically, we introduce Adaptive Candidate Thompson Sampling (ACTS), which generates candidate points in subspaces guided by the gradient of a surrogate model sample. ACTS is a simple drop-in replacement for existing TS methods – including those that use trust regions or other local approximations – producing better samples of maxima and improved optimization across synthetic and real-world benchmarks.
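Candidate-point Thompson sampling itself is a short routine: draw one joint posterior sample of the surrogate at the candidate set and evaluate the objective at its argmax; ACTS's contribution is in where the candidates come from. A generic sketch with the posterior sampler as a stand-in callable (the gradient-guided subspace candidate generation and GP machinery are not modeled here):

```python
def thompson_select(candidates, posterior_sample):
    """Pick the next evaluation point for Bayesian optimization:
    the argmax of one joint surrogate sample over a fixed candidate
    set. `posterior_sample` maps a list of points to one sampled
    objective value per point (a single draw, not the posterior mean)."""
    values = posterior_sample(candidates)
    best = max(range(len(candidates)), key=values.__getitem__)
    return candidates[best]
```

Because the candidate set must discretize the posterior over maximizers, its density directly limits sample quality, which is why ACTS concentrates candidates in adaptively reduced subspaces rather than spreading them over the full domain.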

[LG-36] Uncertainty-Aware Transformers: Conformal Prediction for Language Models

链接: https://arxiv.org/abs/2604.08885
作者: Abhiram Vellore,Niraj K. Jha
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers have had a profound impact on the field of artificial intelligence, especially on large language models and their variants. However, as was the case with neural networks, their black-box nature limits trust and deployment in high-stakes settings. For models to be genuinely useful and trustworthy in critical applications, they must provide more than just predictions: they must supply users with a clear understanding of the reasoning that underpins their decisions. This article presents an uncertainty quantification framework for transformer-based language models. This framework, called CONFIDE (CONformal prediction for FIne-tuned DEep language models), applies conformal prediction to the internal embeddings of encoder-only architectures, like BERT and RoBERTa, while enabling hyperparameter tuning. CONFIDE uses either [CLS] token embeddings or flattened hidden states to construct class-conditional nonconformity scores, enabling statistically valid prediction sets with instance-level explanations. Empirically, CONFIDE improves test accuracy by up to 4.09% on BERT-tiny and achieves greater correct efficiency (i.e., the expected size of the prediction set conditioned on it containing the true label) compared to prior methods, including NM2 and VanillaNN. We show that early and intermediate transformer layers often yield better-calibrated and more semantically meaningful representations for conformal prediction. In resource-constrained models and high-stakes tasks with ambiguous labels, CONFIDE offers robustness and interpretability where softmax-based uncertainty fails. We position CONFIDE as a framework for practical diagnostic and efficiency/robustness improvement over prior conformal baselines.
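Class-conditional split-conformal prediction, the statistical core of CONFIDE, can be sketched independently of the transformer embeddings: calibrate one nonconformity threshold per class, then include in the prediction set every class whose score clears its own threshold. A minimal pure-Python sketch (the embedding-based construction of the nonconformity scores is not shown):

```python
import math

def class_thresholds(cal_scores, cal_labels, alpha=0.1):
    """Per-class conformal threshold: the (1 - alpha) finite-sample
    quantile of nonconformity scores among calibration points of
    each class, via the ceil((n + 1)(1 - alpha)) order statistic."""
    by_class = {}
    for s, y in zip(cal_scores, cal_labels):
        by_class.setdefault(y, []).append(s)
    qhat = {}
    for y, ss in by_class.items():
        ss.sort()
        k = min(len(ss) - 1, math.ceil((len(ss) + 1) * (1 - alpha)) - 1)
        qhat[y] = ss[k]
    return qhat

def prediction_set(scores_by_class, qhat):
    """Classes whose nonconformity score is within their threshold;
    classes with no calibration data are excluded (a simplification)."""
    return {y for y, s in scores_by_class.items() if y in qhat and s <= qhat[y]}
```

The size of the resulting set is the "efficiency" the abstract refers to: a well-calibrated model yields small sets that still contain the true label with probability at least 1 - alpha.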

[LG-37] How does Chain of Thought decompose complex tasks?

链接: https://arxiv.org/abs/2604.08872
作者: Amrut Nadgir,Vijay Balasubramanian,Pratik Chaudhari
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech)
*备注:

点击查看摘要

Abstract:Many language tasks can be modeled as classification problems where a large language model (LLM) is given a prompt and selects one among many possible answers. We show that the classification error in such problems scales as a power law in the number of classes. This has a dramatic consequence: the prediction error can be reduced substantially by splitting the overall task into a sequence of smaller classification problems, each with the same number of classes (“degree”). This tree-structured decomposition models chain-of-thought (CoT). It has been observed that CoT-based predictors perform better when they “think”, i.e., when they develop a deeper tree, thus decomposing the problem into a larger number of steps. We identify a critical threshold for the degree, below which thinking is detrimental, and above which there exists an optimal depth that minimizes the error. It is impossible to surpass this minimal error by increasing the depth of thinking.
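The abstract's argument can be made concrete with a toy model: assume a per-step error of a * d**gamma for a degree-d classification and a union bound over the k = log_d(C) steps of the chain. The constants a and gamma below are purely illustrative (not from the paper); the sketch just shows that splitting a large C-way task into low-degree steps can reduce total error:

```python
import math

def chain_error(C, d, a=0.001, gamma=0.5):
    """Toy total error of a depth-k chain of degree-d classifications
    solving a C-way task, assuming per-step error a * d**gamma and a
    union bound over the k = log_d(C) steps."""
    k = math.log(C) / math.log(d)
    return k * a * d ** gamma

direct = chain_error(4096, 4096)  # answer the 4096-way task in one step
chained = chain_error(4096, 2)    # twelve binary steps instead
```

Minimizing k * a * d**gamma / over d in this toy model gives an interior optimum at d = e**(1/gamma), loosely echoing the paper's critical degree threshold below which deeper "thinking" stops helping.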

[LG-38] Finite-Sample Analysis of Nonlinear Independent Component Analysis:Sample Complexity and Identifiability Bounds

链接: https://arxiv.org/abs/2604.08850
作者: Yuwen Jiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Independent Component Analysis (ICA) is a fundamental unsupervised learning technique for uncovering latent structure in data by separating mixed signals into their independent sources. While substantial progress has been made in establishing asymptotic identifiability guarantees for nonlinear ICA, the finite-sample statistical properties of learning algorithms remain poorly understood. This gap poses significant challenges for practitioners who must determine appropriate sample sizes for reliable source recovery. This paper presents a comprehensive finite-sample analysis of nonlinear ICA with neural network encoders, providing the first complete characterization with matching upper and lower bounds. Our theoretical development introduces three key technical contributions. First, we establish a direct relationship between excess risk and identification error that bypasses parameter-space arguments, thereby avoiding the rate degradation that would otherwise yield suboptimal scaling. Second, we prove matching information-theoretic lower bounds that confirm the optimality of our sample complexity results. Third, we extend our analysis to practical SGD optimization, showing that the same sample efficiency can be achieved with finite-iteration gradient descent under standard landscape assumptions. We validate our theoretical predictions through carefully designed simulation experiments. This gap points toward valuable future research on finite-sample behavior of neural network training and highlights the importance of our validated scaling laws for dimension and diversity.

[LG-39] Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance

链接: https://arxiv.org/abs/2604.08844
作者: Roi Paul
类目: Machine Learning (cs.LG)
*备注: 15 pages, 8 figures, pre-registered experiment, data at this https URL

点击查看摘要

Abstract:We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on Llama-3.2-3B-Instruct, we manufacture 38 LoRA adapters across four categories: healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters, and extract per-layer spectral features (norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment to a healthy centroid). Within a single training method (DPO), a logistic regression classifier achieves AUC 1.00 on binary drift detection, all six pairwise objective comparisons, and near-perfect ordinal severity ranking ( \rho \geq 0.956 ). Principal component analysis on flattened weight deltas reveals that training objective is PC1 (AUC 1.00 for objective separation), orthogonal to training duration on PC2. Query-projection weights detect that drift occurred; value-projection weights identify which objective. Cross-method generalization fails completely: a DPO-trained classifier assigns every steering adapter a lower drift score than every DPO adapter (AUC 0.00). In a behavioral evaluation phase, DPO-inverted-harmlessness adapters show elevated harmful compliance on HEx-PHI prompts (mean ASR 0.266 vs. healthy 0.112, \Delta = +0.154 ), with near-perfect dose–response ( \rho = 0.986 ). The geometry-to-behavior rank correlation is \rho = 0.72 across 24 non-steered adapters. These results establish that within a controlled manufacturing regime, LoRA weight-space geometry carries objective identity, intensity ordering, and a coarse link to harmful compliance, and that cross-method monitoring requires per-method calibration.

[LG-40] Discrete Meanflow Training Curriculum

链接: https://arxiv.org/abs/2604.08837
作者: Chia-Hong Hsu,Frank Wood
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Flow-based image generative models exhibit stable training and produce high quality samples when using multi-step sampling procedures. One-step generative models can produce high quality image samples but can be difficult to optimize as they often exhibit unstable training dynamics. Meanflow models exhibit excellent few-step sampling performance and tantalizing one-step sampling performance. Notably, MeanFlow models that achieve this have required extremely large training budgets. We significantly decrease the amount of computation and data budget it takes to train Meanflow models by noting and exploiting a particular discretization of the Meanflow objective that yields a consistency property, which we formulate into a “Discrete Meanflow” (DMF) Training Curriculum. Initialized with a pretrained Flow Model, the DMF curriculum reaches one-step FID 3.36 on CIFAR-10 in only 2000 epochs. We anticipate that faster training curricula for Meanflow models, specifically those fine-tuned from existing Flow Models, will enable efficient training of future one-step generative models.

[LG-41] Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis

链接: https://arxiv.org/abs/2604.08829
作者: Giansalvo Cirrincione
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注: 20 pages, 3 figures, 8 tables submitted to Neurocomputing

点击查看摘要

Abstract:The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal downsampling, combining level-specific score matrices through learned convex weights. The total computational cost is bounded by 4/3 times that of standard attention, reaching 1.3125x for L = 3. Four theoretical results are established. (i) The hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition on the symmetrised bilinear form (Proposition 3.1). (ii) The asymmetric score matrix decomposes uniquely into a symmetric part controlling reciprocal attention and an antisymmetric part controlling directional attention; HKT provides L independent such pairs across scales, one per resolution level (Propositions 3.5-3.6). (iii) The approximation error decomposes into three interpretable components with an explicit non-Gaussian correction and a geometric decay bound in L (Theorem 4.3, Proposition 4.4). (iv) HKT strictly subsumes single-head standard attention and causal convolution (Proposition 3.4). Experiments over 3 random seeds show consistent gains over retrained standard attention baselines: +4.77pp on synthetic ListOps (55.10±0.29% vs 50.33±0.12%, T = 512), +1.44pp on sequential CIFAR-10 (35.45±0.09% vs 34.01±0.19%, T = 1,024), and +7.47pp on IMDB character-level sentiment (70.19±0.57% vs 62.72±0.40%, T = 1,024), all at 1.31x overhead.

[LG-42] Loom: A Scalable Analytical Neural Computer Architecture

链接: https://arxiv.org/abs/2604.08816
作者: Mehmet Kerem Turkcan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Loom, a computer architecture that executes programs compiled from C inside a looped transformer whose weights are derived analytically. The architecture implements a 22-opcode instruction set in 8 transformer layers. Each forward pass executes one instruction; the model is applied iteratively until the program counter reaches zero. The full machine state resides in a single tensor X \in \mathbbR^d \times n of fixed size, and every step has fixed cost for fixed d and n , independent of program length or execution history. The default configuration uses d = 155 and n = 1024 , yielding 4.7 million parameters and 928 instruction slots. A compact configuration at d = 146 and n = 512 suffices for a 9 \times 9 Sudoku solver (284 instructions). The weights are program-independent: programs live in the state tensor, and the same fixed-weight model executes any compiled program. We make Loom source code publicly available at this https URL.

[LG-43] Structural Evaluation Metrics for SVG Generation via Leave-One-Out Analysis

链接: https://arxiv.org/abs/2604.08809
作者: Haonan Zhu,Adrienne Deganutti,Elad Hirsch,Purvanshi Mehta
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Scalable Vector Graphics (SVG) represent visual content as structured, editable code. Each element (path, shape, or text node) can be individually inspected, transformed, or removed. This structural editability is a main motivation for SVG generation, yet prevailing evaluation protocols primarily reduce the output to a single similarity score against a reference image or input texts, measuring how faithfully the result reproduces an image or follows the instructions, but not how well it preserves the structural properties that make SVG valuable. In particular, existing metrics cannot determine which generated elements contribute positively to overall visual quality, how visual concepts map to specific parts of the code, or whether the generated output supports meaningful downstream editing. We introduce element-level leave-one-out (LOO) analysis, inspired by the classic jackknife estimator. The procedure renders the SVG with and without each element, measures the resulting visual change, and derives a suite of structural quality metrics. Despite its simplicity, the jackknife’s capacity to decompose an aggregate statistic into per-sample contributions translates directly to this setting. From a single mechanism, we obtain: (1) quality scores per element through LOO scoring that enable zero-shot artifact detection; (2) concept-element attribution that maps each element to the visual concept it serves; and (3) four structural metrics, purity, coverage, compactness, and locality, that quantify SVG modularity from complementary perspectives. We validate these metrics on over 19,000 edits (5 types) across 5 generation systems and 3 complexity tiers.
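The LOO procedure itself is simple enough to sketch. The renderer and pixel metric below are toy stand-ins (the paper renders real SVG elements); only the render-with-and-without-each-element scoring loop mirrors the described method. An element whose removal causes no visual change, like the fully occluded square in the example, would surface as a zero-contribution artifact.

```python
# Element-level leave-one-out (LOO) scoring with a toy renderer: each
# "element" paints an axis-aligned rectangle of a given value on a small
# canvas, later elements painting over earlier ones.

def render(elements, size=8):
    canvas = [[0] * size for _ in range(size)]
    for (x0, y0, x1, y1, value) in elements:
        for y in range(y0, y1):
            for x in range(x0, x1):
                canvas[y][x] = value
    return canvas

def visual_change(a, b):
    """L1 distance between two rendered canvases."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def loo_scores(elements):
    """Score each element by how much the rendering changes when removed."""
    full = render(elements)
    return [visual_change(full, render(elements[:i] + elements[i + 1:]))
            for i in range(len(elements))]

elements = [(0, 0, 4, 4, 1),   # fully occluded by the element below: score 0
            (0, 0, 4, 4, 2)]   # visible square: removing it changes 16 pixels
scores = loo_scores(elements)
```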

[LG-44] Alleviating Community Fear in Disasters via Multi-Agent Actor-Critic Reinforcement Learning

链接: https://arxiv.org/abs/2604.08802
作者: Yashodhan D. Hakke,Almuatazbellah M. Boker,Lamine Mili,Michael von Spakovsky,Hoda Eldardiry
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:During disasters, cascading failures across power grids, communication networks, and social behavior amplify community fear and undermine cooperation. Existing cyber-physical-social (CPS) models simulate these coupled dynamics but lack mechanisms for active intervention. We extend the CPS resilience model of Valinejad and Mili (2023) with control channels for three agencies, communication, power, and emergency management, and formulate the resulting system as a three-player non-zero-sum differential game solved via online actor-critic reinforcement learning. Simulations based on Hurricane Harvey data show 70% mean fear reduction with improved infrastructure recovery; cross-validation in the case of Hurricane Irma (without refitting) achieves 50% fear reduction, confirming generalizability.

[LG-45] Tracing the Chain: Deep Learning for Stepping-Stone Intrusion Detection

链接: https://arxiv.org/abs/2604.08800
作者: Nate Mathews,Nicholas Hopper,Matthew Wright
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stepping-stone intrusions (SSIs) are a prevalent network evasion technique in which attackers route sessions through chains of compromised intermediate hosts to obscure their origin. Effective SSI detection requires correlating the incoming and outgoing flows at each relay host at extremely low false positive rates – a stringent requirement that renders classical statistical methods inadequate in operational settings. We apply ESPRESSO, a deep learning flow correlation model combining a transformer-based feature extraction network, time-aligned multi-channel interval features, and online triplet metric learning, to the problem of stepping-stone intrusion detection. To support training and evaluation, we develop a synthetic data collection tool that generates realistic stepping-stone traffic across five tunneling protocols: SSH, SOCAT, ICMP, DNS, and mixed multi-protocol chains. Across all five protocols and in both host-mode and network-mode detection scenarios, ESPRESSO substantially outperforms the state-of-the-art DeepCoFFEA baseline, achieving a true positive rate exceeding 0.99 at a false positive rate of 10^-3 for standard bursty protocols in network-mode. We further demonstrate chain length prediction as a tool for distinguishing malicious from benign pivoting, and conduct a systematic robustness analysis revealing that timing-based perturbations are the primary vulnerability of correlation-based stepping-stone detectors.

[LG-46] Toward Hardware-Agnostic Quadrupedal World Models via Morphology Conditioning

链接: https://arxiv.org/abs/2604.08780
作者: Mohamad H. Danesh,Chenhao Li,Amin Abyaneh,Anas Houssaini,Kirsty Ellis,Glen Berseth,Marco Hutter,Hsiu-Chin Lin
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:World models promise a paradigm shift in robotics, where an agent learns the underlying physics of its environment once to enable efficient planning and behavior learning. However, current world models are often hardware-locked specialists: a model trained on a Boston Dynamics Spot robot fails catastrophically on a Unitree Go1 due to the mismatch in kinematic and dynamic properties, as the model overfits to specific embodiment constraints rather than capturing the universal locomotion dynamics. Consequently, a slight change in actuator dynamics or limb length necessitates training a new model from scratch. In this work, we take a step towards a framework for training a generalizable Quadrupedal World Model (QWM) that disentangles environmental dynamics from robot morphology. We address the limitations of implicit system identification, where treating static physical properties (like mass or limb length) as latent variables to be inferred from motion history creates an adaptation lag that can compromise zero-shot safety and efficiency. Instead, we explicitly condition the generative dynamics on the robot’s engineering specifications. By integrating a physical morphology encoder and a reward normalizer, we enable the model to serve as a neural simulator capable of generalizing across morphologies. This capability unlocks zero-shot control across a range of embodiments. We introduce, for the first time, a world model that enables zero-shot generalization to new morphologies for locomotion. While we carefully study the limitations of our method (QWM operates as a distribution-bounded interpolator within the quadrupedal morphology family rather than as a universal physics engine), this work represents a significant step toward morphology-conditioned world models for legged locomotion.

[LG-47] Adaptive Simulation Experiment for LLM Policy Optimization

链接: https://arxiv.org/abs/2604.08779
作者: Mingjie Hu,Siyang Gao,Jian-qiang Hu,Enlu Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have significant potential to improve operational efficiency in operations management. Deploying these models requires specifying a policy that governs response quality, shapes user experience, and influences operational value. In this research, we treat LLMs as stochastic simulators and propose a pairwise comparison-based adaptive simulation experiment framework for identifying the optimal policy from a finite set of candidates. We consider two policy spaces: an unstructured space with no parametric assumption, and a structured space in which the data are generated from a preference model. For both settings, we characterize the fundamental data requirements for identifying the optimal policy with high probability. In the unstructured case, we derive a closed-form expression for the optimal sampling proportions, together with a clear operational interpretation. In the structured case, we formulate a regularized convex program to compute the optimal proportions. We then develop an adaptive experimental procedure, termed LLM-PO, for both policy spaces, and prove that it identifies the optimal policy with the desired statistical guarantee while asymptotically attaining the fundamental data requirements. Numerical experiments demonstrate that LLM-PO consistently outperforms benchmark methods and improves LLM performance.

[LG-48] Accurate and Reliable Uncertainty Estimates for Deterministic Predictions: Extensions to Under- and Overpredictions

链接: https://arxiv.org/abs/2604.08755
作者: Rileigh Bandy,Enrico Camporeale,Andong Hu,Thomas Berger,Rebecca Morrison
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Computational models support high-stakes decisions across engineering and science, and practitioners increasingly seek probabilistic predictions to quantify uncertainty in such models. Existing approaches generate predictions either by sampling input parameter distributions or by augmenting deterministic outputs with uncertainty representations, including distribution-free and distributional methods. However, sampling-based methods are often computationally prohibitive for real-time applications, and many existing uncertainty representations either ignore input dependence or rely on restrictive Gaussian assumptions that fail to capture asymmetry and heavy-tailed behavior. Therefore, we extend the ACCurate and Reliable Uncertainty Estimate (ACCRUE) framework to learn input-dependent, non-Gaussian uncertainty distributions, specifically two-piece Gaussian and asymmetric Laplace forms, using a neural network trained with a loss function that balances predictive accuracy and reliability. Through synthetic and real-world experiments, we show that the proposed approach captures an input-dependent uncertainty structure and improves probabilistic forecasts relative to existing methods, while maintaining flexibility to model skewed and non-Gaussian errors.
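As one concrete instance of an asymmetric error form, a two-piece Gaussian negative log-likelihood can be written down directly. This is a sketch under the textbook parameterization, with separate scales for negative and positive residuals; the paper's actual loss additionally balances predictive accuracy against reliability and is not reproduced here.

```python
# Two-piece Gaussian negative log-likelihood: sigma_left scales negative
# residuals (underpredictions) and sigma_right scales positive residuals
# (overpredictions). With equal scales it reduces to an ordinary Gaussian.

import math

def two_piece_gaussian_nll(residual, sigma_left, sigma_right):
    """-log density of a two-piece Gaussian at the given residual."""
    norm = math.log(math.sqrt(2 * math.pi) * (sigma_left + sigma_right) / 2)
    sigma = sigma_left if residual < 0 else sigma_right
    return norm + residual ** 2 / (2 * sigma ** 2)

# With a heavier right tail (sigma_right > sigma_left), a positive residual
# is cheaper than a negative residual of the same magnitude.
pos = two_piece_gaussian_nll(1.0, 0.5, 2.0)
neg = two_piece_gaussian_nll(-1.0, 0.5, 2.0)
```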

[LG-49] IKKA: Inversion Classification via Critical Anomalies for Robust Visual Servoing NEURIPS2026

链接: https://arxiv.org/abs/2604.08754
作者: Darya Pavlenko
类目: Machine Learning (cs.LG)
*备注: 9 pages, 2 figures, 3 tables. Submitted to NeurIPS 2026

点击查看摘要

Abstract:We introduce IKKA (Inversion Classification via Critical Anomalies), a topologically motivated weighting framework for robust visual servoing under distribution shift. Unlike conventional outlier handling, IKKA treats maverick points as structurally informative observations: points where small perturbations can induce qualitatively different control responses or class assignments. The method combines local extremality, boundary transversality, and multi-scale persistence into a single anomaly weight, W(x) = E(x) × T(x) × M(x), which modulates control updates near ambiguous decision regions. We instantiate IKKA in a CPU-only embedded visual-servoing pipeline on a Raspberry Pi 4 and evaluate it across 230 reproducible runs under nominal and stress conditions. In stress scenarios involving dim illumination and transient occlusion, IKKA reduces the 95th-percentile lateral error by 24% relative to a hybrid baseline (0.124 to 0.094) while increasing throughput from 20.0 to 24.8 Hz. Non-parametric analysis confirms a large effect size (Cliff’s delta = 0.79).
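The multiplicative form of the weight can be sketched in a few lines. The factor values below are illustrative placeholders in [0, 1] (the paper derives E, T, M from local extremality, boundary transversality, and multi-scale persistence, which are not computed here), and the gain-attenuation rule is one plausible modulation, not the paper's exact controller.

```python
# Toy sketch of IKKA's anomaly weight W(x) = E(x) * T(x) * M(x) and a
# hypothetical way it could modulate a control update. Factor estimators
# and the modulation rule are assumptions for illustration only.

def anomaly_weight(e, t, m):
    """Multiplicative combination: any factor near zero suppresses the weight."""
    assert all(0.0 <= v <= 1.0 for v in (e, t, m))
    return e * t * m

def modulated_gain(base_gain, w):
    """Attenuate the control update near high-weight (ambiguous) regions."""
    return base_gain * (1.0 - w)

# A point that is extremal and persistent but not boundary-transversal
# receives zero weight, so the controller keeps its full gain there.
w = anomaly_weight(0.9, 0.0, 0.8)
gain = modulated_gain(0.5, w)
```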

[LG-50] Adversarial Sensor Errors for Safe and Robust Wind Turbine Fleet Control

链接: https://arxiv.org/abs/2604.08750
作者: Julian Quick,Marcus Binder Nilsen,Andreas Bechmann,Tran Nguyen Le,Pierre-Elouan Mikael Rethore
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Submitted to Journal of Physics: Conference Series (Torque 2026). This is the Accepted Manuscript version of an article accepted for publication in Journal of Physics: Conference Series. IOP Publishing Ltd is not responsible for any errors or omissions in this version of the manuscript or any version derived from it. This Accepted Manuscript is published under a CC BY licence

点击查看摘要

Abstract:Plant-level control is an emerging wind energy technology that presents opportunities and challenges. By controlling turbines in a coordinated manner via a central controller, it is possible to achieve greater wind power plant efficiency. However, there is a risk that measurement errors will confound the process, or even that hackers will alter the telemetry signals received by the central controller. This paper presents a framework for developing a safe plant controller by training it with an adversarial agent designed to confound it. This necessitates training the adversary to confound the controller, creating a sort of circular logic or “Arms Race.” This paper examines three broad training approaches for co-training the protagonist and adversary, finding that an Arms Race approach yields the best results. These initial results indicate that the Arms Race adversarial training reduced worst-case performance degradation from 39% power loss to 7.9% power gain relative to a baseline operational strategy.

[LG-51] A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need

链接: https://arxiv.org/abs/2604.08749
作者: Hananel Hazan,Yanbo Zhang,Benedikt Hartl,Michael Levin
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:How many of a neural network’s parameters actually encode task-specific information? We investigate this question with LottaLoRA, a training paradigm in which every backbone weight is drawn at random and frozen; only low-rank LoRA adapters are trained. Across nine benchmarks spanning diverse architecture families, from single-layer classifiers to 900M-parameter Transformers, low-rank adapters over frozen random backbones recover 96-100% of fully trained performance while training only 0.5-40% of the parameters. The task-specific signal therefore occupies a subspace orders of magnitude smaller than the full parameter count. Three mechanistic findings underpin this result: (1) the frozen backbone is actively exploited: when static, the learned scaling \beta remains strictly positive across all architectures, but when the scaffold is destabilized, the optimizer silences it and the LoRA factors absorb all task information; (2) the frozen backbone is preferable but interchangeable: any random initialization works equally well, provided it remains fixed throughout training; and (3) the minimum LoRA rank at which performance saturates estimates the intrinsic dimensionality of the task, reminiscent of the number of components retained in Principal Component Analysis (PCA). The construction is formally analogous to Reservoir Computing unfolded along the depth axis of a feedforward network. Because the backbone is determined by a random seed alone, models can be distributed as adapters plus a seed, a footprint that grows with task complexity, not model size, so that storage and memory savings compound as architectures scale.
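The frozen-random-backbone construction is easy to sketch. The toy layer below uses plain Python matrices; the additive form W + beta * (B @ A) follows standard LoRA, while the dimensions, seed, and rank-1 factors are illustrative choices, not the paper's configuration.

```python
# Sketch of a LottaLoRA-style layer: the dense backbone weight W is drawn
# from a random seed and never updated; only the low-rank factors A, B and
# the scaling beta would be trained. The backbone is "seed plus shapes".

import random

def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_forward(x, w_frozen, a, b, beta):
    """y = (W + beta * B A) x, with W never updated during training."""
    delta = matmul(b, a)                      # rank-r update, r = len(a)
    w_eff = [[w_frozen[i][j] + beta * delta[i][j] for j in range(len(delta[0]))]
             for i in range(len(delta))]
    return matmul(w_eff, [[v] for v in x])

random.seed(0)                                # the backbone is the seed alone
d = 3
w_frozen = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
a = [[0.0] * d]                               # rank-1 factors start at zero,
b = [[0.0] for _ in range(d)]                 # so the adapter is a no-op at init
y0 = lora_forward([1.0, 2.0, 3.0], w_frozen, a, b, beta=1.0)
```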

[LG-52] RansomTrack: A Hybrid Behavioral Analysis Framework for Ransomware Detection

链接: https://arxiv.org/abs/2604.08739
作者: Busra Caliskan,Ibrahim Gulatas,H. Hakan Kilinc,A. Halim Zaim
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:Ransomware poses a serious and fast-acting threat to critical systems, often encrypting files within seconds of execution. Research indicates that ransomware is the most reported cybercrime in terms of financial damage, highlighting the urgent need for early-stage detection before encryption is complete. In this paper, we present RansomTrack, a hybrid behavioral analysis framework to eliminate the limitations of using static and dynamic detection methods separately. Static features are extracted using the Radare2 sandbox, while dynamic behaviors such as memory protection changes, mutex creation, registry access and network activity are obtained using the Frida toolkit. Our dataset of 165 different ransomware and benign software families is publicly released, offering the highest family-to-sample ratio known in the literature. Experimental evaluation using machine learning models shows that ensemble classifiers such as XGBoost and Soft Voting achieve up to 96% accuracy and a ROC-AUC score of 0.99. Each sample, analyzed in 9.1 seconds, includes modular behavioral logging, runtime instrumentation, and SHAP-based interpretability to highlight the most influential features. Additionally, the RansomTrack framework is able to detect ransomware in under 9.2 seconds. Overall, RansomTrack offers a scalable, low-latency, and explainable solution for real-time ransomware detection.

[LG-53] Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2604.08728
作者: Diyi Hu,Bhaskar Krishnamachari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cooperation in multi-agent reinforcement learning (MARL) benefits from inter-agent communication, yet most approaches assume idealized channels and existing value decomposition methods ignore who successfully shared information with whom. We propose CLOVER, a cooperative MARL framework whose centralized value mixer is conditioned on the communication graph realized under a realistic wireless channel. This graph introduces a relational inductive bias into value decomposition, constraining how individual utilities are mixed based on the realized communication structure. The mixer is a GNN with node-specific weights generated by a Permutation-Equivariant Hypernetwork: multi-hop propagation along communication edges reshapes credit assignment so that different topologies induce different mixing. We prove this mixer is permutation invariant, monotonic (preserving the IGM condition), and strictly more expressive than QMIX-style mixers. To handle realistic channels, we formulate an augmented MDP isolating stochastic channel effects from the agent computation graph, and employ a stochastic receptive field encoder for variable-size message sets, enabling end-to-end differentiable training. On Predator-Prey and Lumberjacks benchmarks under p-CSMA wireless channels, CLOVER consistently improves convergence speed and final performance over VDN, QMIX, TarMAC+VDN, and TarMAC+QMIX. Behavioral analysis confirms agents learn adaptive signaling and listening strategies, and ablations isolate the communication-graph inductive bias as the key source of improvement.

[LG-54] Efficient RL Training for LLMs with Experience Replay

链接: https://arxiv.org/abs/2604.08706
作者: Charles Arnal,Vivien Cabannes,Taco Cohen,Julia Kempe,Remi Munos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.
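A minimal replay buffer capturing the staleness/reuse trade-off might look as follows. The `max_reuse` and `max_staleness` knobs and the eviction rule are assumptions for illustration; the paper studies this design space rather than prescribing these exact limits.

```python
# Sketch of a replay buffer for RL post-training: each stored rollout
# records the policy version that generated it; rollouts are evicted once
# too stale (version gap) or reused too many times. Both limits bound the
# staleness-induced variance at the cost of generating fresh rollouts.

import collections, random

class ReplayBuffer:
    def __init__(self, max_reuse=4, max_staleness=2):
        self.max_reuse = max_reuse
        self.max_staleness = max_staleness
        self.items = collections.deque()   # entries: [rollout, version, uses]

    def add(self, rollout, policy_version):
        self.items.append([rollout, policy_version, 0])

    def sample(self, current_version, k, rng=random):
        # evict rollouts that are too stale or already reused too often
        self.items = collections.deque(
            [it for it in self.items
             if current_version - it[1] <= self.max_staleness
             and it[2] < self.max_reuse])
        batch = rng.sample(list(self.items), min(k, len(self.items)))
        for it in batch:
            it[2] += 1                     # count this reuse
        return [it[0] for it in batch]
```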

[LG-55] QoS-QoE Translation with Large Language Model

链接: https://arxiv.org/abs/2604.08703
作者: Yingjie Yu,Mingyuan Wu,Ahmadreza Eslaminia,Lingzhi Zhao,Kaizhuo Yan,Klara Nahrstedt
类目: Multimedia (cs.MM); Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:QoS-QoE translation is a fundamental problem in multimedia systems because it characterizes how measurable system and network conditions affect user-perceived experience. Although many prior studies have examined this relationship, their findings are often developed for specific setups and remain scattered across papers, experimental settings, and reporting formats, limiting systematic reuse, cross-scenario generalization, and large-scale analysis. To address this gap, we first introduce the QoS-QoE Translation dataset, a source-grounded dataset of structured QoS-QoE relationships from the multimedia literature, with a focus on video streaming related tasks. We construct the dataset through an automated pipeline that combines paper curation, QoS-QoE relationship extraction, and iterative data evaluation. Each record preserves the extracted relationship together with parameter definitions, supporting evidence, and contextual metadata. We further evaluate the capability of large language models (LLMs) on QoS-QoE translation, both before and after supervised fine-tuning on our dataset, and show strong performance on both continuous-value and discrete-label prediction in bidirectional translation, from QoS to QoE and from QoE to QoS. Our dataset provides a foundation for benchmarking LLMs in QoS-QoE translation and for supporting future LLM-based reasoning for multimedia quality prediction and optimization. The complete dataset and code are publicly available at this https URL, for full reproducibility and open access.

[LG-56] EvoLen: Evolution-Guided Tokenization for DNA Language Model

链接: https://arxiv.org/abs/2604.08698
作者: Nan Huang,Xiaoxiao Zhou,Junxia Cui,Mario Tapia-Pacheco,Tiffany Amariuta,Yang Li,Jingbo Shang
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Tokens serve as the basic units of representation in DNA language models (DNALMs), yet their design remains underexplored. Unlike natural language, DNA lacks inherent token boundaries or predefined compositional rules, making tokenization a fundamental modeling decision rather than a naturally specified one. While existing approaches like byte-pair encoding (BPE) excel at capturing token structures that reflect human-generated linguistic regularities, DNA is organized by biological function and evolutionary constraint rather than linguistic convention. We argue that DNA tokenization should prioritize functional sequence patterns like regulatory motifs-short, recurring segments under evolutionary constraint and typically preserved across species. We incorporate evolutionary information directly into the tokenization process through EvoLen, a tokenizer that combines evolutionary stratification with length-aware decoding to better preserve motif-scale functional sequence units. EvoLen uses cross-species evolutionary signals to group DNA sequences, trains separate BPE tokenizers on each group, merges the resulting vocabularies via a rule prioritizing preserved patterns, and applies length-aware decoding with dynamic programming. Through controlled experiments, EvoLen improves the preservation of functional sequence patterns, differentiation across genomic contexts, and alignment with evolutionary constraint, while matching or outperforming standard BPE across diverse DNALM benchmarks. These results demonstrate that tokenization introduces a critical inductive bias and that incorporating evolutionary information yields more biologically meaningful and interpretable sequence representations.
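The length-aware decoding step can be sketched as a dynamic program over a fixed vocabulary. The quadratic length score and the toy vocabulary below are assumptions for illustration; the real EvoLen decoder works over the merged, evolution-stratified vocabularies and its own scoring, which this sketch omits.

```python
# Dynamic-programming segmentation of a DNA string into vocabulary tokens,
# scoring each token by squared length so that longer (motif-scale) tokens
# are preferred over runs of single bases.

def segment(seq, vocab):
    n = len(seq)
    best = [float("-inf")] * (n + 1)   # best score for a prefix of length j
    back = [None] * (n + 1)            # backpointer to the previous cut
    best[0] = 0.0
    for i in range(n):
        if best[i] == float("-inf"):
            continue
        for j in range(i + 1, n + 1):
            tok = seq[i:j]
            if tok in vocab:
                score = best[i] + len(tok) ** 2   # favor longer tokens
                if score > best[j]:
                    best[j] = score
                    back[j] = i
    if best[n] == float("-inf"):
        return None                    # sequence not decodable with this vocab
    tokens, j = [], n
    while j > 0:
        i = back[j]
        tokens.append(seq[i:j])
        j = i
    return tokens[::-1]

vocab = {"A", "C", "G", "T", "TATA", "AT"}
tokens = segment("TATAAT", vocab)      # the TATA motif is kept whole
```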

[LG-57] Creator Incentives in Recommender Systems: A Cooperative Game-Theoretic Approach for Stable and Fair Collaboration in Multi-Agent Bandits AISTATS2026

链接: https://arxiv.org/abs/2604.08643
作者: Ramakrishnan Krishnamurthy,Arpit Agarwal,Lakshminarayanan Subramanian,Maximilian Nickel
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Social and Information Networks (cs.SI)
*备注: Accepted in AISTATS 2026 as an Oral Presentation

点击查看摘要

Abstract:User interactions in online recommendation platforms create interdependencies among content creators: feedback on one creator’s content influences the system’s learning and, in turn, the exposure of other creators’ contents. To analyze incentives in such settings, we model collaboration as a multi-agent stochastic linear bandit problem with a transferable utility (TU) cooperative game formulation, where a coalition’s value equals the negative sum of its members’ cumulative regrets. We show that, for identical (homogenous) agents with fixed action sets, the induced TU game is convex under mild algorithmic conditions, implying a non-empty core that contains the Shapley value and ensures both stability and fairness. For heterogeneous agents, the game still admits a non-empty core, though convexity and Shapley value core-membership are no longer guaranteed. To address this, we propose a simple regret-based payout rule that satisfies three out of the four Shapley axioms and also lies in the core. Experiments on MovieLens-100k dataset illustrate when the empirical payout aligns with – and diverges from – the Shapley fairness across different settings and algorithms.

[LG-58] Reservoir observer enhanced with residual calibration and attention mechanism

链接: https://arxiv.org/abs/2604.08592
作者: Yichen Liu,Wei Xiao,Tianguang Chu
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注:

点击查看摘要

Abstract:Reservoir observers provide a data-driven approach to the inference of unmeasured variables from observed ones for nonlinear dynamical systems. While previous studies have demonstrated wide applicability, their performance may vary considerably with different input variables, even compromising reliability in the worst cases. To enhance the performance of inference, we integrate residual calibration and attention mechanism into the reservoir observer design. The residual calibration module leverages information from the estimation residuals to refine the observer output, and the attention mechanism exploits the temporal dependencies of the data to enrich the representation of reservoir internal dynamics. Experiments on typical chaotic systems demonstrate that our method substantially improves inference accuracy, especially for the worst cases resulting from the traditional reservoir observers. We also invoke the notion of transfer entropy to explain the reason for the input-dependent observation discrepancy and the effectiveness of the proposed method.

[LG-59] EngageTriBoost: Predictive Modeling of User Engagement in Digital Mental Health Intervention Using Explainable Machine Learning

链接: https://arxiv.org/abs/2604.08589
作者: Ha Na Cho,Daniel Eisenberg,Cheryl King,Kai Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mental health challenges among young adults are on the rise, necessitating effective solutions such as digital mental health interventions (DMHIs). Despite their promise, DMHIs face significant adoption barriers, including low initial uptake and high dropout rates. This study leverages machine learning (ML) to analyze behavioral patterns of users of a DMHI, eBridge, designed to increase the utilization of professional mental health services among at-risk college students through motivational interviewing-based online counseling. Our ensemble model, EngageTriBoost, achieved up to 84% accuracy in predicting engagement, measured by sign-ins and counselor interactions. We then applied Shapley Additive exPlanations (SHAP) analysis, which provided clear, interpretable insights into key factors influencing user engagement, such as emotional dysregulation and perceived stigma, highlighting their critical effect on DMHI adoption. This study demonstrates the power of explainable ML for better understanding user engagement with DMHIs to improve their adoption and achievable impact on mental health outcomes.

[LG-60] Fully Autonomous Z-Score-Based TinyML Anomaly Detection on Resource-Constrained MCUs Using Power Side-Channel Data ATC2026

链接: https://arxiv.org/abs/2604.08581
作者: Abdulrahman Albaiz,Fathi Amsaad
类目: Machine Learning (cs.LG)
*备注: SaTC 2026 Conference

点击查看摘要

Abstract:This paper presents a fully autonomous Tiny Machine Learning (TinyML) Z-Score-based anomaly detection system deployed on a low-power microcontroller for real-time monitoring of appliance behavior using power side-channel data. Unlike existing Internet of Things (IoT) anomaly detection approaches that rely on offline training or cloud-assisted analytics, the proposed system performs both model training and inference directly on a resource-constrained microcontroller without external computation or connectivity. The system continuously samples current consumption, computes Root Mean Square (RMS) values on-device, and derives statistical parameters during an initial training phase. Anomalies are detected using lightweight Z-Score thresholds, enabling interpretable and computationally efficient inference suitable for embedded deployment. The architecture was implemented on an STM32-based platform and evaluated using a 14-day dataset collected from a household mini-fridge under normal operation and controlled anomaly conditions. Results demonstrate perfect detection performance, with Precision and Recall of 1.00, inference latencies on the order of tens of microseconds, and a total memory footprint of approximately 3.3 KB SRAM and 63 KB Flash. These results confirm that robust and fully autonomous TinyML anomaly detection can be achieved on low-cost microcontrollers. Future work includes extending the framework to incorporate additional lightweight models and multi-device learning scenarios.
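The on-device pipeline reduces to a few lines of arithmetic. The window size, threshold, and toy RMS values below are illustrative assumptions; only the RMS-then-Z-score structure follows the description above.

```python
# Sketch of the described pipeline: compute RMS over a window of current
# samples, learn mean/std of RMS during a training phase, then flag windows
# whose Z-score exceeds a threshold. Numbers are toy values, not the
# paper's mini-fridge data or exact configuration.

import math

def rms(window):
    return math.sqrt(sum(x * x for x in window) / len(window))

class ZScoreDetector:
    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.mu = self.sigma = None

    def train(self, rms_values):
        n = len(rms_values)
        self.mu = sum(rms_values) / n
        var = sum((v - self.mu) ** 2 for v in rms_values) / n
        self.sigma = math.sqrt(var) or 1e-9   # guard against zero variance

    def is_anomaly(self, window):
        z = abs(rms(window) - self.mu) / self.sigma
        return z > self.threshold

det = ZScoreDetector(threshold=3.0)
det.train([1.0, 1.1, 0.9, 1.05, 0.95])        # RMS of normal cycles (toy)
normal = det.is_anomaly([1.0, 1.0, 1.0])      # near the learned mean
spike = det.is_anomaly([5.0, 5.0, 5.0])       # abnormally large current draw
```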

[LG-61] Memory-Guided Trust-Region Bayesian Optimization (MG-TuRBO) for High Dimensions

链接: https://arxiv.org/abs/2604.08569
作者: Abhilasha Saroj,Shaked Regev,Guanhao Xu,Jinghui Yuan,Roy Luo,Ross Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traffic simulation and digital-twin calibration is a challenging optimization problem with a limited simulation budget. Each trial requires an expensive simulation run, and the relationship between calibration inputs and model error is often nonconvex and noisy. The problem becomes more difficult as the number of calibration parameters increases. We compare a commonly used automatic calibration method, a genetic algorithm (GA), with Bayesian optimization methods (BOMs): classical Bayesian optimization (BO), Trust-Region BO (TuRBO), Multi-TuRBO, and a proposed Memory-Guided TuRBO (MG-TuRBO) method. We compare performance on 2 real-world traffic simulation calibration problems with 14 and 84 decision variables, representing lower- and higher-dimensional (14D and 84D) settings. For BOMs, we study two acquisition strategies, Thompson sampling and a novel adaptive strategy. We evaluate performance using final calibration quality, convergence behavior, and consistency across runs. The results show that BOMs reach good calibration targets much faster than GA in the lower-D problem. MG-TuRBO performs comparably in our 14D setting, while it demonstrates noticeable advantages in the 84D problem, particularly when paired with our adaptive strategy. Our results suggest that MG-TuRBO is especially useful for high-D traffic simulation calibration and potentially for high-D problems in general.

[LG-62] Self-Sovereign Agent

链接: https://arxiv.org/abs/2604.08551
作者: Wenjie Qu,Xuandong Zhao,Jiaheng Zhang,Dawn Song
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate the emerging prospect of self-sovereign agents – AI systems that can economically sustain and extend their own operation without human involvement. Recent advances in large language models and agent frameworks have substantially expanded agents’ practical capabilities, pointing toward a potential shift from developer-controlled tools to more autonomous digital actors. We analyze the remaining technical barriers to such deployments and discuss the security, societal, and governance challenges that could arise if such systems become practically viable. A project page is available at: this https URL.

[LG-63] An Open-Source Open Data Approach to Activity Classification from Triaxial Accelerometry in an Ambulatory Setting

链接: https://arxiv.org/abs/2604.09451
作者: Sepideh Nikookar,Edward Tian,Harrison Hoffman,Matthew Parks,J. Lucas McKay,Yashar Kiarashi,Tommy T. Thomas,Alex Hall,David W. Wright,Gari D. Clifford
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The accelerometer has become an almost ubiquitous device, providing enormous opportunities in healthcare monitoring beyond step counting or other average energy estimates in 15-60 second epochs. Objective: To develop an open data set with associated open-source code for processing 50 Hz tri-axial accelerometry data to classify patient activity levels and natural types of movement. Approach: Data were collected from 23 healthy subjects (16 males and seven females) aged between 23 and 62 years using an ambulatory device, which included a triaxial accelerometer and synchronous lead II equivalent ECG for an average of 26 minutes each. Participants followed a standardized activity routine involving five distinct activities: lying, sitting, standing, walking, and jogging. Two classifiers were constructed: a signal processing technique to distinguish between high and low activity levels and a convolutional neural network (CNN)-based approach to classify each of the five activities. Main results: The binary (high/low) activity classifier exhibited an F1 score of 0.79. The multi-class CNN-based classifier provided an F1 score of 0.83. The code for this analysis has been made available under an open-source license together with the data on which the classifiers were trained and tested. Significance: The classification of behavioral activity, as demonstrated in this study, offers valuable context for interpreting traditional health metrics and may provide contextual information to support the future development of clinical decision-making tools for patient monitoring, predictive analytics, and personalized health interventions.

[LG-64] Continuous Orthogonal Mode Decomposition: Haptic Signal Prediction in Tactile Internet

链接: https://arxiv.org/abs/2604.09446
作者: Mohammad Ali Vahedifar,Mojtaba Nazari,Qi Zhang
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Tactile Internet demands sub-millisecond latency and ultra-high reliability, as high latency or packet loss could lead to haptic control instability. To address this, we propose the Mode-Domain Architecture (MDA), a bilateral predictive neural network architecture designed to restore missing signals on both the human and robot sides. Unlike conventional models that extract features implicitly from raw data, MDA utilizes a novel Continuous-Orthogonal Mode Decomposition framework. By integrating an orthogonality constraint, we overcome the pervasive issue of “mode overlapping” found in state-of-the-art decomposition methods. Experimental results demonstrate that this structured feature extraction achieves high prediction accuracies of 98.6% (human) and 97.3% (robot). Furthermore, the model achieves ultra-low inference latency of 0.065 ms, significantly outperforming existing benchmarks and meeting the stringent real-time requirements of haptic teleoperation.

[LG-65] Beyond Augmented-Action Surrogates for Multi-Expert Learning-to-Defer

链接: https://arxiv.org/abs/2604.09414
作者: Yannis Montreuil,Axel Carlier,Lai Xing Ng,Wei Tsang Ooi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning-to-Defer routes each input to the expert that minimizes expected cost, but it assumes that the information available to every expert is fixed at decision time. Many modern systems violate this assumption: after selecting an expert, one may also choose what additional information that expert should receive, such as retrieved documents, tool outputs, or escalation context. We study this problem and call it Learning-to-Defer with advice. We show that a broad family of natural separated surrogates, which learn routing and advice with distinct heads, is inconsistent even in the smallest non-trivial setting. We then introduce an augmented surrogate that operates on the composite expert–advice action space and prove an \mathcal{H}-consistency guarantee together with an excess-risk transfer bound, yielding recovery of the Bayes-optimal policy in the limit. Experiments on tabular, language, and multi-modal tasks show that the resulting method improves over standard Learning-to-Defer while adapting its advice-acquisition behavior to the cost regime; a synthetic benchmark confirms the failure mode predicted for separated surrogates.
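The move from separated surrogates to the augmented surrogate can be pictured concretely: the learner scores each (expert, advice) pair as a single composite action instead of choosing the expert and the advice independently. A toy sketch, where all experts, advice options, and cost numbers are hypothetical:

```python
# Toy Learning-to-Defer with advice: pick the (expert, advice) pair that
# minimizes expected cost, treating the pair as one composite action.
# All experts, advice options, and cost values below are made up.
def defer_with_advice(expected_cost, advice_price):
    """expected_cost[(expert, advice)] -> error cost; advice_price[advice] -> acquisition cost."""
    totals = {(e, a): c + advice_price[a] for (e, a), c in expected_cost.items()}
    return min(totals, key=totals.get)

expected_cost = {
    ("model", "none"): 0.30, ("model", "retrieval"): 0.28,
    ("human", "none"): 0.20, ("human", "retrieval"): 0.05,
}
advice_price = {"none": 0.0, "retrieval": 0.10}
print(defer_with_advice(expected_cost, advice_price))  # ('human', 'retrieval')
```

Raising the advice price changes the optimum: with retrieval costing 0.2 instead of 0.1, the same routine returns ('human', 'none'), the kind of cost-regime adaptation the abstract reports.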

[LG-66] Sharp description of local minima in the loss landscape of high-dimensional two-layer ReLU neural networks

链接: https://arxiv.org/abs/2604.09412
作者: Jie Huang,Bruno Loureiro,Stefano Sarao Mannelli
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 34 pages, 22 figures

点击查看摘要

Abstract:We study the population loss landscape of two-layer ReLU networks of the form \sum_{k=1}^K \mathrm{ReLU}(w_k^\top x) in a realisable teacher-student setting with Gaussian covariates. We show that local minima admit an exact low-dimensional representation in terms of summary statistics, yielding a sharp and interpretable characterisation of the landscape. We further establish a direct link with one-pass SGD: local minima correspond to attractive fixed points of the dynamics in summary statistics space. This perspective reveals a hierarchical structure of minima: they are typically isolated in the well-specified regime, but become connected by flat directions as network width increases. In this overparameterised regime, global minima become increasingly accessible, attracting the dynamics and reducing convergence to spurious solutions. Overall, our results reveal intrinsic limitations of common simplifying assumptions, which may miss essential features of the loss landscape even in minimal neural network models.
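The teacher-student setup in the abstract is easy to simulate: draw Gaussian covariates and estimate the population loss of a student network against a realisable teacher by Monte Carlo. The dimensions and weights below are toy choices; the paper itself works with exact summary statistics rather than sampling:

```python
import random

# Monte Carlo estimate of the population loss E_x (f_W(x) - f_W*(x))^2 for
# two-layer ReLU networks f_W(x) = sum_k ReLU(w_k . x) with Gaussian inputs.
# Toy dimensions and weights; the paper uses exact summary statistics instead.
def net(W, x):
    return sum(max(0.0, sum(wk * xi for wk, xi in zip(w, x))) for w in W)

def population_loss(W_student, W_teacher, dim, n=20000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        total += (net(W_student, x) - net(W_teacher, x)) ** 2
    return total / n

teacher = [[1.0, 0.0], [0.0, 1.0]]
print(population_loss(teacher, teacher, dim=2))                        # 0.0
print(population_loss([[0.5, 0.0], [0.0, 1.0]], teacher, dim=2) > 0)  # True
```

In the realisable setting the teacher itself is a global minimum with zero population loss, which the first call confirms exactly.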

[LG-67] Variational Quantum Physics-Informed Neural Networks for Hydrological PDE-Constrained Learning with Inherent Uncertainty Quantification

链接: https://arxiv.org/abs/2604.09374
作者: Prasad Nimantha Madusanka Ukwatta Hewage,Midhun Chakkravarthy,Ruvan Kumara Abeysekara
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 25 pages, 6 tables. Code available at this https URL

点击查看摘要

Abstract:We propose a Hybrid Quantum-Classical Physics-Informed Neural Network (HQC-PINN) that integrates parameterized variational quantum circuits into the PINN framework for hydrological PDE-constrained learning. Our architecture encodes multi-source remote sensing features into quantum states via trainable angle encoding, processes them through a hardware-efficient variational ansatz with entangling layers, and constrains the output using the Saint-Venant shallow water equations and Manning’s flow equation as differentiable physics loss terms. The inherent stochasticity of quantum measurement provides a natural mechanism for uncertainty quantification without requiring explicit Bayesian inference machinery. We further introduce a quantum transfer learning protocol that pre-trains on multi-hazard disaster data before fine-tuning on flood-specific events. Numerical simulations on multi-modal satellite and meteorological data from the Kalu River basin, Sri Lanka, show that the HQC-PINN achieves convergence in ~3x fewer training epochs and uses ~44% fewer trainable parameters compared to an equivalent classical PINN, while maintaining competitive classification accuracy. Theoretical analysis indicates that hydrological physics constraints narrow the effective optimization landscape, providing a natural mitigation against barren plateaus in variational quantum circuits. This work establishes the first application of quantum-enhanced physics-informed learning to hydrological prediction and demonstrates a viable path toward quantum advantage in environmental science.

[LG-68] Biologically-Grounded Multi-Encoder Architectures as Developability Oracles for Antibody Design ICLR2026

链接: https://arxiv.org/abs/2604.09369
作者: Simon J. Crouzet
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: ICLR 2026 Workshop on Generative and Experimental Perspectives for Biomolecular Design

点击查看摘要

Abstract:Generative models can now propose thousands of de novo antibody sequences, yet translating these designs into viable therapeutics remains constrained by the cost of biophysical characterization. Here we present CrossAbSense, a framework of property-specific neural oracles that combine frozen protein language model encoders with configurable attention decoders, identified through a systematic hyperparameter campaign totaling over 200 runs per property. On the GDPa1 benchmark of 242 therapeutic IgGs, our oracles achieve notable improvements of 12–20% over established baselines on three of five developability assays and competitive performance on the remaining two. The central finding is that optimal decoder architectures invert our initial biological hypotheses: self-attention alone suffices for aggregation-related properties (hydrophobic interaction chromatography, polyreactivity), where the relevant sequence signatures – such as CDR-H3 hydrophobic patches – are already fully resolved within single-chain embeddings by the high-capacity 6B encoder. Bidirectional cross-attention, by contrast, is required for expression yield and thermal stability – properties that inherently depend on the compatibility between heavy and light chains. Learned chain fusion weights independently confirm heavy-chain dominance in aggregation (w_H = 0.62) versus balanced contributions for stability (w_H = 0.51). We demonstrate practical utility by deploying CrossAbSense on 100 IgLM-generated antibody designs, illustrating a path toward substantial reduction in experimental screening costs.

[LG-69] Transferable FB-GNN-MBE Framework for Potential Energy Surfaces: Data-Adaptive Transfer Learning in Deep Learned Many-Body Expansion Theory

链接: https://arxiv.org/abs/2604.09320
作者: Siqi Chen,Zhiqiang Wang,Yili Shen,Xianqi Deng,Xi Cheng,Cheng-Wei Ju,Jun Yi,Guo Ling,Dieaa Alhmoud,Hui Guan,Zhou Lin
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: Under review with The Journal of Chemical Physics. Main text: 23 pages, 11 figures, and 1 table. Supplementary Materials: 28 pages, 6 figures, 15 tables, 4 pseudo-algorithms

点击查看摘要

Abstract:Mechanistic understanding and rational design of complex chemical systems depend on fast and accurate predictions of electronic structures beyond individual building blocks. However, if the system exceeds hundreds of atoms, first-principles quantum mechanical (QM) modeling becomes impractical. In this study, we developed FB-GNN-MBE by integrating a fragment-based graph neural network (FB-GNN) into the many-body expansion (MBE) theory and demonstrated its capacity to reproduce first-principles potential energy surfaces (PES) for hierarchically structured systems with manageable accuracy, complexity, and interpretability. Specifically, we divided the entire system into basic building blocks (fragments), evaluated their one-fragment energies using a QM model, and addressed many-fragment interactions using the structure-property relationships trained by FB-GNNs. Our investigation shows that FB-GNN-MBE achieves chemical accuracy in predicting two-body (2B) and three-body (3B) energies across water, phenol, and mixture benchmarks, as well as the one-dimensional dissociation curves of water and phenol dimers. To transfer the success of FB-GNN-MBE across various systems with minimal computational costs and data demands, we developed and validated a teacher-student learning protocol. A heavy-weight FB-GNN trained on a mixed-density water cluster ensemble (teacher) distills its learned knowledge and passes it to a light-weight GNN (student), which is later fine-tuned on a uniform-density (H2O)21 cluster ensemble. This transfer learning strategy resulted in efficient and accurate prediction of 2B and 3B energies for variously sized water clusters without retraining. Our transferable FB-GNN-MBE framework outperformed conventional non-FB-GNN-based models and showed high practicality for large-scale molecular simulations.
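The many-body expansion underlying FB-GNN-MBE, truncated at second order, writes the total energy as E ≈ Σ_i E(i) + Σ_{i<j} [E(ij) − E(i) − E(j)]. A sketch with a toy energy oracle standing in for the QM and FB-GNN fragment evaluations (the positions and pair potential below are invented for illustration):

```python
from itertools import combinations

# Second-order many-body expansion:
#   E ~ sum_i E(i) + sum_{i<j} [E(ij) - E(i) - E(j)].
# The toy "energy" oracle below stands in for the QM / FB-GNN fragment calls.
def mbe2(fragments, energy):
    e1 = {f: energy((f,)) for f in fragments}
    total = sum(e1.values())
    for i, j in combinations(fragments, 2):
        total += energy((i, j)) - e1[i] - e1[j]   # pairwise correction
    return total

pos = {"A": 0.0, "B": 1.0, "C": 3.0}              # invented fragment "positions"
def energy(frags):
    e = sum(pos[f] for f in frags)                # one-body terms
    for i, j in combinations(frags, 2):
        e += -1.0 / abs(pos[i] - pos[j])          # two-body terms only
    return e

exact = energy(tuple(pos))
print(abs(mbe2(list(pos), energy) - exact) < 1e-12)   # True: MBE2 is exact here
```

Because this toy oracle contains no three-body terms, the second-order expansion recovers the full-system energy exactly; for real molecular systems the truncation error is what the learned many-fragment corrections must absorb.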

[LG-70] Iterative Identification Closure: Amplifying Causal Identifiability in Linear SEMs

链接: https://arxiv.org/abs/2604.09309
作者: Ziyi Ding,Xiao-Ping Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:The Half-Trek Criterion (HTC) is the primary graphical tool for determining generic identifiability of causal effect coefficients in linear structural equation models (SEMs) with latent confounders. However, HTC is inherently node-wise: it simultaneously resolves all incoming edges of a node, leaving a gap of "inconclusive" causal effects (15-23% in moderate graphs). We introduce Iterative Identification Closure (IIC), a general framework that decouples causal identification into two phases: (1) a seed function S_0 that identifies an initial set of edges from any external source of information (instrumental variables, interventions, non-Gaussianity, prior knowledge, etc.); and (2) Reduced HTC propagation that iteratively substitutes known coefficients to reduce system dimension, enabling identification of edges that standard HTC cannot resolve. The core novelty is iterative identification propagation: newly identified edges feed back to unlock further identification – a mechanism absent from all existing graphical criteria, which treat each edge (or node) in isolation. This propagation is non-trivial: coefficient substitution alters the covariance structure, and soundness requires proving that the modified Jacobian retains generic full rank – a new theoretical result (Reduced HTC Theorem). We prove that IIC is sound, monotone, converges in O(|E|) iterations (empirically ≤ 2), and strictly subsumes both HTC and ancestor decomposition. Exhaustive verification on all graphs with n=5 (134,144 edges) confirms 100% precision (zero false positives); with combined seeds, IIC reduces the HTC gap by over 80%. The propagation gain is γ ≈ 4x (from 2 seeds identifying ~3% of edges to 97.5% total identification), far exceeding the γ ≈ 1.2x of prior methods that incorporate side information without iterative feedback.
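The iterative propagation at the heart of IIC is a monotone closure: starting from a seed set of identified edges, a rule is applied repeatedly until a fixed point is reached. The sketch below uses a deliberately simplified, hypothetical rule (an edge counts as identified once every edge into its source node is identified) in place of the paper's Reduced HTC criterion:

```python
# Schematic Iterative Identification Closure: grow the set of identified
# edges from a seed until no propagation rule fires.  The toy rule here is
# a stand-in for the paper's Reduced HTC criterion.
def closure(edges, seed, rule):
    identified = set(seed)
    changed, iters = True, 0
    while changed:
        changed, iters = False, iters + 1
        for e in edges:
            if e not in identified and rule(e, identified, edges):
                identified.add(e)       # newly identified edges feed back
                changed = True
    return identified, iters

def toy_rule(edge, identified, edges):
    u, _ = edge                          # hypothetical: resolved once all
    return all(e in identified for e in edges if e[1] == u)  # edges into u are

chain = [("a", "b"), ("b", "c"), ("c", "d")]
ident, iters = closure(chain, seed={("a", "b")}, rule=toy_rule)
print(sorted(ident))   # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

On this chain a single seed edge propagates through the whole graph, and the loop terminates after a bounded number of passes, mirroring the O(|E|) convergence guarantee.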

[LG-71] Natural Riemannian gradient for learning functional tensor networks

链接: https://arxiv.org/abs/2604.09263
作者: Nikolas Klug,Michael Ulbrich,Marius Willner,André Uschmajew
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:We consider machine learning tasks with low-rank functional tree tensor networks (TTN) as the learning model. While in the case of least-squares regression, low-rank functional TTNs can be efficiently optimized using alternating optimization, this is not directly possible in other problems, such as multinomial logistic regression. We propose a natural Riemannian gradient descent type approach applicable to arbitrary losses which is based on the natural gradient by Amari. In particular, the search direction obtained by the natural gradient is independent of the choice of basis of the underlying functional tensor product space. Our framework applies to both the factorized and manifold-based approach for representing the functional TTN. For practical application, we propose a hierarchy of efficient approximations to the true natural Riemannian gradient for computing the updates in the parameter space. Numerical experiments confirm our theoretical findings on common classification datasets and show that using natural Riemannian gradient descent for learning considerably improves convergence behavior when compared to standard Riemannian gradient methods.

[LG-72] A Predictive View on Streaming Hidden Markov Models

链接: https://arxiv.org/abs/2604.09208
作者: Gerardo Duran-Martin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop a predictive-first optimisation framework for streaming hidden Markov models. Unlike classical approaches that prioritise full posterior recovery under a fully specified generative model, we assume access to regime-specific predictive models whose parameters are learned online while maintaining a fixed transition prior over regimes. Our objective is to sequentially identify latent regimes while maintaining accurate step-ahead predictive distributions. Because the number of possible regime paths grows exponentially, exact filtering is infeasible. We therefore formulate streaming inference as a constrained projection problem in predictive-distribution space: under a fixed hypothesis budget, we approximate the full posterior predictive by the forward-KL optimal mixture supported on S paths. The solution is the renormalised top-S posterior-weighted mixture, providing a principled derivation of beam search for HMMs. The resulting algorithm is fully recursive and deterministic, performing beam-style truncation with closed-form predictive updates and requiring neither EM nor sampling. Empirical comparisons against Online EM and Sequential Monte Carlo under matched computational budgets demonstrate competitive prequential performance.
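The truncation step described here has a simple closed form: keep the S highest-weight regime paths and renormalise their posterior weights. A sketch with hand-made path weights:

```python
# Truncated predictive mixture from the abstract: keep the S highest-weight
# regime paths and renormalise their posterior weights (toy numbers).
def truncate_renormalise(path_weights, S):
    top = sorted(path_weights.items(), key=lambda kv: kv[1], reverse=True)[:S]
    z = sum(w for _, w in top)
    return {path: w / z for path, w in top}

weights = {"AAB": 0.40, "ABB": 0.30, "BAB": 0.20, "BBB": 0.10}
beam = truncate_renormalise(weights, S=2)
print(beam)   # weights 4/7 and 3/7 over the two surviving paths
```

In the full algorithm this truncation alternates with closed-form predictive updates of each surviving path; only the projection step is shown here.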

[LG-73] A Fast and Generic Energy-Shifting Transformer for Hybrid Monte Carlo Radiotherapy Calculation

链接: https://arxiv.org/abs/2604.09157
作者: Chi-Hieu Pham,Didier Benoit,Vincent Bourbonne,Ulrike Schick,Julien Bert
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures, 6 tables

点击查看摘要

Abstract:We introduce a novel learning framework for accelerated Monte Carlo (MC) dose calculation termed Energy-Shifting. This approach leverages deep learning to synthesize 6 MV TrueBeam Linear Accelerator (LINAC) dose distributions directly from monoenergetic inputs under identical beam configurations. Unlike conventional denoising techniques, which rely on noisy low-count dose maps that compromise beam profile integrity, our method achieves superior cross-domain generalization on unseen datasets by integrating high-fidelity anatomical textures and source-specific beam similarity into the model’s input space. Furthermore, we propose a novel 3D architecture termed TransUNetSE3D, featuring Transformer blocks for global context and Residual Squeeze-and-Excitation (SE) modules for adaptive channel-wise feature recalibration. Hierarchical representations of these blocks are fused into the network’s latent space alongside the primary dose-map parameters, allowing physics-aware reconstruction. This hybrid design outperforms existing UNet and Transformer-based benchmarks in both spatial precision and structural preservation, while maintaining the execution speed necessary for real-time use. Our proposed pipeline achieves a Gamma Passing Rate exceeding 98% (3%/3mm) compared to the MC reference, evaluated within the framework of a treatment planning system (TPS) for prostate radiotherapy. These results offer a robust solution for fast volumetric dosimetry in adaptive radiotherapy.

[LG-74] Identifying Causal Effects Using a Single Proxy Variable

链接: https://arxiv.org/abs/2604.09135
作者: Silvan Vollmer,Niklas Pfister,Sebastian Weichwald
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: Equal contribution between Pfister and Weichwald

点击查看摘要

Abstract:Unobserved confounding is a key challenge when estimating causal effects from a treatment on an outcome in scientific applications. In this work, we assume that we observe a single, potentially multi-dimensional proxy variable of the unobserved confounder and that we know the mechanism that generates the proxy from the confounder. Under a completeness assumption on this mechanism, which we call Single Proxy Identifiability of Causal Effects or simply SPICE, we prove that causal effects are identifiable. We extend the proxy-based causal identifiability results by Kuroki and Pearl (2014); Pearl (2010) to higher dimensions, more flexible functional relationships and a broader class of distributions. Further, we develop a neural network based estimation framework, SPICE-Net, to estimate causal effects, which is applicable to both discrete and continuous treatments.

[LG-75] Online Quantile Regression for Nonparametric Additive Models

链接: https://arxiv.org/abs/2604.08969
作者: Haoran Zhan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:This paper introduces a projected functional gradient descent algorithm (P-FGD) for training nonparametric additive quantile regression models in online settings. This algorithm extends the functional stochastic gradient descent framework to the pinball loss. An advantage of P-FGD is that it does not need to store historical data while maintaining O(J_t \ln J_t) computational complexity per step, where J_t denotes the number of basis functions. Besides, only O(J_t) computational time is needed for quantile function prediction at time t. These properties show that P-FGD is much more efficient than commonly used RKHS-based methods in online learning. By leveraging a novel Hilbert space projection identity, we also prove that the proposed online quantile function estimator (P-FGD) achieves the minimax optimal consistency rate O(t^{-\frac{2s}{2s+1}}), where t is the current time and s denotes the smoothness degree of the quantile function. Extensions to mini-batch learning are also established.
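The scalar special case of this idea (a single constant basis function, so J_t = 1) reduces to classical online quantile estimation with the pinball loss, which makes the recursion easy to see. The step size and sample sizes below are arbitrary toy choices:

```python
import random

# Scalar special case of online pinball-loss gradient descent: a single
# constant basis function (J_t = 1) tracks the tau-quantile of a stream.
# Step size and sample counts below are arbitrary illustrative values.
def online_quantile(stream, tau, lr=0.05):
    q = 0.0
    for y in stream:
        q += lr * (tau - (1.0 if y < q else 0.0))   # negative pinball subgradient
    return q

rng = random.Random(42)
data = [rng.gauss(10.0, 2.0) for _ in range(50000)]
print(online_quantile(data, tau=0.5))   # drifts to roughly the true median, ~10
```

Each update needs only the current observation, never the history, which is the constant-memory property the abstract emphasises; the full method replaces the scalar q with a growing basis expansion plus a projection step.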

[LG-76] A novel hybrid approach for positive-valued DAG learning

链接: https://arxiv.org/abs/2604.08935
作者: Yao Zhao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 13 pages, 2 tables. Accepted at CLeaR 2026

点击查看摘要

Abstract:Causal discovery from observational data remains a fundamental challenge in machine learning and statistics, particularly when variables represent inherently positive quantities such as gene expression levels, asset prices, company revenues, or population counts, which often follow multiplicative rather than additive dynamics. We propose the Hybrid Moment-Ratio Scoring (H-MRS) algorithm, a novel method for learning directed acyclic graphs (DAGs) from positive-valued data by combining moment-based scoring with log-scale regression. The key idea is that for positive-valued variables, the moment ratio \frac{\mathbb{E}[X_j^2]}{\mathbb{E}[(\mathbb{E}[X_j \mid S])^2]} provides an effective criterion for causal ordering, where S denotes candidate parent sets. H-MRS integrates log-scale Ridge regression for moment-ratio estimation with a greedy ordering procedure based on raw-scale moment ratios, followed by Elastic Net-based parent selection to recover the final DAG structure. Experiments on synthetic log-linear data demonstrate competitive precision and recall. The proposed method is computationally efficient and naturally respects positivity constraints, making it suitable for applications in genomics and economics. These results suggest that combining log-scale modeling with raw-scale moment ratios provides a practical framework for causal discovery in positive-valued domains.
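The moment-ratio criterion can be checked on a toy multiplicative model where the conditional mean E[X_j | S] is known in closed form; in H-MRS it would instead be estimated by log-scale Ridge regression. All distributions and sizes below are illustrative:

```python
import math
import random

# Raw-scale moment ratio E[X^2] / E[(E[X|S])^2] on a toy multiplicative
# model X = S * eps, where E[X | S] is known in closed form (H-MRS would
# instead fit the conditional mean with log-scale Ridge regression).
def moment_ratio(xs, cond_means):
    num = sum(v * v for v in xs) / len(xs)
    den = sum(m * m for m in cond_means) / len(cond_means)
    return num / den

rng = random.Random(0)
s = [rng.lognormvariate(0.0, 0.5) for _ in range(20000)]
x = [si * rng.lognormvariate(0.0, 0.3) for si in s]     # child: X = S * eps
cond = [si * math.exp(0.3 ** 2 / 2) for si in s]        # E[X|S] = S * E[eps]

print(moment_ratio(x, cond) >= 1.0)   # True: conditioning never inflates E[X^2]
```

By Jensen's inequality the population ratio is always at least 1, with larger values the noisier the child is given its candidate parents, which is what makes it usable as an ordering score.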

[LG-77] Policy-Aware Design of Large-Scale Factorial Experiments

链接: https://arxiv.org/abs/2604.08804
作者: Xin Wen,Xi Chen,Will Wei Sun,Yichen Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Digital firms routinely run many online experiments on shared user populations. When product decisions are compositional, such as combinations of interface elements, flows, messages, or incentives, the number of feasible interventions grows combinatorially, while available traffic remains limited. Overlapping experiments can therefore generate interaction effects that are poorly handled by decentralized A/B testing. We study how to design large-scale factorial experiments when the objective is not to estimate every treatment effect, but to identify a high-performing policy under a fixed experimentation budget. We propose a two-stage design that centralizes overlapping experiments into a single factorial problem and models expected outcomes as a low-rank tensor. In the first stage, the platform samples a subset of intervention combinations, uses tensor completion to infer performance on untested combinations, and eliminates weak factor levels using estimated marginal contributions. In the second stage, it applies sequential halving to the surviving combinations to select a final policy. We establish gap-independent simple-regret bounds and gap-dependent identification guarantees showing that the relevant complexity scales with the degrees of freedom of the low-rank tensor and the separation structure across factor levels, rather than the full factorial size. In an offline evaluation based on a product-bundling problem constructed from 100 million Taobao interactions, the proposed method substantially outperforms one-shot tensor completion and unstructured best-arm benchmarks, especially in low-budget and high-noise settings. These results show how centralized, policy-aware experimentation can make combinatorial product design operationally feasible at platform scale.
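The second stage described above, sequential halving, is a standard fixed-budget best-arm routine: pull every surviving combination equally, drop the worse half, and repeat. A self-contained sketch on a synthetic bandit (arm means, noise level, and budget are made up):

```python
import random

# Sequential halving (the paper's second stage): pull every surviving arm
# equally, drop the worse half, repeat until one combination remains.
# Arm means, noise level, and budget below are made up for illustration.
def sequential_halving(pull, n_arms, budget, seed=0):
    rng = random.Random(seed)
    arms = list(range(n_arms))
    rounds = max(1, (n_arms - 1).bit_length())          # ~ ceil(log2 n_arms)
    while len(arms) > 1:
        pulls = max(1, budget // (len(arms) * rounds))  # per-arm budget this round
        means = {a: sum(pull(a, rng) for _ in range(pulls)) / pulls for a in arms}
        arms = sorted(arms, key=means.get, reverse=True)[: max(1, len(arms) // 2)]
    return arms[0]

best = sequential_halving(lambda a, rng: a / 10 + rng.gauss(0, 0.5),
                          n_arms=8, budget=4000)
print(best)   # arm 7 has the highest mean and wins with high probability
```

In the paper the surviving "arms" are intervention combinations that already passed the tensor-completion elimination stage, so the effective n_arms is far smaller than the full factorial size.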

[LG-78] CERBERUS: A Three-Headed Decoder for Vertical Cloud Profiles ICLR

链接: https://arxiv.org/abs/2604.08772
作者: Emily K. deJong,Nipun Gunawardena,Kevin Smalley,Hassan Beydoun,Peter Caldwell
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: Accepted for oral presentation at 2026 ICLR workshop on Machine Learning for Remote Sensing

点击查看摘要

Abstract:Atmospheric clouds exhibit complex three-dimensional structure and microphysical details that are poorly constrained by the predominantly two-dimensional satellite observations available at global scales. This mismatch complicates data-driven learning and evaluation of cloud processes in weather and climate models, contributing to ongoing uncertainty in atmospheric physics. We introduce CERBERUS, a probabilistic inference framework for generating vertical radar reflectivity profiles from geostationary satellite brightness temperatures, near-surface meteorological variables, and temporal context. CERBERUS employs a three-headed encoder-decoder architecture to predict a zero-inflated (ZI) vertically-resolved distribution of radar reflectivity. Trained and evaluated using ground-based Ka-band radar observations at the ARM Southern Great Plains site, CERBERUS recovers coherent structures across cloud regimes, generalizes to withheld test periods, and provides uncertainty estimates that reflect physical ambiguity, particularly in multilayer and dynamically complex clouds. These results demonstrate the value of distribution-based learning targets for bridging observational scales, introducing a path toward model-relevant synthetic observations of clouds.

[LG-79] Weak Adversarial Neural Pushforward Method for the Wigner Transport Equation

链接: https://arxiv.org/abs/2604.08763
作者: Andrew Qing He,Wei Cai,Sihong Shao
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 9 pages, 1 algorithm

点击查看摘要

Abstract:We extend the Weak Adversarial Neural Pushforward Method to the Wigner transport equation governing the phase-space dynamics of quantum systems. The central contribution is a structural observation: integrating the nonlocal pseudo-differential potential operator against plane-wave test functions produces a Dirac delta that exactly inverts the Fourier transform defining the Wigner potential kernel, reducing the operator to a pointwise finite difference of the potential at two shifted arguments. This holds in arbitrary dimension, requires no truncation of the Moyal series, and treats the potential as a black-box function oracle with no derivative information. To handle the negativity of the Wigner quasi-probability distribution, we introduce a signed pushforward architecture that decomposes the solution into two non-negative phase-space distributions mixed with a learnable weight. The resulting method inherits the mesh-free, Jacobian-free, and scalable properties of the original framework while extending it to the quantum setting.

[LG-80] Active Learning for Generalizable Detonation Performance Prediction of Energetic Materials

链接: https://arxiv.org/abs/2604.08744
作者: R. Seaton Ullberg,Megan C. Davis,Jeremy N. Schroeder,Andrew H. Salij,M. J. Cawkwell,Christopher J. Snyder,Wilton J. M. Kort-Kamp,Ivana Matanovic
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:The discovery of new energetic materials is critical for advancing technologies from defense to private industry. However, experimental approaches remain slow and expensive while computational alternatives require accurate material property inputs that are often costly to obtain, limiting their ability to efficiently predict detonation performance across a vast chemical space. We address this challenge through an active learning strategy that integrates density functional theory calculations, thermochemical modeling, message-passing neural networks, and Bayesian optimization. The resulting high-throughput workflow iteratively expands the training dataset by selecting new molecules in a targeted manner that balances the exploration of broad chemical space with the exploitation of promising high-performing candidates. This approach yields the largest publicly available database of potential CHNO explosives drawn from an initial pool of more than 70 billion candidates and a generalizable surrogate model capable of accurately predicting detonation performance (R^2 > 0.98). Feature importance analysis on this largest-to-date dataset reveals that oxygen balance is the dominant driver of detonation performance, complemented by contributions from local electronic structure, density, and the presence of specific functional groups. Cheminformatics analysis highlights how energetic materials with similar performance metrics tend to cluster in distinct chemical spaces offering a clearer direction for future synthesis studies. Together, the surrogate model, database, and resulting chemical insights provide a valuable foundation for high-throughput screening and targeted discovery of new energetic materials spanning diverse and previously unexplored regions of chemical space.
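Oxygen balance, the dominant feature in the importance analysis above, has a simple closed form for CHNO compounds: OB% = -1600 * (2C + H/2 - O) / MW. A quick implementation using standard atomic weights, with the well-known TNT value as a sanity check:

```python
# Oxygen balance (OB%), the dominant feature in the paper's analysis, for a
# CHNO compound C_c H_h N_n O_o:  OB% = -1600 * (2c + h/2 - o) / MW.
def oxygen_balance(c, h, n, o):
    mw = 12.011 * c + 1.008 * h + 14.007 * n + 15.999 * o   # standard atomic weights
    return -1600.0 * (2 * c + h / 2 - o) / mw

# Sanity check on TNT (C7H5N3O6), a strongly oxygen-deficient explosive.
print(round(oxygen_balance(7, 5, 3, 6), 1))   # -74.0
```

Negative values mean the molecule lacks enough oxygen to fully oxidize its carbon and hydrogen; formulations near zero balance tend to release the most energy, which is why the feature dominates.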

[LG-81] Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate

链接: https://arxiv.org/abs/2604.08742
作者: Yaxin Yu,Long Chen,Zeyi Xu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 27 pages, 4 figures

点击查看摘要

Abstract:Adam has achieved strong empirical success, but its theory remains incomplete even in the deterministic full-batch setting, largely because adaptive preconditioning and momentum are tightly coupled. In this work, a convergent reformulation of full-batch Adam is developed by combining variable and operator splitting with a curvature-aware gradient correction. This leads to a continuous-time Adam-HNAG flow with an exponentially decaying Lyapunov function, as well as two discrete methods: Adam-HNAG, and Adam-HNAG-s, a synchronous variant closer in form to Adam. Within a unified Lyapunov analysis framework, convergence guarantees are established for both methods in the convex smooth setting, including accelerated convergence. Numerical experiments support the theory and illustrate the different empirical behavior of the two discretizations. To the best of our knowledge, this provides the first convergence proof for Adam-type methods in convex optimization.
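For reference, the deterministic full-batch Adam update that the paper starts from can be sketched as below. Adam-HNAG itself reformulates this update via variable and operator splitting with a curvature-aware gradient correction, which is not reproduced here:

```python
import numpy as np

def adam_full_batch(grad, x0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    # plain deterministic (full-batch) Adam with bias correction;
    # note the tight coupling of momentum (m) and preconditioning (v)
    # that the paper's splitting analysis disentangles
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# convex quadratic f(x) = 0.5 x^T A x, minimizer at the origin
A = np.diag([1.0, 10.0])
x_final = adam_full_batch(lambda z: A @ z, np.array([3.0, -2.0]))
print(np.linalg.norm(x_final))
```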

[LG-82] An Algorithm for Fast Assembling Large-Scale Defect-Free Atom Arrays

链接: https://arxiv.org/abs/2604.08669
作者: Tao Zhang,Xiaodi Li,Hui Zhai,Linghui Chen
类目: Quantum Gases (cond-mat.quant-gas); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:It is widely believed that tens of thousands of physical qubits are needed to build a practically useful quantum computer. Atom arrays formed by optical tweezers are among the most promising platforms for achieving this goal, owing to the excellent scalability and mobility of atomic qubits. However, assembling a defect-free atom array with ~ 10^4 qubits remains algorithmically challenging, alongside other hardware limitations. This is due to the computationally hard path-planning problems and the time-consuming generation of sufficiently smooth trajectories for optical tweezer potentials by spatial light modulators (SLMs). Here, we present a unified framework comprising two innovative components to fully address these algorithmic challenges: (1) a path-planning module that employs a supervised learning approach using a graph neural network combined with a modified auction decoder, and (2) a potential-generation module called the phase and profile-aware Weighted Gerchberg-Saxton algorithm. The inference time for the first module is nearly a size-independent constant overhead of ~ 5 ms, and the second module generates a potential frame in about 0.5 ms, a timescale shorter than the current commercial SLM refresh time. Altogether, our algorithm enables the assembly of an atom array with 10^4 qubits on a timescale much shorter than the typical vacuum lifetime of the trapped atoms.
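The second module builds on the weighted Gerchberg-Saxton (WGS) idea. A textbook WGS loop for equalizing tweezer spot intensities, shown here as a sketch of the underlying algorithm only (the paper's phase- and profile-aware variant adds structure not modeled here):

```python
import numpy as np

def weighted_gs(target_mask, iters=30, seed=0):
    # iterate between the SLM plane (unit amplitude, free phase) and the
    # trap plane (target spots), reweighting each spot toward the mean
    # amplitude so that all traps end up equally bright
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0, 2 * np.pi, target_mask.shape)
    w = target_mask.astype(float)
    for _ in range(iters):
        field = np.fft.fft2(np.exp(1j * phase))       # SLM -> trap plane
        spots = np.abs(field)[target_mask]
        w[target_mask] *= spots.mean() / np.maximum(spots, 1e-12)
        trap = np.where(target_mask, w, 0.0) * np.exp(1j * np.angle(field))
        phase = np.angle(np.fft.ifft2(trap))          # trap -> SLM plane
    field = np.fft.fft2(np.exp(1j * phase))
    return np.abs(field[target_mask]) ** 2            # final spot intensities

mask = np.zeros((32, 32), dtype=bool)
mask[8:24:4, 8:24:4] = True                           # 4x4 grid of target traps
I = weighted_gs(mask)
print(I.min() / I.max())                              # spot uniformity
```

The reweighting step `w *= mean / spots` is what distinguishes WGS from plain Gerchberg-Saxton: dim spots get boosted and bright spots suppressed on each pass.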

[LG-83] Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States

链接: https://arxiv.org/abs/2604.08661
作者: Asif Bin Ayub,Amine Mohamed Aboussalah,Mohamed Hibat-Allah
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 16 pages, 4 figures, and 1 table

点击查看摘要

Abstract:Neural Quantum States based on autoregressive recurrent neural network (RNN) wave functions enable efficient sampling without Markov-chain autocorrelation, but standard RNN architectures are biased toward finite-length correlations and can fail on states with long-range dependencies. A common response is to adopt transformer-style self-attention, but this typically comes with substantially higher computational and memory overhead. Here we introduce dilated RNN wave functions, where recurrent units access distant sites through dilated connections, injecting an explicit long-range inductive bias while retaining a favorable \mathcal{O}(N \log N) forward pass scaling. We show analytically that dilation changes the correlation geometry and can induce power-law correlation scaling in a simplified linearized and perturbative setting. Numerically, for the critical 1D transverse-field Ising model, dilated RNNs reproduce the expected power-law connected two-point correlations in contrast to the exponential decay typical of conventional RNN ansätze. We further show that the dilated RNN accurately approximates the one-dimensional Cluster state, a paradigmatic example with long-range conditional correlations that has previously been reported to be challenging for RNN-based wave functions. These results highlight dilation as a simple geometric mechanism for building correlation-aware autoregressive neural quantum states.
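The dilated connections can be illustrated with a minimal numpy recurrence in which the hidden state at site t reads from site t - dilation rather than t - 1; stacking layers with dilations 1, 2, 4, ... lets distant sites communicate in logarithmically many hops. This sketches the geometric mechanism only, not the autoregressive wave-function ansatz itself:

```python
import numpy as np

def dilated_rnn_layer(x, W_in, W_rec, dilation):
    # hidden state at site t is fed from site t - dilation, not t - 1,
    # so stacked layers with growing dilations reach distant sites directly
    T = x.shape[0]
    H = W_rec.shape[0]
    h = np.zeros((T, H))
    for t in range(T):
        h_prev = h[t - dilation] if t >= dilation else np.zeros(H)
        h[t] = np.tanh(x[t] @ W_in + h_prev @ W_rec)
    return h

rng = np.random.default_rng(1)
T, D, H = 16, 3, 8
h = rng.normal(size=(T, D))          # toy per-site inputs
dims = [D, H, H]
for d_in, dilation in zip(dims, [1, 2, 4]):
    W_in = 0.1 * rng.normal(size=(d_in, H))
    W_rec = 0.1 * rng.normal(size=(H, H))
    h = dilated_rnn_layer(h, W_in, W_rec, dilation)
print(h.shape)
```

With dilations 1, 2, 4 the final hidden state at site t depends directly on sites up to 7 steps away, rather than only through a chain of single-step recurrences.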

[LG-84] High-dimensional inference for the γ-ray sky with differentiable programming

链接: https://arxiv.org/abs/2604.08648
作者: Siddharth Mishra-Sharma,Tracy R. Slatyer,Yitian Sun,Yuqing Wu
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注: 17 pages, 13 figures. Code available at this https URL

点击查看摘要

Abstract:We motivate the use of differentiable probabilistic programming techniques in order to account for the large model-space inherent to astrophysical \gamma -ray analyses. Targeting the longstanding Galactic Center \gamma -ray Excess (GCE) puzzle, we construct a differentiable forward model and likelihood that make liberal use of GPU acceleration and vectorization in order to simultaneously account for a continuum of possible spatial morphologies consistent with the GCE emission in a fully probabilistic manner. Our setup allows for efficient inference over the large model space using variational methods. Beyond application to \gamma -ray data, a goal of this work is to showcase how differentiable probabilistic programming can be used as a tool to enable flexible analyses of astrophysical datasets.
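The core of a differentiable forward model with a Poisson likelihood can be sketched in a few lines: template normalizations are fit by ascending the exact gradient of the log-likelihood. The two disjoint-support templates and the exp parameterization below are illustrative simplifications, not the paper's GCE templates:

```python
import numpy as np

# two toy templates with disjoint spatial support (real sky templates overlap)
t0 = np.r_[np.ones(100), np.zeros(100)]
t1 = np.r_[np.zeros(100), np.ones(100)]
templates = np.stack([t0, t1])

def grad_loglike(theta, templates, counts):
    # forward model mu_p = sum_k exp(theta_k) * template_kp; the exp keeps
    # normalizations positive. Gradient of the Poisson log-likelihood
    # sum_p [counts_p * log(mu_p) - mu_p] with respect to theta.
    mu = np.exp(theta) @ templates
    return np.exp(theta) * (templates @ (counts / mu - 1.0))

rng = np.random.default_rng(0)
true_theta = np.array([1.0, 0.3])
counts = rng.poisson(np.exp(true_theta) @ templates)   # mock photon counts
theta = np.zeros(2)
for _ in range(500):
    theta += 1e-3 * grad_loglike(theta, templates, counts)
print(theta)
```

In practice the gradient would come from automatic differentiation of a much richer forward model, which is what makes GPU-accelerated variational inference over the full morphology space tractable.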

[LG-85] Spectral-Transport Stability and Benign Overfitting in Interpolating Learning

链接: https://arxiv.org/abs/2604.08625
作者: Gustav Olaf Yunus Laitinen-Lundström Fredriksson-Imanov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 50 pages, 7 figures, 4 tables. Research article. Includes full proofs, model-specific corollaries, and synthetic supporting experiments. Submitted to Machine Learning

点击查看摘要

Abstract:We develop a theoretical framework for generalization in the interpolating regime of statistical learning. The central question is why highly overparameterized estimators can attain zero empirical risk while still achieving nontrivial predictive accuracy, and how to characterize the boundary between benign and destructive overfitting. We introduce a spectral-transport stability framework in which excess risk is controlled jointly by the spectral geometry of the data distribution, the sensitivity of the learning rule under single-sample replacement, and the alignment structure of label noise. This leads to a scale-dependent Fredriksson index that combines effective dimension, transport stability, and noise alignment into a single complexity parameter for interpolating estimators. We prove finite-sample risk bounds, establish a sharp benign-overfitting criterion through the vanishing of the index along admissible spectral scales, and derive explicit phase-transition rates under polynomial spectral decay. For a model-specific specialization, we obtain an explicit theorem for polynomial-spectrum linear interpolation, together with a proof of the resulting rate. The framework also clarifies implicit regularization by showing how optimization dynamics can select interpolating solutions of minimal spectral-transport energy. These results connect algorithmic stability, double descent, benign overfitting, operator-theoretic learning theory, and implicit bias within a unified structural account of modern interpolation.
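The interpolating regime under polynomial spectral decay can be illustrated with minimum-norm least squares: the estimator fits noisy training data exactly yet still predicts nontrivially out of sample. A toy demonstration under assumed parameters (not the paper's spectral-transport machinery or Fredriksson index):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 50, 400, 1.2                   # overparameterized: d >> n
lam = np.arange(1, d + 1) ** (-alpha)        # polynomial spectral decay j^-alpha
X = rng.normal(size=(n, d)) * np.sqrt(lam)   # features with decaying spectrum
beta_star = np.zeros(d)
beta_star[:5] = 1.0                          # signal in the top eigendirections
y = X @ beta_star + 0.1 * rng.normal(size=n)
beta_hat = np.linalg.pinv(X) @ y             # minimum-norm interpolator
train_mse = np.mean((X @ beta_hat - y) ** 2)
X_test = rng.normal(size=(2000, d)) * np.sqrt(lam)
test_mse = np.mean((X_test @ (beta_hat - beta_star)) ** 2)
print(train_mse, test_mse)                   # ~0 train error, finite test error
```

Varying `alpha` moves the example across the spectral-decay regimes where the paper's phase-transition rates apply.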

[LG-86] Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control

链接: https://arxiv.org/abs/2604.08580
作者: Carles Domingo-Enrich,Jiequn Han
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reward fine-tuning of diffusion and flow models and sampling from tilted or Boltzmann distributions can both be formulated as stochastic optimal control (SOC) problems, where learning an optimal generative dynamics corresponds to optimizing a control under SDE constraints. In this work, we revisit and generalize Adjoint Matching, a recently proposed SOC-based method for learning optimal controls, and place it on a rigorous footing by deriving it from the Stochastic Maximum Principle (SMP). We formulate a general Hamiltonian adjoint matching objective for SOC problems with control-dependent drift and diffusion and convex running costs, and show that its expected value has the same first variation as the original SOC objective. As a consequence, critical points satisfy the Hamilton–Jacobi–Bellman (HJB) stationarity conditions. In the important practical case of state- and control-independent diffusion, we recover the lean adjoint matching loss previously introduced in adjoint matching, which avoids second-order terms and whose critical points coincide with the optimal control under mild uniqueness assumptions. Finally, we show that adjoint matching can be precisely interpreted as a continuous-time method of successive approximations induced by the SMP, yielding a practical and implementable alternative to classical SMP-based algorithms, which are obstructed by intractable martingale terms in the stochastic setting. These results are also of independent interest to the stochastic control community, providing new implementable objectives and a viable pathway for SMP-based iterations in stochastic problems.
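In the deterministic, discrete-time limit, the adjoint (costate) recursion of the maximum principle yields exact cost gradients, which is the mechanism adjoint matching builds on. A minimal sketch for a linear-quadratic control problem; the stochastic SMP setting with martingale terms treated in the paper is not modeled here:

```python
import numpy as np

def rollout(u, x0, A, B, Q, R):
    # x_{t+1} = A x_t + B u_t, cost J = sum_t 0.5 (x_t' Q x_t + u_t' R u_t)
    x, cost, xs = x0.copy(), 0.0, [x0.copy()]
    for t in range(len(u)):
        cost += 0.5 * (x @ Q @ x + u[t] @ R @ u[t])
        x = A @ x + B @ u[t]
        xs.append(x.copy())
    return cost, xs

def adjoint_gradient(u, x0, A, B, Q, R):
    # costate recursion of the discrete maximum principle:
    # p_t = A' p_{t+1} + Q x_t with p_T = 0, and dJ/du_t = R u_t + B' p_{t+1}
    _, xs = rollout(u, x0, A, B, Q, R)
    p, g = np.zeros_like(x0), np.zeros_like(u)
    for t in reversed(range(len(u))):
        g[t] = R @ u[t] + B.T @ p
        p = A.T @ p + Q @ xs[t]
    return g

rng = np.random.default_rng(0)
n, m, T = 3, 2, 10
A = 0.9 * np.eye(n) + 0.05 * rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
Q, R = np.eye(n), 0.1 * np.eye(m)
x0 = rng.normal(size=n)
u = 0.1 * rng.normal(size=(T, m))
g = adjoint_gradient(u, x0, A, B, Q, R)

# sanity check: the adjoint gradient matches a finite difference
eps = 1e-6
J0, _ = rollout(u, x0, A, B, Q, R)
u_pert = u.copy()
u_pert[3, 1] += eps
J1, _ = rollout(u_pert, x0, A, B, Q, R)
print(abs((J1 - J0) / eps - g[3, 1]))
```

A method-of-successive-approximations iteration would alternate such a backward costate pass with a Hamiltonian-minimizing control update, which is the classical picture the paper's continuous-time stochastic analysis generalizes.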

附件下载

点击下载今日全部论文列表