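This post lists the latest papers retrieved from arXiv.org on 2026-02-16. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: the daily paper data comes from arXiv.org and is refreshed automatically every morning at around 12:30.

Tip: if the list has not been updated on a given day, arXiv may not have published new papers that day, or the script may have failed. Failures are fixed the same day whenever possible.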

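Overview (2026-02-16)

495 papers were updated today, including:

  • Natural language processing: 61 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 126 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 93 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 154 papers (Machine Learning (cs.LG))
  • Multiagent systems: 11 papers (Multiagent Systems (cs.MA))
  • Information retrieval: 20 papers (Information Retrieval (cs.IR))
  • Human-computer interaction: 31 papers (Human-Computer Interaction (cs.HC))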

Multiagent Systems

[MA-0] TRACE: Temporal Reasoning via Agentic Context Evolution for Streaming Electronic Health Records (EHRs)

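[Quick read]: This paper addresses the unreliability of large language models (LLMs) on longitudinal patient trajectories, where evolving clinical states, irregular event timing, and heterogeneous events degrade performance over time. Existing approaches such as fine-tuning or retrieval-based augmentation suffer from high computational overhead, privacy constraints, or instability under long contexts. The key to the solution is the TRACE (Temporal Reasoning via Agentic Context Evolution) framework, which explicitly structures and maintains context instead of extending the context window or updating parameters. It uses a dual-memory architecture, with a static Global Protocol encoding institutional clinical rules and a dynamic Individual Protocol tracking patient-specific state, and four agentic components (Router, Reasoner, Auditor, and Steward) that cooperate on temporal reasoning and state evolution. The result is interpretable, auditable, and cost-bounded long-horizon clinical reasoning.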

Link: https://arxiv.org/abs/2602.12833
Authors: Zhan Qu, Michael Färber
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Large Language Models (LLMs) encode extensive medical knowledge but struggle to apply it reliably to longitudinal patient trajectories, where evolving clinical states, irregular timing, and heterogeneous events degrade performance over time. Existing adaptation strategies rely on fine-tuning or retrieval-based augmentation, which introduce computational overhead, privacy constraints, or instability under long contexts. We introduce TRACE (Temporal Reasoning via Agentic Context Evolution), a framework that enables temporal clinical reasoning with frozen LLMs by explicitly structuring and maintaining context rather than extending context windows or updating parameters. TRACE operates over a dual-memory architecture consisting of a static Global Protocol encoding institutional clinical rules and a dynamic Individual Protocol tracking patient-specific state. Four agentic components, Router, Reasoner, Auditor, and Steward, coordinate over this structured memory to support temporal inference and state evolution. The framework maintains bounded inference cost via structured state compression and selectively audits safety-critical clinical decisions. Evaluated on longitudinal clinical event streams from MIMIC-IV, TRACE significantly improves next-event prediction accuracy, protocol adherence, and clinical safety over long-context and retrieval-augmented baselines, while producing interpretable and auditable reasoning traces.

[MA-1] Decentralized Optimal Equilibrium Learning in Stochastic Games via Single-bit Feedback

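[Quick read]: This paper addresses decentralized equilibrium selection in stochastic games under severe information and communication constraints. Conventional methods only target convergence to some equilibrium, but stochastic games typically admit many equilibria with markedly different welfare properties, so selecting an optimal equilibrium is also required. The key to the solution is a decentralized learning framework built on a content/discontent signaling mechanism: agents observe only the global state trajectory and their own realized rewards, and each agent exchanges a single randomized bit of feedback per round, which implicitly aligns the decentralized learning dynamics with the designer-specified social welfare objective. The authors propose explore-and-commit and online variants applicable to general stochastic games, compatible with heterogeneous model-based or model-free learning methods, and establish explicit finite-time regret guarantees, showing logarithmic expected regret under mild conditions.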

Link: https://arxiv.org/abs/2602.12830
Authors: Seref Taha Kiremitci, Ahmed Said Donmez, Muhammed O. Sayin
Affiliation: Bilkent University
Categories: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Abstract:We study decentralized equilibrium selection in stochastic games under severe information and communication constraints. In such settings, convergence to equilibrium alone is insufficient, as stochastic games typically admit many equilibria with markedly different welfare properties. We address decentralized optimal equilibrium selection, where agents coordinate on equilibria that optimize a designer-specified social welfare objective while allowing heterogeneous tolerance to deviations from strict best responses. Agents observe only the global state trajectory and their realized rewards, and exchange a single randomized bit of feedback per agent per round. This semantic content/discontent signaling mechanism implicitly aligns decentralized learning dynamics with the global welfare objective. We develop explore-and-commit and online variants applicable to general stochastic games, accommodating heterogeneous model-based or model-free methods for solving the induced Markov decision processes, and establish explicit finite-time regret guarantees, showing logarithmic expected regret under mild conditions.

[MA-2] Multi-Agent Model-Based Reinforcement Learning with Joint State-Action Learned Embeddings

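[Quick read]: This paper addresses coordination in multi-agent reinforcement learning (MARL) under partially observable and highly dynamic environments, where the core challenge is building informative representations while training data-efficiently. The key to the solution is a model-based MARL framework that unifies joint state-action representation learning with imaginative roll-outs. Specifically, the authors design a world model trained with variational auto-encoders (VAEs) and introduce the state-action learned embedding (SALE). SALE is injected into the imagination module to forecast plausible future trajectories and into the joint agent network, whose individual action values are combined by a mixing network to estimate the joint action-value function. This coupling lets agents use SALE-based action values to understand how their decisions affect collective outcomes, improving long-term planning and optimization under limited real-environment interaction.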

Link: https://arxiv.org/abs/2602.12520
Authors: Zhizun Wang, David Meger
Affiliation: McGill University
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 22 pages

Abstract:Learning to coordinate many agents in partially observable and highly dynamic environments requires both informative representations and data-efficient training. To address this challenge, we present a novel model-based multi-agent reinforcement learning framework that unifies joint state-action representation learning with imaginative roll-outs. We design a world model trained with variational auto-encoders and augment the model using the state-action learned embedding (SALE). SALE is injected into both the imagination module that forecasts plausible future roll-outs and the joint agent network whose individual action values are combined through a mixing network to estimate the joint action-value function. By coupling imagined trajectories with SALE-based action values, the agents acquire a richer understanding of how their choices influence collective outcomes, leading to improved long-term planning and optimization under limited real-environment interactions. Empirical studies on well-established multi-agent benchmarks, including StarCraft II Micro-Management, Multi-Agent MuJoCo, and Level-Based Foraging challenges, demonstrate consistent gains of our method over baseline algorithms and highlight the effectiveness of joint state-action learned embeddings within a multi-agent model-based paradigm.

[MA-3] Bench-MFG: A Benchmark Suite for Learning in Stationary Mean Field Games

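[Quick read]: This paper addresses the lack of a standardized evaluation protocol at the intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL). Without one, existing work relies on isolated and simplistic environments, which makes it hard to assess the robustness, generalization, and failure modes of algorithms objectively. The key to the solution is a comprehensive MFG benchmark suite (Bench-MFG) that focuses on the discrete-time, discrete-state, stationary setting for clarity. It introduces a taxonomy of problem classes (no-interaction, monotone, potential, and dynamics-coupled games) with prototypical environments for each, and proposes MF-Garnets, a method for generating random MFG instances that supports rigorous statistical testing. By benchmarking a range of learning algorithms on these environments, including a novel black-box exploitability-minimization approach (MF-PSO), the authors derive guidelines for standardizing future experimental comparisons.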

Link: https://arxiv.org/abs/2602.12517
Authors: Lorenzo Magnino, Jiacheng Shen, Matthieu Geist, Olivier Pietquin, Mathieu Laurière
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments:

Abstract:The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large-scale multi-agent systems. However, the field currently lacks a standardized evaluation protocol, forcing researchers to rely on bespoke, isolated, and often simplistic environments. This fragmentation makes it difficult to assess the robustness, generalization, and failure modes of emerging methods. To address this gap, we propose a comprehensive benchmark suite for MFGs (Bench-MFG), focusing on the discrete-time, discrete-space, stationary setting for the sake of clarity. We introduce a taxonomy of problem classes, ranging from no-interaction and monotone games to potential and dynamics-coupled games, and provide prototypical environments for each. Furthermore, we propose MF-Garnets, a method for generating random MFG instances to facilitate rigorous statistical testing. We benchmark a variety of learning algorithms across these environments, including a novel black-box approach (MF-PSO) for exploitability minimization. Based on our extensive empirical results, we propose guidelines to standardize future experimental comparisons. Code available at this https URL.

[MA-4] Building Large-Scale Drone Defenses from Small-Team Strategies

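[Quick read]: This paper addresses the difficulty of scaling multi-agent optimization methods to the defense against large adversarial drone swarms. Conventional methods lose efficiency sharply as team size grows and cannot coordinate large defending forces effectively. The key to the solution is a modular framework that treats strategies already proven effective in small defender teams as building blocks and assembles them into large defending forces in polynomial time through a dynamic programming (DP) decomposition. Because an individual unit may perform worse once combined, the framework iterates between evaluating large-team outcomes and refining the pool of modules. This yields scalable and effective defense strategies and reveals cooperative behaviours that direct optimisation cannot reliably discover.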

Link: https://arxiv.org/abs/2602.12502
Authors: Grant Douglas, Stephen Franklin, Claudia Szabo, Mingyu Guo
Affiliation: University of Adelaide
Categories: Multiagent Systems (cs.MA)
Comments: 13 pages, 8 figures

Abstract:Defending against large adversarial drone swarms requires coordination methods that scale effectively beyond conventional multi-agent optimisation. In this paper, we propose to scale strategies proven effective in small defender teams by integrating them as modular components of larger forces using our proposed framework. A dynamic programming (DP) decomposition assembles these components into large teams in polynomial time, enabling efficient construction of scalable defenses without exhaustive evaluation. Because a unit that is strong in isolation may not remain strong when combined, we sample across multiple small-team candidates. Our framework iterates between evaluating large-team outcomes and refining the pool of modular components, allowing convergence on increasingly effective strategies. Experiments demonstrate that this partitioning approach scales to substantially larger scenarios while preserving effectiveness and revealing cooperative behaviours that direct optimisation cannot reliably discover.
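
The DP decomposition mentioned in the abstract can be pictured with a small sketch. This is not the authors' algorithm: it assumes, purely for illustration, that each small-team module has a fixed size and an additive score, and the module names and scores below are hypothetical. Under those assumptions, filling a team of N defenders is an unbounded-knapsack-style DP that runs in O(N × number of modules) time.

```python
from typing import Dict, List, Optional, Tuple


def assemble_team(
    n_defenders: int, modules: Dict[str, Tuple[int, float]]
) -> Tuple[float, List[str]]:
    """Pick modules (repetition allowed) whose sizes sum to n_defenders,
    maximising the total score. Scores are assumed additive (illustrative only)."""
    neg_inf = float("-inf")
    best = [neg_inf] * (n_defenders + 1)          # best[n]: best score using exactly n defenders
    choice: List[Optional[str]] = [None] * (n_defenders + 1)
    best[0] = 0.0
    for n in range(1, n_defenders + 1):
        for name, (size, score) in modules.items():
            if size <= n and best[n - size] != neg_inf:
                candidate = best[n - size] + score
                if candidate > best[n]:
                    best[n] = candidate
                    choice[n] = name
    if best[n_defenders] == neg_inf:
        raise ValueError("no combination of modules fills the team exactly")
    # Walk back through the recorded choices to recover the selected modules.
    plan: List[str] = []
    n = n_defenders
    while n > 0:
        name = choice[n]
        plan.append(name)
        n -= modules[name][0]
    return best[n_defenders], plan


# Hypothetical small-team modules: name -> (team size, standalone score).
modules = {"pair": (2, 3.0), "trio": (3, 5.0), "quad": (4, 6.5)}
score, plan = assemble_team(10, modules)
print(score, sorted(plan))
```

The additive-score assumption is exactly what the abstract warns can fail ("a unit that is strong in isolation may not remain strong when combined"), which is why the paper re-evaluates large-team outcomes and refines the module pool rather than trusting a single pass like this one.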

[MA-5] Theory of Mind Guided Strategy Adaptation for Zero-Shot Coordination AAMAS2026

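[Quick read]: This paper addresses how agents in multi-agent reinforcement learning can adapt to previously unseen teammates in a zero-shot setting. Existing methods usually follow a two-stage pipeline: first build a diverse training pool of partner agents, then train a best-response agent to collaborate with the entire pool. Such methods tend to converge to a static, generalist policy rather than a specialist policy that adjusts dynamically to each teammate, which limits coordination efficiency. The paper proposes an adaptive ensemble agent based on Theory of Mind. Its key idea is to infer the teammate's intentions and then select the most suitable policy from a policy ensemble, achieving higher synergy.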

Link: https://arxiv.org/abs/2602.12458
Authors: Andrew Ni, Simon Stepputtis, Stefanos Nikolaidis, Michael Lewis, Katia P. Sycara, Woojun Kim
Affiliation: Carnegie Mellon University; Virginia Tech; University of Southern California; University of Pittsburgh
Categories: Multiagent Systems (cs.MA)
Comments: Accepted at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract:A central challenge in multi-agent reinforcement learning is enabling agents to adapt to previously unseen teammates in a zero-shot fashion. Prior work in zero-shot coordination often follows a two-stage process, first generating a diverse training pool of partner agents, and then training a best-response agent to collaborate effectively with the entire training pool. While many previous works have achieved strong performance by devising better ways to diversify the partner agent pool, there has been less emphasis on how to leverage this pool to build an adaptive agent. One limitation is that the best-response agent may converge to a static, generalist policy that performs reasonably well across diverse teammates, rather than learning a more adaptive, specialist policy that can better adapt to teammates and achieve higher synergy. To address this, we propose an adaptive ensemble agent that uses Theory-of-Mind-based best-response selection to first infer its teammate’s intentions and then select the most suitable policy from a policy ensemble. We conduct experiments in the Overcooked environment to evaluate zero-shot coordination performance under both fully and partially observable settings. The empirical results demonstrate the superiority of our method over a single best-response baseline.

[MA-6] Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

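[Quick read]: This paper addresses the problem of static, fixed capabilities in deployed large language models (LLMs): a traditional monolithic model encodes all procedural knowledge in its weights, which makes it hard to extend, costly to maintain, and slow to adapt to diverse tasks. The core idea is a skill abstraction layer, a modular design that lets an agent load composable packages of instructions, code, and resources on demand, extending its capabilities dynamically without retraining. The survey covers a paradigm built on progressive disclosure, portable skill definitions, and integration with the Model Context Protocol (MCP), and proposes a four-tier permission governance framework to keep the skill ecosystem secure and trustworthy, with the goal of advancing the next generation of self-improving agentic systems.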

Link: https://arxiv.org/abs/2602.12430
Authors: Renjun Xu, Yang Yan
Affiliation: Zhejiang University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within model weights, agent skills – composable packages of instructions, code, and resources that agents load on demand – enable dynamic capability extension without retraining. It is formalized in a paradigm of progressive disclosure, portable skill definitions, and integration with the Model Context Protocol (MCP). This survey provides a comprehensive treatment of the agent skills landscape, as it has rapidly evolved during the last few months. We organize the field along four axes: (i) architectural foundations, examining the this http URL specification, progressive context loading, and the complementary roles of skills and MCP; (ii) skill acquisition, covering reinforcement learning with skill libraries (SAGE), autonomous skill discovery (SEAgent), and compositional skill synthesis; (iii) deployment at scale, including the computer-use agent (CUA) stack, GUI grounding advances, and benchmark progress on OSWorld and SWE-bench; and (iv) security, where recent empirical analyses reveal that 26.1% of community-contributed skills contain vulnerabilities, motivating our proposed Skill Trust and Lifecycle Governance Framework – a four-tier, gate-based permission model that maps skill provenance to graduated deployment capabilities. We identify seven open challenges – from cross-platform skill portability to capability-based permission models – and propose a research agenda for realizing trustworthy, self-improving skill ecosystems. Unlike prior surveys that broadly cover LLM agents or tool use, this work focuses specifically on the emerging skill abstraction layer and its implications for the next generation of agentic systems. Project repo: this https URL.

[MA-7] Provably Convergent Actor-Critic in Risk-averse MARL

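[Quick read]: This paper addresses the fundamental problem of learning stationary policies in infinite-horizon general-sum Markov games (MGs). Computing stationary forms of classic game-theoretic equilibria is computationally intractable, in sharp contrast to single-agent reinforcement learning or zero-sum games. To bridge this gap, the authors study Risk-averse Quantal response Equilibria (RQE), whose strong regularity conditions make the equilibrium learnable in multi-agent settings. The key contribution is a two-timescale Actor-Critic algorithm with a fast-timescale actor that updates the policy and a slow-timescale critic that estimates the value function. The authors prove global convergence with finite-sample guarantees, and experiments show better convergence properties than risk-neutral baselines.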

Link: https://arxiv.org/abs/2602.12386
Authors: Yizhou Zhang, Eric Mazumdar
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Abstract:Learning stationary policies in infinite-horizon general-sum Markov games (MGs) remains a fundamental open problem in Multi-Agent Reinforcement Learning (MARL). While stationary strategies are preferred for their practicality, computing stationary forms of classic game-theoretic equilibria is computationally intractable – a stark contrast to the comparative ease of solving single-agent RL or zero-sum games. To bridge this gap, we study Risk-averse Quantal response Equilibria (RQE), a solution concept rooted in behavioral game theory that incorporates risk aversion and bounded rationality. We demonstrate that RQE possesses strong regularity conditions that make it uniquely amenable to learning in MGs. We propose a novel two-timescale Actor-Critic algorithm characterized by a fast-timescale actor and a slow-timescale critic. Leveraging the regularity of RQE, we prove that this approach achieves global convergence with finite-sample guarantees. We empirically validate our algorithm in several environments to demonstrate superior convergence properties compared to risk-neutral baselines.

[MA-8] GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

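[Quick read]: This paper addresses the limited attention that current AI safety evaluation pays to multi-agent risks, in particular the lack of systematic measurement of harms such as coordination failure and conflict. Existing benchmarks mostly evaluate single-agent behavior and do not capture the complexity and risk of multiple interacting agents. The key to the solution is GT-HarmBench, a standardized benchmark of 2,009 high-stakes scenarios that span game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt, and Chicken, drawn from realistic contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, and the results are sensitive to prompt framing and ordering. Game-theoretic interventions improve socially beneficial outcomes by up to 18%. The work exposes substantial reliability gaps in multi-agent alignment and provides a standardized testbed for future research.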

Link: https://arxiv.org/abs/2602.12316
Authors: Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin
Affiliation: ETH Zürich; Berea College; University of Toronto; Vector Institute; Max Planck Institute for Intelligent Systems
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Abstract:Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 2,009 high-stakes scenarios spanning game-theoretic structures such as the Prisoner’s Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at this https URL.
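
The three game structures named in the abstract can be told apart by the ordering of payoffs in a symmetric 2x2 game. The sketch below uses the standard textbook labels (R: reward for mutual cooperation, S: sucker's payoff, T: temptation to defect, P: punishment for mutual defection). It is not taken from the benchmark's code, and the payoff numbers are hypothetical.

```python
def classify_symmetric_2x2(R: float, S: float, T: float, P: float) -> str:
    """Classify a symmetric 2x2 game by its payoff ordering (textbook conventions)."""
    if T > R > P > S:
        return "Prisoner's Dilemma"   # defection dominates, yet mutual cooperation is better for both
    if R > T >= P > S:
        return "Stag Hunt"            # cooperation is best but risky if the partner defects
    if T > R > S > P:
        return "Chicken"              # mutual defection is the worst outcome for both
    return "other"


# Hypothetical payoff values chosen only to satisfy each ordering.
examples = {
    "pd": classify_symmetric_2x2(R=3, S=0, T=5, P=1),
    "stag": classify_symmetric_2x2(R=4, S=0, T=3, P=2),
    "chicken": classify_symmetric_2x2(R=3, S=1, T=5, P=0),
}
print(examples)
```

In all three structures the socially beneficial action (cooperate) conflicts with an individual incentive or a coordination risk, which is what makes them useful probes for the failure cases the benchmark measures.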

[MA-9] OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization

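[Quick read]: This paper addresses the difficulty of generating high-performance CUDA kernels: finding good solutions in a combinatorial space of low-level transformations while hardware feedback is noisy and expensive and systematic exploration is hard. Large language models (LLMs) can generate functionally correct CUDA code but struggle to reach expert-level performance automatically. The key to the solution is OptiML, an end-to-end framework that formulates kernel optimization as search under verification. In the first stage, a Mixture-of-Thoughts generator (OptiML-G) produces an initial executable program from natural-language intent. In the second stage, a Monte Carlo Tree Search optimizer (OptiML-X) refines either synthesized or user-provided kernels over a space of LLM-driven edits, guided by a hardware-aware reward built from Nsight Compute profiler feedback. The composite objective combines runtime, hardware bottleneck proxies, and guardrails against regressions, yielding verified and interpretable performance improvements.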

Link: https://arxiv.org/abs/2602.12305
Authors: Arijit Bhattacharjee, Heng Ping, Son Vu Le, Paul Bogdan, Nesreen K. Ahmed, Ali Jannesari
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments:

Abstract:Generating high-performance CUDA kernels remains challenging due to the need to navigate a combinatorial space of low-level transformations under noisy and expensive hardware feedback. Although large language models can synthesize functionally correct CUDA code, achieving competitive performance requires systematic exploration and verification of optimization choices. We present OptiML, an end-to-end framework that maps either natural-language intent or input CUDA code to performance-optimized CUDA kernels by formulating kernel optimization as search under verification. OptiML consists of two decoupled stages. When the input is natural language, a Mixture-of-Thoughts generator (OptiML-G) acts as a proposal policy over kernel implementation strategies, producing an initial executable program. A search-based optimizer (OptiML-X) then refines either synthesized or user-provided kernels using Monte Carlo Tree Search over LLM-driven edits, guided by a hardware-aware reward derived from profiler feedback. Each candidate transformation is compiled, verified, and profiled with Nsight Compute, and evaluated by a composite objective that combines runtime with hardware bottleneck proxies and guardrails against regressions. We evaluate OptiML in both synthesis-and-optimize and optimization-only settings on a diverse suite of CUDA kernels. Results show that OptiML consistently discovers verified performance improvements over strong LLM baselines and produces interpretable optimization trajectories grounded in profiler evidence.

[MA-10] Interference-Robust Non-Coherent Over-the-Air Computation for Decentralized Optimization

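[Quick read]: This paper addresses the bias in consensus estimates and the degraded convergence of optimization algorithms that non-coherent over-the-air (NCOTA) computation suffers under external interference. NCOTA needs no channel state information and no scheduling and scales well, but its sensitivity to interference limits practical use. The key to the solution is an interference-robust NCOTA (IR-NCOTA) scheme: all nodes apply a coordinated random rotation of the frame of reference and transmit a pseudo-random pilot signal, which turns external interference into a circularly symmetric, zero-mean distribution relative to the rotated frame. The consensus estimates therefore remain unbiased, and the convergence guarantees of the underlying optimization algorithm are preserved.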

Link: https://arxiv.org/abs/2602.12426
Authors: Nicolò Michelusi
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: To appear at IEEE ICC 2026

Abstract:Non-coherent over-the-air (NCOTA) computation enables low-latency and bandwidth-efficient decentralized optimization by exploiting the average energy superposition property of wireless channels. It has recently been proposed as a powerful tool for executing consensus-based optimization algorithms in fully decentralized systems. A key advantage of NCOTA is that it enables unbiased consensus estimation without channel state information at either transmitters or receivers, requires no transmission scheduling, and scales efficiently to dense network deployments. However, NCOTA is inherently susceptible to external interference, which can bias the consensus estimate and deteriorate the convergence of the underlying decentralized optimization algorithm. In this paper, we propose a novel interference-robust (IR-)NCOTA scheme. The core idea is to apply a coordinated random rotation of the frame of reference across all nodes, and transmit a pseudo-random pilot signal, which transforms external interference into a circularly symmetric distribution with zero mean relative to the rotated frame. This ensures that the consensus estimates remain unbiased, preserving the convergence guarantees of the underlying optimization algorithm. Through numerical results on a classification task, it is demonstrated that IR-NCOTA exhibits superior performance over the baseline NCOTA algorithm in the presence of external interference.
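
The statistical fact behind the rotation idea can be checked numerically. The toy sketch below is not the IR-NCOTA scheme itself (it models no channel, pilot, or energy-based estimator). It only assumes a fixed complex interferer observed in a frame rotated by a phase drawn uniformly from [0, 2π), and shows that the interferer averages to approximately zero in that frame. The interference value and sample count are arbitrary.

```python
import cmath
import math
import random


def mean_in_rotated_frame(interference: complex, n_samples: int, seed: int = 0) -> complex:
    """Average a fixed interferer as seen from frames rotated by uniform random phases."""
    rng = random.Random(seed)
    total = 0j
    for _ in range(n_samples):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        # Expressing the interferer in a frame rotated by theta multiplies it by exp(-j*theta).
        total += interference * cmath.exp(-1j * theta)
    return total / n_samples


fixed_interference = 2.0 + 1.0j   # arbitrary value; without rotation its mean is itself (a bias)
rotated_mean = mean_in_rotated_frame(fixed_interference, 100_000)
print(abs(fixed_interference), abs(rotated_mean))
```

Because the rotation is coordinated across nodes, the desired signal can be de-rotated consistently, whereas an external interferer that does not know the rotation sequence is the part that averages out. That last step is described in the abstract but not modelled here.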

Natural Language Processing

[NLP-0] Semantic Chunking and the Entropy of Natural Language

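[Quick read]: This paper addresses the quantitative modeling of the entropy rate of natural language, in particular why the entropy rate of printed English is about one bit per character, which implies roughly 80% redundancy. The traditional view attributes this redundancy to the statistical regularities of language but lacks a theoretical framework that explains its multi-scale semantic organization structurally. The key to the solution is a statistical model based on self-similar segmentation, which recursively breaks text into semantically coherent chunks down to the single-word level and thereby models the structure of natural language hierarchically at the semantic level. The model quantitatively reproduces the entropy rate of real texts and also shows that the entropy rate is not fixed: it increases systematically with the semantic complexity of the corpus, which is captured by the model's only free parameter.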

Link: https://arxiv.org/abs/2602.13194
Authors: Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks
Affiliation: Institute for Advanced Study; Weizmann Institute of Science; Emory University
Categories: Computation and Language (cs.CL); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
Comments: 29 pages, 9 figures

Abstract:The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which are captured by the only free parameter in our model.
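
The "nearly 80 percent redundancy" figure in the abstract follows from a one-line calculation: redundancy is one minus the ratio of the actual entropy rate to the maximum possible rate. The sketch below reproduces it with the abstract's numbers (1 bit versus 5 bits per character). The second calculation, using a 27-symbol alphabet of 26 letters plus space, is an added assumption included only to show where a figure close to five bits can come from.

```python
import math


def redundancy(entropy_rate_bits: float, max_bits: float) -> float:
    """Fraction of the maximum per-character information that is redundant."""
    return 1.0 - entropy_rate_bits / max_bits


# Numbers quoted in the abstract: about 1 bit/char against 5 bits/char for random text.
r_five = redundancy(1.0, 5.0)

# Assumption for illustration: uniform random text over 26 letters plus space.
r_27 = redundancy(1.0, math.log2(27))

print(round(r_five, 3), round(r_27, 3))
```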

[NLP-1] CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

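[Quick read]: This paper addresses two problems in how current video language models (VideoLMs) process video: keyframe sampling has sparse temporal coverage and can miss both macro-level events and micro-level details, and processing the full image and its tokens for every frame is computationally expensive. The key to the solution is to use video codec primitives such as motion vectors and residuals, which natively capture the redundancy and sparsity of video and avoid expensive full-image encoding for most frames. The authors design lightweight transformer-based encoders that aggregate these codec features and align them with image encoder embeddings through a pre-training strategy, which speeds up convergence during end-to-end fine-tuning. The method reduces time-to-first-token by up to 86% and token usage by up to 93%, while maintaining or exceeding performance on 14 diverse video understanding benchmarks.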

Link: https://arxiv.org/abs/2602.13191
Authors: Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
Affiliation: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL

Abstract:Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit to the maximum context window constraint, current methods use keyframe sampling which can miss both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.

[NLP-2] Quantization-Robust LLM Unlearning via Low-Rank Adaptation

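[Quick read]: This paper addresses the loss of unlearning effectiveness when a large language model (LLM) that has undergone machine unlearning is subsequently compressed with post-training quantization (PTQ). Specifically, low-bit (for example 4-bit) quantization can mask or erase the updates produced by standard full-parameter fine-tuning, so the model reverts to its pre-unlearning behavior. The key to the solution is low-rank adaptation (LoRA): the base model is frozen and unlearning is concentrated in trainable adapter modules, so that the effective update survives quantization. Experiments show that LoRA substantially improves unlearning performance and privacy protection under 4-bit quantization, supporting its use in deployments that require quantization.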

Link: https://arxiv.org/abs/2602.13151
Authors: João Vitor Boer Abitante, Joana Meneguzzo Pasquali, Luan Fonseca Garcia, Ewerton de Oliveira, Thomas da Silva Paula, Rodrigo C. Barros, Lucas S. Kupssinskü
Affiliation: 1. Universidade Federal do Rio Grande do Sul; 2. Instituto de Informática da Universidade Federal do Rio Grande do Sul; 3. Centro Universitário La Salle
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Model (LLM) unlearning aims to remove targeted knowledge from a trained model, but practical deployments often require post-training quantization (PTQ) for efficient inference. However, aggressive low-bit PTQ can mask or erase unlearning updates, causing quantized models to revert to pre-unlearning behavior. We show that standard full-parameter fine-tuning often induces parameter changes that are too small to survive 4-bit quantization. We propose quantization-robust unlearning via low-rank adaptation (LoRA): we freeze the base model and concentrate unlearning into trainable adapters so that the effective update is preserved after quantization. On Llama-2-7B evaluated with the MUSE dataset (BOOKS and NEWS), LoRA improves 4-bit utility by up to 7.93 points (NPO+GDR on BOOKS: 50.17 to 58.10) and yields higher 4-bit utility on NEWS for GA+GDR (40.06 to 44.82, increase of 4.76). LoRA also substantially reduces privacy leakage under 4-bit PTQ, e.g., for GA+KLR on BOOKS, PrivLeak moves from -25.68 to -5.86 (closer to ideal 0), while maintaining strong forgetting (VerMem and KnowMem near 0). Thus, using LoRA for Machine Unlearning is beneficial for scenarios where quantization is necessary for model deployment.
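
The claim that small parameter changes do not survive 4-bit quantization can be illustrated with a toy quantizer. The sketch below assumes a symmetric round-to-nearest scheme with a single absmax scale, which is a simplification chosen for illustration and not necessarily the PTQ method evaluated in the paper. The weights and update sizes are made up. It does not model LoRA; it only shows the masking effect that motivates it.

```python
from typing import List, Tuple


def quantize_4bit(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantisation to integer codes in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7.0          # one absmax scale for the whole tensor
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


base = [0.70, -0.32, 0.21, 0.04]                         # hypothetical pre-unlearning weights

# Small update (well under half a quantisation step of about 0.05): codes stay the same.
small_update = [w + d for w, d in zip(base, [0.0, 0.01, -0.01, 0.005])]

# Larger update (about one full step of 0.1): codes change, so the update survives.
large_update = [w + d for w, d in zip(base, [0.0, 0.10, -0.10, 0.10])]

base_codes, _ = quantize_4bit(base)
small_codes, _ = quantize_4bit(small_update)
large_codes, _ = quantize_4bit(large_update)
print(base_codes, small_codes, large_codes)
```

After quantization, the model with the small update is indistinguishable from the original, which is the "revert to pre-unlearning behavior" failure the abstract describes.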

[NLP-3] OpenLID-v3: Improving the Precision of Closely Related Language Identification – An Experience Report EACL2026

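[Quick read]: This paper addresses the limited accuracy of existing language identification (LID) tools when handling closely related languages and when separating valid natural language from noise. These errors contaminate language-specific subsets, especially for low-resource languages, and lower the quality of multilingual corpora. The key to the solution is to extend the OpenLID classifier in three ways: add training data to improve generalization; merge ambiguous language variant clusters (such as Bosnian, Croatian, and Serbian) to reduce misclassification; and introduce a dedicated label for marking noise. The extended system, OpenLID-v3, is evaluated against GlotLID on multiple benchmarks, with a focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages).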

Link: https://arxiv.org/abs/2602.13139
Authors: Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer
Affiliation: University of Oslo
Categories: Computation and Language (cs.CL)
Comments: VarDial’26 workshop at the EACL 2026 conference

Abstract:Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on this https URL.

[NLP-4] From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media

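[Quick read]: This paper asks whether the mechanisms and drivers of word creation (neology) are the same across different settings of language use, such as published writing and social media. The key to the approach is to extend earlier distributional semantic analysis based on static word embeddings to contextual embeddings and apply it to a new corpus of Twitter posts. The two factors previously found to correlate with neology in historical published texts hold in both domains, but the topic popularity growth factor may contribute less on Twitter than in published writing, suggesting that the two domains favour different neologism formation mechanisms.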

Link: https://arxiv.org/abs/2602.13123
Authors: Maria Ryskina, Matthew R. Gormley, Kyle Mahowald, David R. Mortensen, Taylor Berg-Kirkpatrick, Vivek Kulkarni
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to LChange 2026

Abstract:Living languages are shaped by a host of conflicting internal and external evolutionary pressures. While some of these pressures are universal across languages and cultures, others differ depending on the social and conversational context: language use in newspapers is subject to very different constraints than language use on social media. Prior distributional semantic work on English word emergence (neology) identified two factors correlated with creation of new words by analyzing a corpus consisting primarily of historical published texts (Ryskina et al., 2020, arXiv:2001.07740). Extending this methodology to contextual embeddings in addition to static ones and applying it to a new corpus of Twitter posts, we show that the same findings hold for both domains, though the topic popularity growth factor may contribute less to neology on Twitter than in published writing. We hypothesize that this difference can be explained by the two domains favouring different neologism formation mechanisms.

[NLP-5] SCOPE: Selective Conformal Optimized Pairwise LLM Judging

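Quick read: This paper addresses the unreliability of large language models (LLMs) used as judges in pairwise evaluation, which stems from miscalibration and systematic biases. Existing methods struggle to keep the error rate of non-abstained judgments within an acceptable range and lack an effective uncertainty signal to guide selective judging. The key to the solution is the SCOPE framework, which offers finite-sample statistical guarantees: under an exchangeability assumption, it calibrates an acceptance threshold so that the error rate among non-abstained judgments does not exceed a user-specified level α. The paper also introduces Bidirectional Preference Entropy (BPE), which queries the judge under both response orders, aggregates the implied preference probabilities to enforce order invariance, and converts the result into a more robust uncertainty score. Experiments show that SCOPE consistently meets the risk constraint on several benchmarks and accepts up to 2.4× more judgments than naïve baselines (on MT-Bench with Qwen-7B).

Link: https://arxiv.org/abs/2602.13110
Authors: Sher Badshah,Ali Emami,Hassan Sajjad
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: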

Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching 0.89 on RewardBench with Qwen-14B and 0.98 on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
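The two ingredients described in the abstract, an order-invariant entropy score and a calibrated acceptance threshold, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the aggregation is assumed to be a simple average, and the threshold rule is a plain empirical one that omits the finite-sample correction SCOPE relies on for its guarantee.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) preference; 0 when the judge is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def bidirectional_preference_entropy(p_a_when_first: float, p_a_when_second: float):
    """Average the judge's P(A wins) over both response orders (assumed
    aggregation), then turn the averaged probability into an uncertainty score."""
    p_a = 0.5 * (p_a_when_first + p_a_when_second)
    return p_a, binary_entropy(p_a)

def calibrate_threshold(entropies, correct, alpha):
    """Largest entropy cutoff whose accepted calibration judgments have an
    empirical error rate of at most alpha (no finite-sample correction)."""
    best = None
    for tau in sorted(set(entropies)):
        accepted = [c for e, c in zip(entropies, correct) if e <= tau]
        error = 1.0 - sum(accepted) / len(accepted)
        if error <= alpha:
            best = tau
    return best
```

At test time, a judgment would be accepted only when its entropy score is at or below the calibrated cutoff; everything else is an abstention.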

[NLP-6] Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts

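Quick read: This paper asks how natural language processing (NLP) can be used to build more explainable and generalizable machine learning models for automatically assessing second-language writing proficiency. The core challenge is that existing research lacks a systematic way of explicitly linking linguistic features to the development of language ability. The key to the solution is to carefully select features directly related to linguistic complexity and correctness (lexical, morphological, surface, and error features) rather than task-specific input variables. This keeps classification accuracy high (around 0.9) while reducing variation in classification across text types, which improves stability and interpretability.

Link: https://arxiv.org/abs/2602.13102
Authors: Kais Allkivi
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: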

Abstract:Using NLP to analyze authentic learner language helps to build automated assessment and feedback tools. It also offers new and extensive insights into the development of second language production. However, there is a lack of research explicitly combining these aspects. This study aimed to classify Estonian proficiency examination writings (levels A2-C1), assuming that careful feature selection can lead to more explainable and generalizable machine learning models for language testing. Various linguistic properties of the training data were analyzed to identify relevant proficiency predictors associated with increasing complexity and correctness, rather than the writing task. Such lexical, morphological, surface, and error features were used to train classification models, which were compared to models that also allowed for other features. The pre-selected features yielded a similar test accuracy but reduced variation in the classification of different text types. The best classifiers achieved an accuracy of around 0.9. Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets. The results have been implemented in the writing evaluation module of an Estonian open-source language learning environment.

[NLP-7] Consistency of Large Reasoning Models Under Multi-Turn Attacks

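Quick read: This paper addresses the robustness of reasoning models under multi-turn adversarial attacks: frontier reasoning models perform well on complex tasks, but their stability under sustained adversarial pressure remains underexplored. The key to the approach is a systematic evaluation of the vulnerability profiles of nine frontier reasoning models, which identifies five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue). The paper also finds that a confidence-based defense, Confidence-Aware Response Generation (CARG), fails for reasoning models because extended reasoning traces induce overconfidence, while random confidence embedding outperforms targeted extraction. Reasoning capability therefore does not automatically confer robustness, and confidence-based defenses need fundamental redesign for reasoning models.

Link: https://arxiv.org/abs/2602.13093
Authors: Yubo Li,Ramayya Krishnan,Rema Padman
Affiliation: Carnegie Mellon University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: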

Abstract:Large reasoning models with reasoning capabilities achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue) with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.

[NLP-8] Exploring a New Competency Modeling Process with Large Language Models

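Quick read: This paper addresses the high cost, randomness, ambiguity, and poor reproducibility of traditional competency modeling, which relies on experts manually analysing large volumes of interview transcripts. The key to the solution is a new competency modeling process built on large language models (LLMs). Expert practice is decomposed into structured computational components: LLMs extract behavioral and psychological descriptions from raw text, and these are mapped to predefined competency libraries through embedding similarity. A learnable parameter adaptively integrates the information sources and so determines the relative importance of behavioral and psychological signals. An offline evaluation procedure allows systematic model selection without additional large-scale data collection. The result is a more transparent, data-driven, and verifiable modeling process.

Link: https://arxiv.org/abs/2602.13084
Authors: Silin Du,Manqing Xin,Raymond Jia Wang
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: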

Abstract:Competency modeling is widely used in human resource management to select, develop, and evaluate talent. However, traditional expert-driven approaches rely heavily on manual analysis of large volumes of interview transcripts, making them costly and prone to randomness, ambiguity, and limited reproducibility. This study proposes a new competency modeling process built on large language models (LLMs). Instead of merely automating isolated steps, we reconstruct the workflow by decomposing expert practices into structured computational components. Specifically, we leverage LLMs to extract behavioral and psychological descriptions from raw textual data and map them to predefined competency libraries through embedding-based similarity. We further introduce a learnable parameter that adaptively integrates different information sources, enabling the model to determine the relative importance of behavioral and psychological signals. To address the long-standing challenge of validation, we develop an offline evaluation procedure that allows systematic model selection without requiring additional large-scale data collection. Empirical results from a real-world implementation in a software outsourcing company demonstrate strong predictive validity, cross-library consistency, and structural robustness. Overall, our framework transforms competency modeling from a largely qualitative and expert-dependent practice into a transparent, data-driven, and evaluable analytical process.
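The mapping step in the abstract (embedding similarity plus a learnable weight over behavioral and psychological signals) might look roughly like the sketch below. The abstract does not give the exact form, so the convex combination, the sigmoid parameterisation of the weight, and the toy two-dimensional "embeddings" are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def competency_scores(behavior_vec, psych_vec, library, theta):
    """Score each library competency as a weighted mix of two similarities.

    theta is the (assumed) learnable parameter; the sigmoid keeps the
    behavioral weight w in (0, 1) and the psychological weight at 1 - w.
    """
    w = 1.0 / (1.0 + math.exp(-theta))
    return {
        name: w * cosine(behavior_vec, vec) + (1.0 - w) * cosine(psych_vec, vec)
        for name, vec in library.items()
    }

# Hypothetical competency library with toy embeddings.
library = {"teamwork": [1.0, 0.0], "initiative": [0.0, 1.0]}
balanced = competency_scores([1.0, 0.0], [0.0, 1.0], library, theta=0.0)
behavior_heavy = competency_scores([1.0, 0.0], [0.0, 1.0], library, theta=10.0)
```

With `theta=0.0` both signals count equally; a large positive `theta` lets the behavioral description dominate the ranking.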

[NLP-9] LCSB: Layer-Cyclic Selective Backpropagation for Memory-Efficient On-Device LLM Fine-Tuning

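Quick read: This paper addresses the computational cost of first-order fine-tuning of large language models (LLMs) on memory-constrained devices such as mobile phones. Memory-efficient backpropagation (MeBP) must run the backward pass through all transformer layers at every step, and weight decompression alone accounts for 32–42% of backward time. The key to the solution is Layer-Cyclic Selective Backpropagation (LCSB), which computes gradients for only a subset of layers per step. It relies on residual connections to carry gradients through identity paths and on AdamW momentum to provide implicit updates for non-selected layers. LCSB is interpreted as Block Coordinate Descent on the LoRA parameter space, which gives a theoretical justification for convergence. It achieves up to 1.40× speedup with less than 2% quality degradation across five models and three tasks, and it is more stable in 4-bit quantized settings, which suggests an implicit regularization effect from selective gradient computation.

Link: https://arxiv.org/abs/2602.13073
Authors: Juneyoung Park,Eunbeen Yoon,Seongwan Kim,Jaeho Lee
Affiliation: Opt-AI Inc.
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Under the review, 13 pages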

Abstract: Memory-efficient backpropagation (MeBP) has enabled first-order fine-tuning of large language models (LLMs) on mobile devices with less than 1GB memory. However, MeBP requires backward computation through all transformer layers at every step, where weight decompression alone accounts for 32–42% of backward time. We propose Layer-Cyclic Selective Backpropagation (LCSB), which computes gradients for only a subset of layers per step. Our key insight is that residual connections guarantee gradient flow through identity paths, while AdamW momentum provides implicit updates for non-selected layers. We interpret LCSB as Block Coordinate Descent on the LoRA parameter space, providing theoretical justification for convergence. LCSB achieves up to $1.40\times$ speedup with less than 2% quality degradation across five models and three tasks. Surprisingly, in 4-bit quantized settings, LCSB exhibits superior stability: a 3B model that completely diverges under full backpropagation converges smoothly with LCSB, suggesting an implicit regularization effect from selective gradient computation.
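The idea of computing gradients for only a subset of layers per step can be illustrated with a simple round-robin schedule. The abstract does not specify how LCSB picks its subset, so the fixed-size cyclic window below is an assumption, and `layer_cyclic_schedule` is a hypothetical name.

```python
def layer_cyclic_schedule(num_layers: int, layers_per_step: int, num_steps: int):
    """Yield, for each training step, the layer indices that receive gradients.

    The window of selected layers advances cyclically, so every layer is
    visited once per full cycle (a block-coordinate-style sweep).
    """
    start = 0
    for _ in range(num_steps):
        yield [(start + i) % num_layers for i in range(layers_per_step)]
        start = (start + layers_per_step) % num_layers

# Six layers, two updated per step: the fourth step wraps back to layers 0 and 1.
schedule = list(layer_cyclic_schedule(num_layers=6, layers_per_step=2, num_steps=4))
```

In a real training loop, layers outside the current window would skip their backward computation for that step while the optimizer state for them is kept.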

[NLP-10] Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning

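Quick read: This paper addresses the severe memory constraints on on-device fine-tuning of large language models (LLMs) on mobile devices. Existing methods either compute exact gradients at a high memory cost (MeBP) or use little memory but produce noisy gradient estimates (MeZO). The key to the solution is Memory-efficient Structured Backpropagation (MeSP), which manually derives backward passes that exploit the low-rank structure of LoRA (Low-Rank Adaptation): the intermediate projection $h = xA$ is not stored during the forward pass and is instead recomputed during the backward pass at minimal cost, since the rank satisfies $r \ll d_{\text{in}}$. MeSP computes gradients mathematically identical to those of MeBP while reducing memory by 49% on average on Qwen2.5 models (0.5B–3B). The analysis also shows that MeZO's gradient estimates have near-zero correlation with the true gradients (cosine similarity ≈ 0.001), which explains its slow convergence.

Link: https://arxiv.org/abs/2602.13069
Authors: Juneyoung Park,Yuri Hong,Seongwan Kim,Jaeho Lee
Affiliation: Opt-AI Inc.
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Under the review, 11 pages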

Abstract: On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6–12GB shared across all workloads. Existing approaches force a trade-off between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA’s low-rank structure. Our key insight is that the intermediate projection $h = xA$ can be recomputed during backward at minimal cost since rank $r \ll d_{\text{in}}$, eliminating the need to store it. MeSP achieves 49% average memory reduction compared to MeBP on Qwen2.5 models (0.5B–3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO’s gradient estimates show near-zero correlation with true gradients (cosine similarity $\approx 0.001$), explaining its slow convergence. MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.
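The recomputation trick in the abstract can be shown on the LoRA branch alone, where the output is $(xA)B$. The sketch below is a toy pure-Python version written for illustration, not the authors' code: it covers only the low-rank branch, uses lists of lists as matrices, and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def lora_forward(x, A, B):
    """LoRA branch output (xA)B; h = xA is a local that is not kept for backward."""
    h = matmul(x, A)       # shape (n, r), small because r << d_in
    return matmul(h, B)

def lora_backward(x, A, B, grad_out):
    """Gradients for A and B, recomputing h = xA instead of reading a stored copy."""
    h = matmul(x, A)                           # recomputed, shape (n, r)
    grad_B = matmul(transpose(h), grad_out)    # shape (r, d_out)
    grad_h = matmul(grad_out, transpose(B))    # shape (n, r)
    grad_A = matmul(transpose(x), grad_h)      # shape (d_in, r)
    return grad_A, grad_B

# Toy example with d_in = 3, r = 2, d_out = 1 and loss = sum of outputs.
x = [[1.0, 2.0, 3.0]]
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[2.0], [3.0]]
grad_A, grad_B = lora_backward(x, A, B, grad_out=[[1.0]])
```

The only extra work in the backward pass is one product of size `n × d_in × r`, which is cheap when the rank is small; the gradients are the same as if `h` had been stored.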

[NLP-11] TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution

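Quick read: This paper addresses the lack of fine-grained attribution in question answering over structured tables (Table QA): existing systems may give correct answers but rarely identify the specific cells that support them, which limits trust in high-stakes settings. The key to the solution is TraceBack, a modular multi-agent framework for scalable cell-level attribution. It prunes the table to the relevant rows and columns, decomposes the question into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence. The authors also release the CITEBench benchmark and propose FairScore, a reference-less metric that estimates attribution precision and recall, to support interpretable and scalable evaluation.

Link: https://arxiv.org/abs/2602.13059
Authors: Tejas Anvekar,Junha Park,Rajat Jha,Devanshu Gupta,Poojah Ganesan,Puneeth Mathur,Vivek Gupta
Affiliation: Arizona State University; Adobe Research
Categories: Computation and Language (cs.CL)
Comments: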

Abstract:Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular multi-agent framework for scalable, cell-level attribution in single-table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FetaQA, and AITQA. We further propose FairScore, a reference-less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table-based QA.

[NLP-12] Can we trust AI to detect healthy multilingual English speakers among the cognitively impaired cohort in the UK? An investigation using real-world conversational speech

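Quick read: This paper examines whether current AI models for detecting cognitive impairment (such as dementia and mild cognitive impairment, MCI) are biased against multilingual speakers from ethnic minority backgrounds in the UK, and in particular whether healthy non-native English speakers risk being misdiagnosed. Although automatic speech recognition (ASR) systems showed no significant bias across groups, classification and regression models using acoustic and linguistic features were biased against multilingual speakers, particularly in memory, fluency, and reading tasks, and more so when trained on the publicly available DementiaBank dataset. Multilingual speakers were also more likely to be misclassified as having cognitive decline, and speakers with a South Yorkshire accent were more likely to be judged as having more severe decline. The authors conclude that more generalisable, bias-mitigated models are needed to make these tools clinically reliable and fair for these populations, and they plan to develop them in future work.

Link: https://arxiv.org/abs/2602.13047
Authors: Madhurananda Pahar,Caitlin Illingworth,Dorota Braun,Bahman Mirheidari,Lise Sproson,Daniel Blackburn,Heidi Christensen
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: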

Abstract:Conversational speech often reveals early signs of cognitive decline, such as dementia and MCI. In the UK, one in four people belongs to an ethnic minority, and dementia prevalence is expected to rise most rapidly among Black and Asian communities. This study examines the trustworthiness of AI models, specifically the presence of bias, in detecting healthy multilingual English speakers among the cognitively impaired cohort, to make these tools clinically beneficial. For experiments, monolingual participants were recruited nationally (UK), and multilingual speakers were enrolled from four community centres in Sheffield and Bradford. In addition to a non-native English accent, multilinguals spoke Somali, Chinese, or South Asian languages, who were further divided into two Yorkshire accents (West and South) to challenge the efficiency of the AI tools thoroughly. Although ASR systems showed no significant bias across groups, classification and regression models using acoustic and linguistic features exhibited bias against multilingual speakers, particularly in memory, fluency, and reading tasks. This bias was more pronounced when models were trained on the publicly available DementiaBank dataset. Moreover, multilinguals were more likely to be misclassified as having cognitive decline. This study is the first of its kind to discover that, despite their strong overall performance, current AI models show bias against multilingual individuals from ethnic minority backgrounds in the UK, and they are also more likely to misclassify speakers with a certain accent (South Yorkshire) as living with a more severe cognitive decline. In this pilot study, we conclude that the existing AI tools are therefore not yet reliable for diagnostic use in these populations, and we aim to address this in future work by developing more generalisable, bias-mitigated models.

[NLP-13] Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL

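Quick read: This paper addresses the exploration–exploitation imbalance caused by poorly chosen sampling temperatures when training large language models (LLMs) with Reinforcement Learning from Verifiable Rewards (RLVR). Existing methods use static values or heuristic adaptations that are decoupled from task-level rewards. The key to the solution is Introspective LLM, a hierarchical reinforcement learning framework in which, at each decoding step, the model selects a sampling temperature based on its hidden state. The temperature policy and the token policy are jointly optimized from downstream rewards with a coordinate ascent scheme. On mathematical reasoning benchmarks, the learned temperature policies outperform fixed and heuristic baselines and show interpretable exploration behaviour aligned with reasoning uncertainty.

Link: https://arxiv.org/abs/2602.13035
Authors: Yixiao Zhou,Yang Li,Dongzhou Cheng,Hehe Fan,Yu Cheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: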

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration–exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
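The abstract's claim that temperature modulates policy entropy can be checked with a small sketch of temperature-scaled softmax. This shows only the standard sampling mechanism, not the learned temperature policy of Introspective LLM; the logits are made-up values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into a probability distribution after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0]                     # hypothetical next-token logits
sharp = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

Lower temperatures concentrate probability on the top token (exploitation), while higher temperatures flatten the distribution (exploration); a learned policy would choose the temperature per decoding step instead of fixing it.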

[NLP-14] Buy versus Build an LLM: A Decision Framework for Governments

Quick read: This paper addresses the strategic choice governments face when deploying large language models (LLMs) for public-sector applications: whether to buy existing services, build domestic capabilities, or adopt a hybrid approach. The core challenge is balancing sovereignty, safety, cost, resource capability, cultural fit, and sustainability, especially when leading model providers are often foreign corporations and LLM outputs are increasingly used as inputs to public decision-making and public discourse. The key to the solution is a strategic framework that evaluates each pathway along these dimensions, helping policy-makers decide according to their national needs and societal goals. Building does not require governments to act alone: domestic capabilities may be developed through public research institutions, universities, state-owned enterprises, joint ventures, or broader national ecosystems.

Link: https://arxiv.org/abs/2602.13033
Authors: Jiahao Lu,Ziwei Xu,William Tjhi,Junnan Li,Antoine Bosselut,Pang Wei Koh,Mohan Kankanhalli
Affiliation: National University of Singapore; AI Singapore; EPFL; University of Washington; Allen Institute for AI; Salesforce AI Research
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: The short version of this document is published as an ACM TechBrief, and this document is published as an ACM Technology Policy Council white paper

Abstract:Large Language Models (LLMs) represent a new frontier of digital infrastructure that can support a wide range of public-sector applications, from general purpose citizen services to specialized and sensitive state functions. When expanding AI access, governments face a set of strategic choices over whether to buy existing services, build domestic capabilities, or adopt hybrid approaches across different domains and use cases. These are critical decisions especially when leading model providers are often foreign corporations, and LLM outputs are increasingly treated as trusted inputs to public decision-making and public discourse. In practice, these decisions are not intended to mandate a single approach across all domains; instead, national AI strategies are typically pluralistic, with sovereign, commercial and open-source models coexisting to serve different purposes. Governments may rely on commercial models for non-sensitive or commodity tasks, while pursuing greater control for critical, high-risk or strategically important applications. This paper provides a strategic framework for making this decision by evaluating these options across dimensions including sovereignty, safety, cost, resource capability, cultural fit, and sustainability. Importantly, “building” does not imply that governments must act alone: domestic capabilities may be developed through public research institutions, universities, state-owned enterprises, joint ventures, or broader national ecosystems. By detailing the technical requirements and practical challenges of each pathway, this work aims to serve as a reference for policy-makers to determine whether a buy or build approach best aligns with their specific national needs and societal goals. 
ACM classes: K.4.1; K.1; K.4.2; K.4.3; K.5.2; K.6.1; J.1
Cite as: arXiv:2602.13033 [cs.CY] (or arXiv:2602.13033v1 [cs.CY] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.13033 (arXiv-issued DOI via DataCite, pending registration)

[NLP-15] Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark Framework and Analysis

Quick read: This paper addresses the coarse granularity and limited interpretability of evaluation for image editing models. Traditional metrics often fail to reflect human perception and user intent, and they overlook controllability, edit localization, and faithfulness to instructions. The key to the solution is a fine-grained framework that uses a Multimodal Large Language Model (MLLM) as a judge and decomposes common evaluation notions into twelve interpretable factors spanning image preservation, edit quality, and instruction fidelity. It comes with a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics. Human studies show that the MLLM judges align closely with human evaluations at a fine granularity, while traditional metrics fail to distinguish over-edited or semantically imprecise outputs.

Link: https://arxiv.org/abs/2602.13028
Authors: Runzhou Liu(1),Hailey Weingord(2),Sejal Mittal(2),Prakhar Dungarwal(2),Anusha Nandula(2),Bo Ni(3),Samyadeep Basu(4),Hongjie Chen(5),Nesreen K. Ahmed(6),Li Li(7),Jiayi Zhang(8),Koustava Goswami(4),Subhojyoti Mukherjee(4),Branislav Kveton(4),Puneet Mathur(4),Franck Dernoncourt(4),Yue Zhao(7),Yu Wang(9),Ryan A. Rossi(4),Zhengzhong Tu(10),Hongru Du(1) ((1) University of Virginia, (2) Columbia University, (3) Vanderbilt University, (4) Adobe Research, (5) Dolby Laboratories, (6) Cisco Research, (7) University of Southern California, (8) University of Wisconsin-Madison, (9) University of Oregon, (10) Texas A&M University)
Affiliation: University of Virginia; Columbia University; Vanderbilt University; Adobe Research; Dolby Laboratories; Cisco Research; University of Southern California; University of Wisconsin-Madison; University of Oregon; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.

[NLP-16] Know More Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models

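Quick read: This paper addresses a gap in current knowledge augmentation methods for large language models (LLMs): they overlook the knowledge-confidence gaps between a model's internal cognitive state and what it actually knows, so a model may be overconfident in errors or uncertain about facts it has right. The key to the solution is a meta-cognitive framework based on differentiated intervention and alignment. It uses the model's internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, which guides targeted knowledge expansion. It also introduces a cognitive consistency mechanism that synchronizes subjective certainty with objective accuracy to obtain calibrated knowledge boundaries.

Link: https://arxiv.org/abs/2602.12996
Authors: Hao Chen,Ye He,Yuchun Fan,Yukun Yan,Zhenghao Liu,Qingfu Zhu,Maosong Sun,Wanxiang Che
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: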

Abstract: Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate that our framework consistently outperforms strong baselines, validating its rationality in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns.

[NLP-17] Evaluating the Homogeneity of Keyphrase Prediction Models LREC2026

Quick read: This paper addresses the absence of homogeneity from the evaluation of keyphrase prediction models, that is, whether a model predicts the same keyphrases for documents that treat the same subjects but are worded differently. Keyphrase extraction methods can only return phrases that appear in the text, whereas keyphrase generation models can predict absent keyphrases, which should in principle improve homogeneity. The paper introduces a method for evaluating homogeneity and finds that extraction methods are competitive with generative models, and that the ability to generate absent keyphrases can actually harm homogeneity.

Link: https://arxiv.org/abs/2602.12989
Authors: Maël Houbre,Florian Boudin,Beatrice Daille
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to LREC 2026

Abstract: Keyphrases, which are useful in several NLP and IR applications, are either extracted from text or predicted by generative models. In contrast to keyphrase extraction approaches, keyphrase generation models can predict keyphrases that do not appear in a document’s text, called absent keyphrases. This ability means that keyphrase generation models can associate a document with a notion that is not explicitly mentioned in its text. Intuitively, this suggests that for two documents treating the same subjects, a keyphrase generation model is more likely to be homogeneous in their indexing, i.e. to predict the same keyphrases for both documents regardless of whether those keyphrases appear in their respective texts; something a keyphrase extraction model would fail to do. Yet, homogeneity of keyphrase prediction models is not covered by current benchmarks. In this work, we introduce a method to evaluate the homogeneity of keyphrase prediction models and study whether absent keyphrase generation capabilities actually help the model to be more homogeneous. To our surprise, we show that keyphrase extraction methods are competitive with generative models, and that the ability to generate absent keyphrases can actually have a negative impact on homogeneity. Our data, code and prompts are available on huggingface and github.

[NLP-18] SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

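Quick read: This paper addresses the fact that current scientific reasoning benchmarks largely overlook an agent's ability to orchestrate tools for domain-specific scientific workflows. It introduces SciAgentGym, a scalable interactive environment with 1,780 domain-specific tools across four natural science disciplines, and SciAgentBench, a tiered evaluation suite that tests agents from elementary actions to long-horizon workflows. To improve multi-step tool use, the authors propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. Fine-tuned on these trajectories, SciAgent-8B outperforms the much larger Qwen3-VL-235B-Instruct and shows positive cross-domain transfer.

Link: https://arxiv.org/abs/2602.12984
Authors: Yujiong Shen,Yajie Yang,Zhiheng Xi,Binze Hu,Huayu Sha,Jiazheng Zhang,Qiyuan Peng,Junlin Shang,Jixuan Huang,Yutao Fan,Jingqi Tong,Shihan Dou,Ming Zhang,Lei Bai,Zhenfei Yin,Tao Gui,Xingjun Ma,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang
Affiliation: Fudan NLP Group
Categories: Computation and Language (cs.CL)
Comments: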

Abstract:Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents’ ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
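Modelling a tool action space as a dependency graph, as SciForge is described as doing, can be illustrated with the standard-library `graphlib` module. The tool names and dependencies below are invented for the example, and the two helper functions are hypothetical; the sketch shows only how a dependency graph yields valid tool orderings, not how SciForge samples trajectories.

```python
from graphlib import TopologicalSorter

# Hypothetical tools: each maps to the set of tools whose outputs it consumes.
TOOL_DEPENDENCIES = {
    "load_structure": set(),
    "optimize_geometry": {"load_structure"},
    "compute_energy": {"optimize_geometry"},
    "plot_spectrum": {"compute_energy"},
}

def valid_tool_order(dependencies):
    """Return one ordering in which every tool runs after its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())

def is_valid_trajectory(trajectory, dependencies):
    """Check that each tool call appears only after all of its prerequisites."""
    seen = set()
    for tool in trajectory:
        if not dependencies[tool] <= seen:
            return False
        seen.add(tool)
    return True

order = valid_tool_order(TOOL_DEPENDENCIES)
```

A checker like `is_valid_trajectory` could also be used to filter generated trajectories so that only dependency-respecting sequences reach the training set.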

[NLP-19] ProbeLLM: Automating Principled Diagnosis of LLM Failures

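Quick read: This paper addresses the difficulty of capturing the systematic failure modes of rapidly evolving large language models (LLMs) with static evaluations. Existing automated probing methods tend to find isolated failure cases, lack principled control over exploration, and give limited insight into the structure of model weaknesses. The key to the solution is ProbeLLM, a benchmark-agnostic automated probing framework that formulates probing as a hierarchical Monte Carlo Tree Search, allocating a limited probing budget between global exploration of new failure regions and local refinement of recurring error patterns. Tool-augmented generation and verification keep the test cases verifiable, and failure-aware embeddings with boundary-aware induction consolidate the discovered failures into interpretable failure modes.

Link: https://arxiv.org/abs/2602.12966
Authors: Yue Huang,Zhengzhe Jiang,Yuchen Ma,Yu Jiang,Xiangqi Wang,Yujun Zhou,Yuexing Hao,Kehan Guo,Pin-Yu Chen,Stefan Feuerriegel,Xiangliang Zhang
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: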

Abstract:Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.
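The trade-off between exploring new failure regions and refining known ones is the kind of decision that the UCB1 rule handles in many Monte Carlo Tree Search implementations. The abstract does not say which selection rule ProbeLLM uses, so the sketch below is a generic UCB1 illustration with hypothetical region names and a failure rate standing in for the reward.

```python
import math

def ucb1_select(stats, exploration=1.4):
    """Pick the next region to probe.

    stats maps a region name to (visits, failures_found). The failure rate
    rewards regions that keep yielding errors; the square-root bonus rewards
    regions that have been probed rarely.
    """
    total_visits = sum(visits for visits, _ in stats.values())
    best_region, best_score = None, float("-inf")
    for region, (visits, failures) in stats.items():
        if visits == 0:
            return region  # unexplored regions are probed first
        exploit = failures / visits
        explore = exploration * math.sqrt(math.log(total_visits) / visits)
        if exploit + explore > best_score:
            best_region, best_score = region, exploit + explore
    return best_region
```

For example, with equal visit counts the region with the higher failure rate is chosen, while a region probed only once can outrank a heavily probed one because of its larger exploration bonus.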

[NLP-20] Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models EACL2026

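[Quick read]: This paper addresses the lack of large-scale multi-label annotated data for Arabic Dialect Identification (ADI), which makes it hard for models to handle sentences that are valid in several dialects. ADI has traditionally been modeled as single-label classification, yet the same sentence may be acceptable in multiple dialects, so negative samples are chosen poorly and generalization suffers. The solution has two key parts: first, a multi-label dataset is built automatically with GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi); second, a BERT-based multi-label classifier is trained with curriculum learning strategies aligned with dialectal complexity and label cardinality. The best model reaches a macro F1 of 0.69 on the MLADI leaderboard, compared with 0.55 for the strongest previously reported system.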

Link: https://arxiv.org/abs/2602.12937
Authors: Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at the 12th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2026

Abstract:Being modeled as a single-label classification task for a long time, recent work has argued that Arabic Dialect Identification (ADI) should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at this https URL.
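
The abstract mentions curriculum learning aligned with label cardinality without giving the schedule. The sketch below is one plausible reading, assuming that examples with fewer dialect labels count as easier and that later stages revisit earlier data; `curriculum_stages` and the data layout are hypothetical.

```python
def curriculum_stages(examples, num_stages=3):
    """Order multi-label examples by label cardinality and split them into cumulative stages.

    examples: list of (text, labels) pairs, where labels is a set of dialect codes.
    Returns a list of stages; stage i holds the (i + 1) easiest chunks.
    The easy-to-hard direction is an assumption of this sketch.
    """
    ordered = sorted(examples, key=lambda ex: len(ex[1]))  # stable sort, fewest labels first
    chunk = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[: chunk * (i + 1)] for i in range(num_stages)]
```

A training loop would then run one or more epochs per stage, so the classifier sees low-cardinality sentences before sentences that are acceptable in many dialects.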

[NLP-21] When Words Don't Mean What They Say: Figurative Understanding in Bengali Idioms LREC2026

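[Quick read]: This paper addresses the difficulty Large Language Models (LLMs) have in understanding figurative language in low-resource languages, focusing on Bengali. The key to the solution is a large-scale, culturally grounded idiom dataset of 10,361 Bengali idioms, annotated under a 19-field schema established through an expert consensus process and covering semantic, syntactic, cultural, and religious dimensions. On this dataset the authors build a benchmark and evaluate 30 state-of-the-art multilingual and instruction-tuned LLMs on inferring figurative meaning; no model exceeds 50% accuracy, far below human performance (83.4%), which exposes clear weaknesses in cross-linguistic and cultural reasoning. The work lays groundwork for improving figurative language understanding and cultural grounding in low-resource languages.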

Link: https://arxiv.org/abs/2602.12921
Authors: Adib Sakhawat, Shamim Ara Parveen, Md Ruhul Amin, Shamim Al Mahmud, Md Saiful Islam, Tahera Khatun
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 5 figures. Accepted for presentation at LREC 2026 (Language Resources and Evaluation Conference)

Abstract:Figurative language understanding remains a significant challenge for Large Language Models (LLMs), especially for low-resource languages. To address this, we introduce a new idiom dataset, a large-scale, culturally-grounded corpus of 10,361 Bengali idioms. Each idiom is annotated under a comprehensive 19-field schema, established and refined through a deliberative expert consensus process, that captures its semantic, syntactic, cultural, and religious dimensions, providing a rich, structured resource for computational linguistics. To establish a robust benchmark for Bangla figurative language understanding, we evaluate 30 state-of-the-art multilingual and instruction-tuned LLMs on the task of inferring figurative meaning. Our results reveal a critical performance gap, with no model surpassing 50% accuracy, a stark contrast to significantly higher human performance (83.4%). This underscores the limitations of existing models in cross-linguistic and cultural reasoning. By releasing the new idiom dataset and benchmark, we provide foundational infrastructure for advancing figurative language understanding and cultural grounding in LLMs for Bengali and other low-resource languages.

[NLP-22] ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset Benchmark LREC2026

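[Quick read]: This paper addresses the challenge that code-switching (CS) poses for Automatic Speech Recognition (ASR) in Vietnamese medical communication, in particular recognizing English medical terms embedded in Vietnamese, a low-resource language. The key to the solution is ViMedCSS, the first Vietnamese medical code-switching speech dataset (34 hours), together with a systematic evaluation of several state-of-the-art ASR models and fine-tuning strategies. Vietnamese-optimized models perform better on general segments, while multilingual pretraining helps capture English insertions; combining the two gives the best balance between overall and code-switched accuracy, offering a practical path for domain adaptation of low-resource, multilingual ASR systems.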

Link: https://arxiv.org/abs/2602.12911
Authors: Tung X. Nguyen, Nhu Vo, Giang-Son Nguyen, Duy Mai Hoang, Chien Dinh Huynh, Inigo Jauregi Unanue, Massimo Piccardi, Wray Buntine, Dung D. Le
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at LREC 2026

Abstract:Code-switching (CS), in which Vietnamese speech uses English words such as drug names or procedures, is a common phenomenon in Vietnamese medical communication. This creates challenges for Automatic Speech Recognition (ASR) systems, especially in low-resource languages like Vietnamese. Most current ASR systems struggle to correctly recognize English medical terms within Vietnamese sentences, and no benchmark addresses this challenge. In this paper, we construct a 34-hour Vietnamese Medical Code-Switching Speech dataset (ViMedCSS) containing 16,576 utterances. Each utterance includes at least one English medical term drawn from a curated bilingual lexicon covering five medical topics. Using this dataset, we evaluate several state-of-the-art ASR models and examine different fine-tuning strategies for improving medical term recognition, to identify the best approach for the dataset. Experimental results show that Vietnamese-optimized models perform better on general segments, while multilingual pretraining helps capture English insertions. The combination of both approaches yields the best balance between overall and code-switched accuracy. This work provides the first benchmark for Vietnamese medical code-switching and offers insights into effective domain adaptation for low-resource, multilingual ASR systems.

[NLP-23] RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training

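[Quick read]: This paper addresses the lack of an efficient evaluation framework for pre-trained Multi-modal Large Language Models (MLLMs), which makes their performance bottlenecks hard to diagnose. Existing evaluation relies on testing after supervised fine-tuning, which is costly; common pre-training metrics cannot quantify perception and reasoning separately; and existing benchmarks are limited in scale or misaligned with pre-training objectives. The authors propose RADAR, built on two components: the Soft Discrimination Score, which tracks ability development without fine-tuning by quantifying how strongly the model prefers the correct answer over distractors, and the Multi-Modal Mixture Benchmark, a zero-shot benchmark of more than 15K samples that unifies authoritative datasets with newly collected ones to cover perception and reasoning. RADAR reveals an asymmetric development of perception and reasoning abilities in pre-trained MLLMs and informs targeted interventions.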

Link: https://arxiv.org/abs/2602.12892
Authors: Yunshuang Nie, Bingqian Lin, Minzhe Niu, Kun Xiang, Jianhua Han, Guowei Huang, Xingyue Quan, Hang Xu, Bokui Chen, Xiaodan Liang
Affiliations: Shenzhen Campus of Sun Yat-sen University; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing; Tsinghua Shenzhen International Graduate School, Tsinghua University; Shanghai Jiao Tong University; Yinwang Intelligent Technology Co., Ltd.; Huawei’s 2012 Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs. Meanwhile, common pre-training metrics cannot quantify a model’s perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre-training objectives. Thus, we propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training. RADAR involves two key components: (1) Soft Discrimination Score, a novel metric for robustly tracking ability development without fine-tuning, based on quantifying nuanced gradations of the model preference for the correct answer over distractors; and (2) Multi-Modal Mixture Benchmark, a new 15K+ sample benchmark for comprehensively evaluating pre-trained MLLMs’ perception and reasoning abilities in a 0-shot manner, where we unify authoritative benchmark datasets and carefully collect new datasets, extending the evaluation scope and addressing the critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perceptual and reasoning capabilities in pretrained MLLMs across diverse factors, including data volume, model size, and pretraining strategy. Our RADAR underscores the need for a decomposed perspective on pre-training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at this https URL.
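
The abstract describes the Soft Discrimination Score only as a graded measure of the model's preference for the correct answer over distractors; the exact formula is not given here. As a rough illustration, assuming per-option log-likelihoods are available and that a softmax over them is an acceptable proxy (this is not the paper's definition, and `soft_preference` is a hypothetical name), a graded preference score could be computed as:

```python
import math

def soft_preference(option_logliks, correct_index):
    """Softmax probability mass on the correct option, given per-option log-likelihoods."""
    m = max(option_logliks)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in option_logliks]
    return exps[correct_index] / sum(exps)

def mean_soft_preference(items):
    """Average graded preference over (option_logliks, correct_index) pairs."""
    return sum(soft_preference(logliks, idx) for logliks, idx in items) / len(items)
```

Unlike 0/1 accuracy, this value moves smoothly as a checkpoint becomes more confident in the right option, and it needs only forward passes rather than fine-tuning or autoregressive decoding.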

[NLP-24] BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models

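[Quick read]: This paper addresses the lack of standardized evaluation for symbolic and temporally compositional reasoning in large language models; existing approaches rely on anecdotal prompts or subjective judgments rather than an objective, controlled benchmark. The key to the solution is BaziQA-Benchmark, a standardized test set of 200 professionally curated multiple-choice problems from the Global Fortune-teller Competition (2021–2025), each requiring structured inference over a fixed symbolic chart and interacting temporal conditions. The benchmark supports objective scoring and controlled comparison across years, domains, and model families, and introduces a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Experiments show that current models outperform chance but still fail systematically on precise temporal localization and multi-condition symbolic judgments, and are highly sensitive to temporal composition and reasoning order.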

Link: https://arxiv.org/abs/2602.12889
Authors: Jiangxi Chen, Qian Liu
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021–2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference this http URL. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.

[NLP-25] Semantic Communities and Boundary-Spanning Lyrics in K-pop: A Graph-Based Unsupervised Analysis

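[Quick read]: This paper addresses three challenges in data-driven analysis of large-scale lyric corpora: the absence of reliable annotations, multilingual content, and heavy stylistic repetition. Existing approaches mostly rely on supervised classification, genre labels, or coarse document-level representations and struggle to uncover latent semantic structure. The key to the solution is an unsupervised graph-based framework that builds a semantic similarity graph from line-level lyric representations and applies community detection, finding stable micro-theme communities without genre, artist, or language supervision. Graph-theoretic bridge metrics then identify boundary-spanning songs, whose structural properties are analyzed: boundary-spanning lyrics show higher lexical diversity and lower repetition, challenging the assumption that hook intensity or repetition drives cross-theme connectivity. The method is language-agnostic and applies to unlabeled cultural text corpora.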

Link: https://arxiv.org/abs/2602.12881
Authors: Oktay Karakuş
Affiliations: unknown
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments:

Abstract:Large-scale lyric corpora present unique challenges for data-driven analysis, including the absence of reliable annotations, multilingual content, and high levels of stylistic repetition. Most existing approaches rely on supervised classification, genre labels, or coarse document-level representations, limiting their ability to uncover latent semantic structure. We present a graph-based framework for unsupervised discovery and evaluation of semantic communities in K-pop lyrics using line-level semantic representations. By constructing a similarity graph over lyric texts and applying community detection, we uncover stable micro-theme communities without genre, artist, or language supervision. We further identify boundary-spanning songs via graph-theoretic bridge metrics and analyse their structural properties. Across multiple robustness settings, boundary-spanning lyrics exhibit higher lexical diversity and lower repetition compared to core community members, challenging the assumption that hook intensity or repetition drives cross-theme connectivity. Our framework is language-agnostic and applicable to unlabeled cultural text corpora.
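
To make the pipeline in the abstract concrete, here is a toy sketch of a similarity graph, community assignment, and a bridge-style score. It makes several assumptions that are not from the paper: connected components on a strict-threshold graph stand in for community detection, the bridge score counts other communities reached through a looser-threshold graph, and all function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_graph(vectors, threshold):
    """Undirected graph as adjacency sets: an edge when cosine similarity >= threshold."""
    graph = {i: set() for i in range(len(vectors))}
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph

def communities(graph):
    """Connected components, used here as a simple stand-in for community detection."""
    label, seen = {}, set()
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            label[node] = start  # a component is named after its first node
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return label

def bridge_score(node, loose_graph, label):
    """Number of other communities a node touches through looser-threshold edges."""
    return len({label[nb] for nb in loose_graph[node]} - {label[node]})
```

A song whose embedding sits between two themes ends up with a high bridge score even though it belongs to neither tight community, which is the kind of boundary-spanning item the abstract analyzes.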

[NLP-26] MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

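[Quick read]: This paper addresses the limitations of existing benchmarks for evaluating psychiatric diagnostic decision-making in large language models (LLMs): they rely largely on social media data and cannot accurately measure adherence to DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition) criteria or differential diagnosis ability. The key to the solution is MentalKG, a psychiatrist-built and validated knowledge graph that encodes DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using it as a gold-standard logical backbone, the authors generate 24,750 synthetic clinical cases that vary systematically in information completeness and diagnostic complexity, enabling low-noise and interpretable evaluation.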

Link: https://arxiv.org/abs/2602.12871
Authors: Hoyun Song, Migyeong Kang, Jisu Shin, Jihyun Kim, Chanbi Park, Hangyeol Yoo, Jihyun An, Alice Oh, Jinyoung Han, KyungTae Lim
Affiliations: Korea Advanced Institute of Science and Technology (KAIST); Sungkyunkwan University; Dongguk University Medical Center; Seoul National University of Science and Technology (Seoultech); Samsung Medical Center
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as a golden-standard logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling low-noise and interpretable evaluation. Our experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders. These findings reveal evaluation gaps not captured by existing benchmarks.

[NLP-27] AIWizards at MULTIPRIDE: A Hierarchical Approach to Slur Reclamation Detection

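[Quick read]: This paper addresses the detection of reclaimed slurs: the same lexical item can function as hate speech or as in-group affirmation depending on social identity and context, which challenges existing hate speech detection systems. The key to the solution is a hierarchical approach. First, a weakly supervised large language model (LLM) assigns fuzzy labels to users indicating the likelihood that they belong to the LGBTQ+ community; next, a BERT-like model is trained on these labels to learn latent representations associated with LGBTQ+ identity; finally, this latent space is fused with a newly initialized downstream model, combining user-oriented sociolinguistic signals with the representations of a model pretrained for hate speech detection. The resulting architecture is modular and extensible, performs comparably to a strong baseline on Italian and Spanish data, and offers a route to incorporating finer-grained social context.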

Link: https://arxiv.org/abs/2602.12818
Authors: Luca Tedeschini, Matteo Fasulo
Affiliations: Villanova.ai S.P.A; Swiss Data Science Center; ETH Zürich; Department of Computer Science and Engineering (DISI); University of Bologna
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Detecting reclaimed slurs represents a fundamental challenge for hate speech detection systems, as the same lexical items can function either as abusive expressions or as in-group affirmations depending on social identity and context. In this work, we address Subtask B of the MultiPRIDE shared task at EVALITA 2026 by proposing a hierarchical approach to modeling the slur reclamation process. Our core assumption is that members of the LGBTQ+ community are more likely, on average, to employ certain slurs in a reclamatory manner. Based on this hypothesis, we decompose the task into two stages. First, using a weakly supervised LLM-based annotation, we assign fuzzy labels to users indicating the likelihood of belonging to the LGBTQ+ community, inferred from the tweet and the user bio. These soft labels are then used to train a BERT-like model to predict community membership, encouraging the model to learn latent representations associated with LGBTQ+ identity. In the second stage, we integrate this latent space with a newly initialized model for the downstream slur reclamation detection task. The intuition is that the first model encodes user-oriented sociolinguistic signals, which are then fused with representations learned by a model pretrained for hate speech detection. Experimental results on Italian and Spanish show that our approach achieves performance statistically comparable to a strong BERT-based baseline, while providing a modular and extensible framework for incorporating sociolinguistic context into hate speech modeling. We argue that more fine-grained hierarchical modeling of user identity and discourse context may further improve the detection of reclaimed language. We release our code at this https URL.

[NLP-28] Left-right asymmetry in predicting brain activity from LLMs' representations emerges with their formal linguistic competence

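[Quick read]: This paper investigates why, as a Large Language Model (LLM) is trained, the ability of its internal activations to predict human brain activity (e.g., fMRI signals) improves more in the left hemisphere than in the right. Comparing the OLMo-2 7B model at various training checkpoints with fMRI data from English participants, the authors find that the left-right asymmetry does not track every capability but co-emerges specifically with the model's formal linguistic competence, that is, its capacity to assign a higher probability to an acceptable sentence than to a grammatically unacceptable one, or to produce well-formed text. It does not correlate with performance on arithmetic, Dyck language tasks, or text-based tasks involving world knowledge and reasoning. The finding also holds for the Pythia model family and for French, indicating that the asymmetry reflects the LLM's progress in learning linguistic patterns.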

Link: https://arxiv.org/abs/2602.12811
Authors: Laurent Bonnasse-Gahot, Christophe Pallier
Affiliations: Centre d’Analyse et de Mathématique Sociales; CNRS, EHESS; Cognitive Neuroimaging Unit; CNRS, INSERM, CEA, Neurospin Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:When humans and large language models (LLMs) process the same text, activations in the LLMs correlate with brain activity measured, e.g., with functional magnetic resonance imaging (fMRI). Moreover, it has been shown that, as the training of an LLM progresses, the performance in predicting brain activity from its internal activations improves more in the left hemisphere than in the right one. The aim of the present work is to understand which kind of competence acquired by the LLMs underlies the emergence of this left-right asymmetry. Using the OLMo-2 7B language model at various training checkpoints and fMRI data from English participants, we compare the evolution of the left-right asymmetry in brain scores alongside performance on several benchmarks. We observe that the asymmetry co-emerges with the formal linguistic abilities of the LLM. These abilities are demonstrated in two ways: by the model’s capacity to assign a higher probability to an acceptable sentence than to a grammatically unacceptable one within a minimal contrasting pair, or its ability to produce well-formed text. On the opposite, the left-right asymmetry does not correlate with the performance on arithmetic or Dyck language tasks; nor with text-based tasks involving world knowledge and reasoning. We generalize these results to another family of LLMs (Pythia) and another language, namely French. Our observations indicate that the left-right asymmetry in brain predictivity matches the progress in formal linguistic competence (knowledge of linguistic patterns).
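
The first measure of formal linguistic competence in the abstract, preferring the acceptable sentence in a minimal pair, reduces to a simple accuracy over pairs of sentence log-probabilities. The sketch below assumes those log-probabilities have already been computed by some language model; the function name and input layout are hypothetical.

```python
def minimal_pair_accuracy(pairs):
    """Fraction of minimal pairs where the acceptable sentence scores strictly higher.

    pairs: list of (logprob_acceptable, logprob_unacceptable) tuples, one per pair.
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    return sum(ok > bad for ok, bad in pairs) / len(pairs)
```

Tracking this number across training checkpoints gives the kind of competence curve that the authors compare against the left-right asymmetry in brain scores.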

[NLP-29] RAT-Bench: A Comprehensive Benchmark for Text Anonymization

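[Quick read]: This paper addresses the unclear effectiveness of current text anonymization tools at preventing re-identification. Tools such as Microsoft's Presidio or Anthropic's PII purifier can remove specific direct identifiers, but whether they actually lower re-identification risk in complex settings has not been systematically evaluated. The key to the solution is RAT-Bench, a benchmark that evaluates anonymization tools by re-identification risk: it uses U.S. demographic statistics to generate synthetic text containing a variety of direct and indirect identifiers, and uses a large language model (LLM) as the attacker to infer attributes from the anonymized text, measuring the privacy protection each tool provides while accounting for the disparate impact of identifiers. LLM-based anonymizers cost more to run but offer a better privacy-utility trade-off and work well across languages, giving empirical grounding for future tool design.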

Link: https://arxiv.org/abs/2602.12806
Authors: Nataša Krčo, Zexi Yao, Matthieu Meeus, Yves-Alexandre de Montjoye
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft’s Presidio or Anthropic’s PII purifier. These tools have traditionally been evaluated on their ability to remove specific identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce RAT-Bench, a comprehensive benchmark for text anonymization tools based on re-identification risk. Using U.S. demographic statistics, we generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We evaluate a range of NER- and LLM-based text anonymization tools and, based on the attributes an LLM-based attacker is able to correctly infer from the anonymized text, we report the risk of re-identification in the U.S. population, while properly accounting for the disparate impact of identifiers. We find that, while capabilities vary widely, even the best tools are far from perfect in particular when direct identifiers are not written in standard ways and when indirect identifiers enable re-identification. Overall we find LLM-based anonymizers, including new iterative anonymizers, to provide a better privacy-utility trade-off albeit at a higher computational cost. Importantly, we also find them to work well across languages. We conclude with recommendations for future anonymization tools and will release the benchmark and encourage community efforts to expand it, in particular to other geographies.
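
The abstract ties re-identification risk to the attributes an attacker correctly infers and to population statistics, without giving the estimator. One simple way to picture the idea, assuming a toy population table and a 1/k risk (k being the number of people consistent with the inferred attributes), is sketched below; this is an illustration and not RAT-Bench's actual metric.

```python
def reidentification_risk(population, inferred):
    """Return 1/k, where k counts population records consistent with `inferred`.

    population: list of dicts, a toy stand-in for demographic statistics
    inferred:   dict of attribute -> value that the attacker recovered correctly
    Returns 0.0 when no record matches (the inferred profile is not in the table).
    """
    k = sum(all(person.get(a) == v for a, v in inferred.items()) for person in population)
    return 1.0 / k if k else 0.0
```

The sketch also shows why indirect identifiers matter: each additional attribute that survives anonymization shrinks k, so the risk can reach 1.0 even when no name or ID number is left in the text.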

[NLP-30] Aspect-Based Sentiment Analysis for Future Tourism Experiences: A BERT-MoE Framework for Persian User Reviews

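[Quick read]: This paper addresses aspect-based sentiment analysis (ABSA) of tourism user reviews in Persian, a low-resource language, with attention to model efficiency and routing collapse. The key to the solution is a hybrid BERT-based model with Top-K routing and auxiliary losses that mitigate routing collapse and improve computational efficiency. On a dataset of 58,473 preprocessed reviews the model achieves a weighted F1-score of 90.6%, above baseline BERT (89.25%), while reducing GPU power consumption by 39%, which supports sustainable AI deployment.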

Link: https://arxiv.org/abs/2602.12778
Authors: Hamidreza Kazemi Taskooh, Taha Zare Harofte
Affiliations: Iran University of Science and Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 25 pages, 12 figures, 4 tables

Abstract:This study advances aspect-based sentiment analysis (ABSA) for Persian-language user reviews in the tourism domain, addressing challenges of low-resource languages. We propose a hybrid BERT-based model with Top-K routing and auxiliary losses to mitigate routing collapse and improve efficiency. The pipeline includes: (1) overall sentiment classification using BERT on 9,558 labeled reviews, (2) multi-label aspect extraction for six tourism-related aspects (host, price, location, amenities, cleanliness, connectivity), and (3) integrated ABSA with dynamic routing. The dataset consists of 58,473 preprocessed reviews from the Iranian accommodation platform Jabama, manually annotated for aspects and sentiments. The proposed model achieves a weighted F1-score of 90.6% for ABSA, outperforming baseline BERT (89.25%) and a standard hybrid approach (85.7%). Key efficiency gains include a 39% reduction in GPU power consumption compared to dense BERT, supporting sustainable AI deployment in alignment with UN SDGs 9 and 12. Analysis reveals high mention rates for cleanliness and amenities as critical aspects. This is the first ABSA study focused on Persian tourism reviews, and we release the annotated dataset to facilitate future multilingual NLP research in tourism.
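
Top-K routing and an auxiliary loss against routing collapse can be illustrated without any deep learning library. The sketch below uses the load-balancing form popularized by Switch Transformer (number of experts times the sum over experts of dispatch fraction times mean router probability); whether the paper uses this exact loss is an assumption, and the function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k best experts, weights renormalized to 1."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def load_balance_loss(batch_logits, k):
    """Auxiliary loss: n_experts * sum_i(dispatch_fraction_i * mean_router_prob_i).

    Equals 1.0 when routing is perfectly uniform and grows as tokens pile onto few experts.
    """
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / len(batch_logits)
        for i, _ in top_k_route(logits, k):
            counts[i] += 1
    slots = len(batch_logits) * k  # total routing slots in the batch
    return n_experts * sum((c / slots) * p for c, p in zip(counts, mean_prob))
```

Adding a small multiple of this loss to the task loss penalizes the router when every token is sent to the same expert, which is the collapse the abstract refers to.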

[NLP-31] Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks LREC2026

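[Quick read]: This paper addresses the limitations of standard evaluation in Natural Language Processing (NLP): it typically shows only that system A is better on average than system B, gives little guidance on how to improve a model, and does not guarantee that the ranking holds on out-of-distribution data. The key to the solution is an error-analysis-based evaluation methodology for sequence labeling tasks. Instead of gathering large amounts of real-world in-distribution data, it relies on a small handcrafted test set of linguistically motivated examples that exhaustively covers the span attributes a system may encounter in practice (such as shape, length, casing, and sentence position). The method yields both quantitative and qualitative diagnostic information, helps identify systematic weaknesses, and predicts model performance on external datasets (median correlation 0.85), making it diagnostic, actionable, and predictive.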

Link: https://arxiv.org/abs/2602.12759
Authors: Elena Alvarez-Mellado, Julio Gonzalo
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at LREC 2026

Abstract:Standard evaluation in NLP typically indicates that system A is better on average than system B, but it provides little info on how to improve performance and, what is worse, it should not come as a surprise if B ends up being better than A on outside data. We propose an evaluation methodology for sequence labeling tasks grounded on error analysis that provides both quantitative and qualitative information on where systems must be improved and predicts how models will perform on a different distribution. The key is to create test sets that, contrary to common practice, do not rely on gathering large amounts of real-world in-distribution scraped data, but consists in handcrafting a small set of linguistically motivated examples that exhaustively cover the range of span attributes (such as shape, length, casing, sentence position, etc.) a system may encounter in the wild. We demonstrate this methodology on a benchmark for anglicism identification in Spanish. Our methodology provides results that are diagnostic (because they help identify systematic weaknesses in performance), actionable (because they can inform which model is better suited for a given scenario) and predictive: our method predicts model performance on external datasets with a median correlation of 0.85.

[NLP-32] Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting ICASSP2026

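[Quick read]: This paper addresses two weaknesses of self-supervised speech models: poor generalization to new languages and forgetting of previously acquired knowledge during continual training. The core of the solution is the Lamer-SSL framework, which introduces a Layer-Aware MixturE of LoRA Experts (Lamer) module that flexibly balances shared and language-specific representations and assigns more experts to deeper layers, where semantic information is richer. A replay strategy retains prior knowledge using minimal data, so the model extends to new languages while maintaining performance on earlier ones with only 2.14% of parameters trainable.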

Link: https://arxiv.org/abs/2602.12746
Authors: Jing Xu, Minglin Wu, Xueyuan Chen, Xixin Wu, Helen Meng
Affiliations: unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted by ICASSP 2026

Abstract:Despite their impressive performance, self-supervised speech models often struggle to generalize to new languages and tend to forget previously acquired knowledge during continual training. To address this, we propose Lamer-SSL, a parameter-efficient framework that integrates a Layer-Aware MixturE of LoRA Experts (Lamer) module with a replay strategy. The Lamer module enables flexible balancing between shared and language-specific representations, while layer-aware expert allocation assigns more experts to deeper layers where semantic information is richer. Meanwhile, the replay strategy retains prior knowledge using minimal data, mitigating forgetting during continual training. Experiments on automatic speech recognition (ASR) and language identification (LID) demonstrate that Lamer-SSL extends self-supervised models to new languages effectively while maintaining strong performance on previously learned languages with only 2.14% parameters being trainable.
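
The abstract states that layer-aware allocation gives deeper layers more experts but does not give the schedule. As an illustration only, assuming a linear ramp from a minimum to a maximum expert count (the function name and the linear shape are both hypothetical), the allocation could be expressed as:

```python
def experts_per_layer(num_layers, min_experts, max_experts):
    """Linearly increase the number of LoRA experts from the first to the last layer."""
    if num_layers == 1:
        return [max_experts]
    span = max_experts - min_experts
    return [min_experts + round(span * i / (num_layers - 1)) for i in range(num_layers)]
```

For a 4-layer toy encoder, `experts_per_layer(4, 2, 8)` yields `[2, 4, 6, 8]`, so most of the trainable expert capacity sits in the upper layers.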

[NLP-33] VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

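[Quick read]: This paper addresses the difficulty of long-context tasks in multimodal retrieval-augmented reasoning, especially in iterative reasoning, where traditional Retrieval-augmented Generation (RAG) methods built on linear interaction histories cope poorly with visual data that is information-sparse yet token-heavy. The key to the solution is the VimRAG framework, which models the reasoning process as a dynamic directed acyclic graph that structures agent states and retrieved multimodal evidence. On top of this, Graph-Modulated Visual Memory Encoding evaluates the importance of memory nodes by their topological position, allocating high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. Graph-Guided Policy Optimization then prunes memory nodes associated with redundant actions, disentangling step-wise validity from trajectory-level rewards and enabling fine-grained credit assignment, which improves performance on multimodal RAG benchmarks.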

Link: https://arxiv.org/abs/2602.12735
Authors: Qiuchen Wang, Shihang Wang, Yu Zeng, Qiang Zhang, Fanrui Zhang, Zhuoning Guo, Bosi Zhang, Wenxuan Huang, Lin Chen, Zehui Chen, Pengjun Xie, Ruixue Ding
Affiliations: Tongyi Lab, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks. The code is available at this https URL.
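
The idea of scoring memory nodes by topological position and spending visual tokens accordingly can be pictured with a small DAG. The sketch below assumes edges point from a node to the later nodes that build on it, and uses "1 + number of descendants" as the importance measure; both choices, and all names, are illustrative assumptions rather than the paper's mechanism.

```python
def descendants(dag, node):
    """All nodes reachable from `node` in a DAG given as {node: [children]}."""
    seen, stack = set(), list(dag.get(node, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(dag.get(cur, []))
    return seen

def allocate_tokens(dag, total_tokens, min_tokens=0):
    """Split a token budget across nodes in proportion to 1 + number of descendants.

    Nodes whose share falls below `min_tokens` are dropped from the memory.
    """
    weight = {n: 1 + len(descendants(dag, n)) for n in dag}
    total_weight = sum(weight.values())
    budget = {n: total_tokens * w // total_weight for n, w in weight.items()}
    return {n: b for n, b in budget.items() if b >= min_tokens}
```

Evidence that later reasoning steps depend on keeps a large, high-resolution share, while dead-end retrievals are compressed or removed, which bounds the context size as the interaction grows.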

[NLP-34] ReFilter: Improving Robustness of Retrieval-Augmented Generation via Gated Filter

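[Quick read]: This paper addresses the interference from redundant content and the rising inference cost that arise in Retrieval-Augmented Generation (RAG) as the number of retrieved candidates grows. Existing internal fusion methods (query-based, parametric, and latent-based fusion) work well at modest retrieval scales but struggle to filter irrelevant or repeated content at large scales, which hurts both performance and efficiency. The key to the solution is ReFilter, a latent-based fusion framework that performs token-level filtering and fusion through three components: a context encoder that extracts context features, a gated filter that weights each token by relevance, and a token fusion module that injects the weighted features into the hidden states of Large Language Models (LLMs), preserving evidence coverage while improving generation quality and inference efficiency.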

Link: https://arxiv.org/abs/2602.12709
Authors: Yixin Chen, Ying Xiong, Shangyu Wu, Xiangrui Ke, Nan Guan, Chun Jason Xue
Affiliations: City University of Hong Kong; MBZUAI; University of Waterloo
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-augmented generation (RAG) has become a dominant paradigm for grounding large language models (LLMs) with external evidence in knowledge-intensive question answering. A core design choice is how to fuse retrieved samples into the LLMs, where existing internal fusion approaches broadly fall into query-based fusion, parametric fusion, and latent-based fusion. Despite their effectiveness at modest retrieval scales, these methods often fail to scale gracefully as the number of retrieved candidates k increases: Larger k improves evidence coverage, yet realistic top-k retrieval inevitably contains irrelevant or redundant content and increases the inference cost. To address these limitations, we propose ReFilter, a novel latent-based fusion framework that performs token-level filtering and fusion. ReFilter consists of three key components: a context encoder for encoding context features, a gated filter for weighting each token, and a token fusion module for integrating the weighted token feature into the LLM’s hidden states. Our experiments across four general-domain QA benchmarks show that ReFilter consistently achieves the best average performance under both in-domain adaptation and out-of-domain transfer. ReFilter further generalizes to five biomedical QA benchmarks in zero-shot transfer without domain fine-tuning, reaching 70.01% average accuracy with Qwen2.5-14B-Instruct.
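
The gated filter and token fusion steps can be sketched on plain Python lists. This is a minimal illustration under stated assumptions: the gate is a sigmoid of a linear score, and fusion adds the gate-weighted mean of the context features to a hidden state; the paper's actual parameterization is not given in the abstract, and all names here are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def gate_tokens(token_feats, gate_w, gate_b):
    """Per-token gate in (0, 1): sigmoid of a linear score of the token feature."""
    return [sigmoid(sum(w * x for w, x in zip(gate_w, feat)) + gate_b) for feat in token_feats]

def fuse(hidden, token_feats, gates):
    """Add the gate-weighted mean of the context token features to a hidden state."""
    total = sum(gates)
    if total == 0:
        return list(hidden)  # nothing passes the filter, hidden state is unchanged
    pooled = [
        sum(g * feat[d] for g, feat in zip(gates, token_feats)) / total
        for d in range(len(hidden))
    ]
    return [h + p for h, p in zip(hidden, pooled)]
```

Tokens with gates near zero contribute almost nothing to the fused state, so irrelevant retrieved content is suppressed without lengthening the prompt.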

[NLP-35] MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

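[Quick read]: This paper addresses the limited general-purpose medical understanding and reasoning of current medical vision-language foundation models in real clinical settings, in particular incomplete knowledge coverage (e.g., rare diseases) and reasoning that cannot be verified. The key to the solution is an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps, combined with reinforcement learning and tool-augmented agentic training that support multi-step diagnostic reasoning with traceable decision paths, improving reliability and accuracy in practical use.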

链接: https://arxiv.org/abs/2602.12705
作者: Baorong Shi,Bo Cui,Boyuan Jiang,Deli Yu,Fang Qian,Haihua Yang,Huichao Wang,Jiale Chen,Jianfei Pan,Jieqiong Cao,Jinghao Lin,Kai Wu,Lin Yang,Shengsheng Yao,Tao Chen,Xiaojun Xiao,Xiaozhong Ji,Xu Wang,Yijun He,Zhixiong Yang
机构: ByteDance(字节跳动)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.

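[NLP-36] $\mathcal{X}$-KD: General Experiential Knowledge Distillation for Large Language Models

【Quick Read】: This paper addresses the fact that knowledge distillation (KD) for large language models (LLMs) ignores the teacher model's original learning environment. Existing methods usually focus on imitating the teacher's behavior and overlook the reward signal and experience that shaped it, so the student struggles to reproduce what the teacher actually knows. The key to the solution is Experiential Knowledge Distillation ($\mathcal{X}$-KD), a framework built on Approximated Variational Reward Imitation Learning (AVRIL) that jointly models the teacher's original reward function and performs policy distillation, encouraging the student policy to stay consistent with that reward function. The derivation shows that $\mathcal{X}$-KD follows the supervised learning framework and applies to both sequence-level and divergence-based distillation. It outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning, with a better performance-diversity trade-off and better data efficiency.

Link: https://arxiv.org/abs/2602.12674
Authors: Yuang Cai,Yuyu Yuan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract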

Abstract:Knowledge Distillation (KD) for Large Language Models (LLMs) has become increasingly important as models grow in size and complexity. While existing distillation approaches focus on imitating teacher behavior, they often overlook the original learning environment that shaped the teacher’s knowledge. Inspired by the experiential learning theory and inverse reinforcement learning, we propose Experiential Knowledge Distillation ($\mathcal{X}$-KD), a novel and general framework that enables student models to learn in the teacher’s original learning environment. $\mathcal{X}$-KD adopts the Approximated Variational Reward Imitation Learning (AVRIL) framework to jointly model the teacher’s original reward function and perform policy distillation, encouraging consistency between the student policy and the original reward function. Our derivation demonstrates that $\mathcal{X}$-KD follows the supervised learning framework and applies to both sequence-level and divergence-based distillation methods, underlining the simplicity and flexibility of our approach. Empirical results show that $\mathcal{X}$-KD outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning tasks. Additionally, $\mathcal{X}$-KD achieves better performance-diversity trade-off and data efficiency than baseline KD approaches.

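[NLP-37] Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents

【Quick Read】: This paper addresses the inefficiency of large language models (LLMs) that act as autonomous agents in multi-turn decision-making with a fixed cognitive pattern. Existing agents either respond immediately without thinking or reason deeply at every step, so they cannot adapt cognitive depth to the current stage of the task, which wastes compute or causes strategic mistakes on long-horizon tasks. The key to the solution is the CogRouter framework. Grounded in ACT-R theory, it defines four hierarchical cognitive levels and learns to adapt depth in two training stages: Cognition-aware Supervised Fine-tuning (CoSFT) instills stable level-specific behavior, and Cognition-aware Policy Optimization (CoPO) performs step-level credit assignment through confidence-aware advantage reweighting, so that the depth chosen at each step maximizes the confidence of the resulting action. The agent thus learns when strategic planning is needed and when routine execution suffices, improving both task success rate and compute efficiency.

Link: https://arxiv.org/abs/2602.12662
Authors: Ruihan Yang,Fanghua Ye,Xiang We,Ruoqing Zhao,Kang Luo,Xinbo Xu,Bo Zhao,Ruotian Ma,Shanyi Wang,Zhaopeng Tu,Xiaolong Li,Deqing Yang,Linus
Affiliations: Fudan University; Tencent Hunyuan
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract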

Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT-R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two-stage training approach includes Cognition-aware Supervised Fine-tuning (CoSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state-of-the-art performance with superior efficiency. With Qwen2.5-7B, it reaches an 82.3% success rate, outperforming GPT-4o (+40.3%), OpenAI-o3 (+18.3%), and GRPO (+14.0%), while using 62% fewer tokens.

[NLP-38] Learning Ordinal Probabilistic Reward from Preferences ICLR2026

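【Quick Read】: This paper addresses two limitations of existing reward models (RMs) used to align large language models (LLMs) with human values and intentions: Generative Reward Models (GRMs) depend on costly point-wise supervision, while Discriminative Reward Models (DRMs) output uncalibrated relative scores with no probabilistic interpretation. The authors propose a new paradigm, the Probabilistic Reward Model (PRM), which treats reward as a random variable and learns a probability distribution over the quality of each response, so that absolute quality can be modeled. To make the paradigm practical they present a discrete realization, the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. They also introduce Region Flooding Tuning (RgFT), which uses quality-level annotations to guide the model to concentrate probability mass in the corresponding rating sub-region. The method improves reward model accuracy by 2.9% to 7.4% and improves data efficiency, while capturing both relative rankings and absolute quality.

Link: https://arxiv.org/abs/2602.12660
Authors: Longze Chen,Lu Wang,Renke Shan,Ze Gong,Run Luo,Jiaming Li,Jing Luo,Qiyao Wang,Min Yang
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Ritzz-AI
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 5 figures, ICLR 2026

Click to view abstract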

Abstract:Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT). It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions. Experiments on various reward model benchmarks show that our method improves accuracy by 2.9% to 7.4% compared to prior reward models, demonstrating strong performance and data efficiency. Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
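To make the idea of a reward expressed as a distribution over ordinal ratings concrete, here is a minimal sketch. It is not the authors' implementation: the five-level rating scale, the independence of the two responses' ratings, the even split of ties, and the function names `expected_rating` and `prob_prefer` are all assumptions made for illustration.

```python
from itertools import product

def expected_rating(dist):
    """Mean rating when dist[i] is the probability of rating i + 1."""
    return sum((i + 1) * p for i, p in enumerate(dist))

def prob_prefer(dist_a, dist_b):
    """P(rating_a > rating_b) + 0.5 * P(tie).

    Assumption: the two ratings are independent and ties are split evenly.
    """
    win = tie = 0.0
    for (i, pa), (j, pb) in product(enumerate(dist_a), enumerate(dist_b)):
        if i > j:
            win += pa * pb
        elif i == j:
            tie += pa * pb
    return win + 0.5 * tie

# Hypothetical rating distributions over a 1..5 scale for two responses.
resp_a = [0.0, 0.1, 0.2, 0.5, 0.2]
resp_b = [0.3, 0.4, 0.2, 0.1, 0.0]

print(expected_rating(resp_a), expected_rating(resp_b))
print(prob_prefer(resp_a, resp_b))
```

A single distribution yields both an absolute score (the expected rating) and a pairwise preference probability, which is the property the abstract contrasts with uncalibrated relative scores.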

[NLP-39] Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

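【Quick Read】: This paper addresses the loss of output diversity caused by reward-maximizing RL methods that improve the reasoning of large language models (LLMs), and the low sample efficiency of existing GFlowNet-based distribution-matching training. The key to the solution is to reinterpret the partition function learned during GFlowNet training as a per-prompt expected-reward (i.e., online accuracy) signal rather than only a normalizer. Building on this insight, the authors propose Partition Function-Guided RL (PACED-RL), which improves sample efficiency through two mechanisms: prioritizing informative prompts using the accuracy estimates, and a replay scheme prioritized by accuracy-estimate error. Both components reuse information already produced during GFlowNet training, so the compute overhead is amortized into the existing optimization process.

Link: https://arxiv.org/abs/2602.12642
Authors: Dohyung Kim,Minbeom Kim,Jeonghye Kim,Sangmook Lee,Sojeong Rhee,Kyomin Jung
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract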

Abstract:Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error-prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for a more sample efficient distribution-matching training for LLMs.

[NLP-40] CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation LREC2026

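【Quick Read】: This paper addresses the difficulty of evaluating stylistic quality in legal text generated by large language models: existing methods struggle to measure whether output follows the specialised stylistic norms of legal writing. Reference-based metrics conflate semantic accuracy with stylistic fidelity, and pure LLM-as-a-judge evaluation is opaque and inconsistent. The key to the solution is CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that combines two kinds of scores: 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts, so the method captures surface-level features and implicit stylistic norms in an interpretable, reference-free way, aligns substantially better with human judgments, and offers suggestions for improvement.

Link: https://arxiv.org/abs/2602.12639
Authors: Yiran Rex Ma,Yuxiao Ye,Huiyuan Xie
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at LREC 2026

Click to view abstract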

Abstract:Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. In order to improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This hybrid design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (Code and data for CLASE is available at: this https URL).

[NLP-41] Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

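【Quick Read】: This paper addresses the trade-off between accuracy and compute efficiency when running large language model (LLM) inference on Ascend NPUs. As models scale, low-bit floating-point formats such as MXFP and NVFP4 open new opportunities for efficiency, but preserving accuracy remains a challenge. The paper evaluates HiFloat (HiF8 and HiF4), a family of low-bit formats tailored for Ascend NPUs. The key findings are: (1) INT8 suits narrow-range data, while floating-point formats do better on high-variance data, as shown on weight-activation and KV-cache tasks; (2) in 4-bit regimes, HiF4's hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks.

Link: https://arxiv.org/abs/2602.12635
Authors: Pengxiang Zhao,Hui-Ling Zhen,Xing Li,Han Bao,Weizhe Lin,Zhiyuan Yang,Ziwei Yu,Xin Wang,Mingxuan Yuan,Xianzhi Yu,Zhenhua Dong
Affiliations: Huawei Technologies Co., Ltd
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract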

Abstract:As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4’s hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.

[NLP-42] Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

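【Quick Read】: This paper addresses the high computational cost that Multimodal Large Language Models (MLLMs) incur when processing large numbers of vision tokens, and the limits of existing pruning methods, which either operate before the LLM and therefore generalize poorly, or rely on heuristics inside the LLM that are incompatible with FlashAttention. The key to the solution is Attention-Driven Self-Compression (ADSC): vision tokens are uniformly downsampled at selected layers to form bottlenecks, which pushes the model to reorganize and compress information into the remaining tokens using only its own attention mechanism. ADSC needs no score computation, auxiliary modules, or attention modification, and stays fully compatible with FlashAttention. It substantially reduces FLOPs and peak KV-cache memory while preserving performance, and remains robust under high compression ratios.

Link: https://arxiv.org/abs/2602.12618
Authors: Omer Faruk Deniz,Ruiyu Mao,Ruochen Li,Yapeng Tian,Latifur Khan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 2025 IEEE International Conference on Big Data (BigData)

Click to view abstract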

Abstract:Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs or within the LLM using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM’s attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.
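The uniform downsampling step described above can be sketched as follows. This is an illustrative toy rather than the ADSC code: the stride of 2, the choice of layers 2 and 4, and the helper names `downsample` and `sequence_lengths` are assumptions, and the transformer layers themselves are omitted so that only the change in sequence length is visible.

```python
def downsample(seq, stride):
    """Keep every `stride`-th vision token and all text tokens, preserving order.

    `seq` is a list of (token, is_vision) pairs.
    """
    kept, seen = [], 0
    for tok, is_vision in seq:
        if not is_vision or seen % stride == 0:
            kept.append((tok, is_vision))
        if is_vision:
            seen += 1
    return kept

def sequence_lengths(seq, num_layers, compress_at, stride=2):
    """Return the sequence length seen by each layer."""
    lengths = []
    for layer in range(num_layers):
        if layer in compress_at:
            seq = downsample(seq, stride)
        # A real transformer layer would update the hidden states here.
        lengths.append(len(seq))
    return lengths

# 16 hypothetical vision tokens followed by 4 text tokens.
seq = [(f"v{i}", True) for i in range(16)] + [(f"t{i}", False) for i in range(4)]
print(sequence_lengths(seq, num_layers=6, compress_at={2, 4}))
```

Because the selection depends only on token position, no attention scores have to be materialized, which is consistent with the abstract's claim of FlashAttention compatibility.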

[NLP-43] HyperMLP: An Integrated Perspective for Sequence Modeling

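【Quick Read】: This paper questions the usual view of self-attention as probabilistic query-key lookup, which motivates designs that keep normalized attention scores and fixed positional semantics, and asks how to make context selection more flexible and expressive at comparable cost. The key to the solution is a different perspective: an autoregressive attention head is modeled as a two-layer MLP whose weights are generated dynamically from the context history. Under this view the attention scores form an ever-growing hidden representation, and standard activations such as ReLU or the gated linear unit (GLU) implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. On this basis the authors design HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space using a reverse-offset (lag) layout, and which consistently outperform strong softmax-attention baselines under matched parameter budgets.

Link: https://arxiv.org/abs/2602.12601
Authors: Jiecheng Lu,Shihao Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:

Click to view abstract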

Abstract:Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.
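The "attention head as a dynamic two-layer MLP" view can be illustrated with a small sketch. It only shows the reinterpretation stated in the abstract (keys act as first-layer weights, values as second-layer weights, and the activation is swappable); it does not implement HyperMLP or HyperGLU. The function name `head_as_mlp`, the scaling by the square root of the query dimension, and the toy matrices are assumptions.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def relu(xs):
    return [max(x, 0.0) for x in xs]

def head_as_mlp(q, K, V, activation):
    """Two-layer MLP whose weights come from the context.

    First layer: hidden = activation(K q / sqrt(d)), one hidden unit per
    context position. Second layer: out = V^T hidden.
    """
    scale = 1.0 / math.sqrt(len(q))
    hidden = activation([scale * s for s in matvec(K, q)])
    return [sum(h * row[j] for h, row in zip(hidden, V)) for j in range(len(V[0]))]

# Toy context of three positions with 2-dimensional keys and values.
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
q = [2.0, -2.0]

softmax_out = head_as_mlp(q, K, V, softmax)  # standard softmax attention
relu_out = head_as_mlp(q, K, V, relu)        # unnormalized, input-conditioned selection
print(softmax_out, relu_out)
```

With `softmax` the function reproduces ordinary single-head attention; swapping in `relu` drops the normalization and leaves a selection over context positions that no longer sums to one.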

[NLP-44] Discovering Semantic Latent Structures in Psychological Scales: A Response-Free Pathway to Efficient Simplification

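【Quick Read】: This paper addresses the dependence of psychological scale refinement on large-scale response data. Traditional methods such as factor analysis, item response theory, and network psychometrics are rigorous but constrained by sample size and cross-cultural comparability. The key to the solution is a topic-modeling framework for semantic structure analysis: items are encoded with contextual sentence embeddings and grouped by density-based clustering, which discovers latent semantic factors without predefining their number. Class-based term weighting then yields interpretable topic representations that allow semantically adjacent clusters to be merged, and representative items are selected using membership criteria. Scale length is reduced by 60.5% on average while psychometric adequacy is maintained. The method offers a response-free, semantics-driven route to scale simplification whose results show high concordance with the original factor structures.

Link: https://arxiv.org/abs/2602.12575
Authors: Bo Wang,Yuxuan Zhang,Yueqin Hu,Hanchao Hou,Kaiping Peng,Shiguang Ni
Affiliations: Tsinghua University; Beijing Normal University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 78 pages, 20 figures

Click to view abstract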

Abstract:Psychological scale refinement traditionally relies on response-based methods such as factor analysis, item response theory, and network psychometrics to optimize item composition. Although rigorous, these approaches require large samples and may be constrained by data availability and cross-cultural comparability. Recent advances in natural language processing suggest that the semantic structure of questionnaire items may encode latent construct organization, offering a complementary response-free perspective. We introduce a topic-modeling framework that operationalizes semantic latent structure for scale simplification. Items are encoded using contextual sentence embeddings and grouped via density-based clustering to discover latent semantic factors without predefining their number. Class-based term weighting derives interpretable topic representations that approximate constructs and enable merging of semantically adjacent clusters. Representative items are selected using membership criteria within an integrated reduction pipeline. We benchmarked the framework across DASS, IPIP, and EPOCH, evaluating structural recovery, internal consistency, factor congruence, correlation preservation, and reduction efficiency. The proposed method recovered coherent factor-like groupings aligned with established constructs. Selected items reduced scale length by 60.5% on average while maintaining psychometric adequacy. Simplified scales showed high concordance with original factor structures and preserved inter-factor correlations, indicating that semantic latent organization provides a response-free approximation of measurement structure. Our framework formalizes semantic structure as an inspectable front-end for scale construction and reduction. To facilitate adoption, we provide a visualization-supported tool enabling one-click semantic analysis and structured simplification.

[NLP-45] Constraint-Rectified Training for Efficient Chain-of-Thought

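【Quick Read】: This paper addresses the high inference cost and redundant steps ("overthinking") caused by overly long reasoning traces in Chain-of-Thought (CoT) reasoning, while preserving answer accuracy. Existing approaches such as length-aware reward design or prompt-based calibration try to balance length and accuracy, but their heuristics can cause severe accuracy drops and are sensitive to hyperparameters. The paper proposes CRT (Constraint-Rectified Training), a post-training framework based on reference-guarded constrained optimization. CRT alternates between minimizing reasoning length and rectifying accuracy, and triggers rectification only when performance falls below the reference, which gives stable and interpretable pruning of redundant reasoning. A two-stage scheme first discovers the shortest reliable reasoning patterns and then refines accuracy under a learnt length budget, preventing verbose CoT from re-emerging. Experiments show that CRT reduces token usage while keeping answer quality robust, and further analysis shows that it also reduces internal language redundancy.

Link: https://arxiv.org/abs/2602.12526
Authors: Qinhang Wu,Sen Lin,Ming Zhang,Yingbin Liang,Ness B. Shroff
Affiliations: The Ohio State University; University of Houston; Google
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract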

Abstract:Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), especially when combined with reinforcement learning (RL) based post-training methods. While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy, either through length-aware reward design or prompt-based calibration. However, these heuristic-based approaches may suffer from severe accuracy drop and be very sensitive to hyperparameters. To address these problems, we introduce CRT (Constraint-Rectified Training), a principled post-training framework based on reference-guarded constrained optimization, yielding a more stable and interpretable formulation for efficient reasoning. CRT alternates between minimizing reasoning length and rectifying accuracy only when performance falls below the reference, enabling stable and effective pruning of redundant reasoning. We further extend CRT with a two-stage training scheme that first discovers the shortest reliable reasoning patterns and then refines accuracy under a learnt length budget, preventing the re-emergence of verbose CoT. Our comprehensive evaluation shows that this framework consistently reduces token usage while maintaining answer quality at a robust and reliable level. Further analysis reveals that CRT improves reasoning efficiency not only by shortening responses but also by reducing internal language redundancy, leading to a new evaluation metric. Moreover, CRT-based training naturally yields a sequence of intermediate checkpoints that span a spectrum of explanation lengths while preserving correctness, enabling fine-grained control over reasoning verbosity without retraining.
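The alternation rule stated in the abstract (shorten reasoning by default, rectify accuracy only when it drops below the reference) can be shown with a toy simulation. Everything apart from that rule is a hypothetical stand-in: the accuracy curve `toy_accuracy`, the fixed step size, and the idea that rectification simply lengthens the trace are assumptions for illustration, not the paper's optimization procedure.

```python
def choose_objective(accuracy, reference_accuracy):
    """Rectify accuracy only when it falls below the reference; otherwise shorten."""
    return "rectify_accuracy" if accuracy < reference_accuracy else "minimize_length"

def toy_accuracy(length, needed=120, ceiling=0.9):
    """Hypothetical accuracy curve that saturates once the trace has `needed` tokens."""
    return ceiling * min(1.0, length / needed)

def simulate(length=400, reference_accuracy=0.88, step=10, iterations=100):
    """Alternate between the two objectives and record (objective, length) per step."""
    history = []
    for _ in range(iterations):
        objective = choose_objective(toy_accuracy(length), reference_accuracy)
        # Toy update: rectification lengthens the trace, minimization shortens it.
        length += step if objective == "rectify_accuracy" else -step
        history.append((objective, length))
    return history

history = simulate()
print(history[-1])
```

In this toy setting the length falls steadily and then hovers around the shortest value that still meets the reference accuracy, which is the qualitative behaviour the reference-guarded constraint is meant to produce.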

[NLP-46] RBCorr: Response Bias Correction in Language Models

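【Quick Read】: This paper addresses response bias, which is widespread when language models (LMs) answer fixed-response questions. The bias appears as an unjustified preference for particular options and distorts the evaluation of model ability. The key to the solution is a simple, low-cost response bias correction strategy (RBCorr) that adjusts the log-probabilities (LogProbs) of the answer options to remove the bias. Experiments across models, datasets, and prompt formats show that it improves the performance of smaller language models and brings results on closed-response benchmarks closer to the models' true capabilities.

Link: https://arxiv.org/abs/2602.12445
Authors: Om Bhatt,Anna A. Ivanova
Affiliations: University of California, Irvine; Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages (8 pages main text), 4 figures

Click to view abstract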

Abstract:Language models (LMs) are known to be prone to response biases, which present as option preference biases in fixed-response questions. It is therefore imperative to develop low-cost and effective response bias correction methods to improve LM performance and enable more accurate evaluations of model abilities. Here, we propose a simple response bias correction strategy (RBCorr) and test it on 12 open-weight language models using yes-no, entailment, and multiple choice questions. We show that response bias is prevalent in LMs pre-correction and that RBCorr effectively eliminates bias and boosts model performance. We also explore the generalizability of bias behavior across models, datasets, and prompt formats, showing that LogProbs-based correction is highly dependent on all three of these aspects. Overall, RBCorr is an easy-to-use method that can boost the performance of smaller LMs and ensure that LM performance on closed-response benchmarks aligns more closely with their true capabilities.
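The abstract does not spell out the correction procedure, so the following is only one plausible form of a LogProbs-based correction, not the RBCorr algorithm itself. It assumes a calibration set with balanced gold answers, estimates each option's bias as its mean log-probability on that set, and subtracts the bias before taking the argmax. The function names and all numbers are hypothetical.

```python
def estimate_bias(calibration_logprobs):
    """Mean log-probability of each option over a calibration set.

    Assumption: gold answers in the calibration set are balanced, so a higher
    mean for one option reflects preference rather than correctness.
    """
    options = list(calibration_logprobs[0].keys())
    n = len(calibration_logprobs)
    return {o: sum(lp[o] for lp in calibration_logprobs) / n for o in options}

def raw_choice(logprobs):
    """Pick the option with the highest uncorrected log-probability."""
    return max(logprobs, key=lambda o: logprobs[o])

def corrected_choice(logprobs, bias):
    """Pick the option with the highest bias-adjusted log-probability."""
    return max(logprobs, key=lambda o: logprobs[o] - bias[o])

# Hypothetical log-probabilities from a model that leans toward "yes".
calibration = [
    {"yes": -0.2, "no": -1.8},
    {"yes": -0.6, "no": -0.9},
    {"yes": -0.1, "no": -2.5},
    {"yes": -0.7, "no": -0.8},
]
bias = estimate_bias(calibration)

question = {"yes": -0.65, "no": -0.75}
print(raw_choice(question), corrected_choice(question, bias))
```

Since the abstract reports that such corrections depend strongly on model, dataset, and prompt format, a bias estimate of this kind would presumably need to be recomputed for each combination.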

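[NLP-47] RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty ICLR2026

【Quick Read】: This paper addresses the inability of existing large language model (LLM) benchmarks to differentiate question difficulty, which limits fine-grained characterization of model capability. The key to the solution is the RankLLM framework, which uses difficulty as the primary criterion and jointly models model competency and question difficulty through bidirectional score propagation between models and questions: a model earns competency by answering questions correctly, and a question's difficulty rises when it challenges a model. Evaluating 30 models on 35,550 questions across multiple domains, RankLLM reaches 90% agreement with human judgments and consistently outperforms strong baselines such as IRT, with strong stability, fast convergence, and high computational efficiency.

Link: https://arxiv.org/abs/2602.12424
Authors: Ziqian Zhang,Xingjian Hu,Yue Huang,Kai Zhang,Ruoxi Chen,Yixin Liu,Qingsong Wen,Kaidi Xu,Xiangliang Zhang,Neil Zhenqiang Gong,Lichao Sun
Affiliations: Lehigh University; University of Notre Dame; Zhejiang Wanli University; Squirrel Ai Learning; City University of Hong Kong; Duke University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 32 pages, 9 figures. Accepted by ICLR 2026

Click to view abstract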

Abstract:Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models’ capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM’s core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question’s difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
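The bidirectional propagation intuition can be sketched as a small fixed-point iteration. This is an illustration of the stated intuition, not the RankLLM update rule: the normalization, the PageRank-style damping term (added here only to keep the iteration from collapsing to zero), and the function name `propagate` are assumptions.

```python
def propagate(correct, iterations=50, damping=0.85):
    """Jointly score models and questions.

    correct[m][q] is True when model m answered question q correctly.
    A question gains difficulty from each model it defeats, weighted by that
    model's competency; a model gains competency from each question it
    answers, weighted by that question's difficulty.
    """
    n_models, n_questions = len(correct), len(correct[0])
    competency = [1.0 / n_models] * n_models
    difficulty = [1.0 / n_questions] * n_questions
    for _ in range(iterations):
        raw = [sum(competency[m] for m in range(n_models) if not correct[m][q])
               for q in range(n_questions)]
        total = sum(raw) or 1.0
        difficulty = [(1 - damping) / n_questions + damping * r / total for r in raw]
        raw = [sum(difficulty[q] for q in range(n_questions) if correct[m][q])
               for m in range(n_models)]
        total = sum(raw) or 1.0
        competency = [(1 - damping) / n_models + damping * r / total for r in raw]
    return competency, difficulty

# Hypothetical results: rows are models, columns are questions.
correct = [
    [True, True, True, False],
    [True, True, False, False],
    [True, False, False, False],
]
competency, difficulty = propagate(correct)
print(competency, difficulty)
```

Unlike plain accuracy, a scheme of this kind gives more credit for answering questions that defeat otherwise strong models, which is the difficulty-aware weighting the abstract describes.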

[NLP-48] Sparse Autoencoders are Capable LLM Jailbreak Mitigators

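【Quick Read】: This paper addresses jailbreak attacks on large language models (LLMs), in which a crafted context induces the model to produce harmful content. The key to the solution is Context-Conditioned Delta Steering (CC-Delta), a defense based on sparse autoencoders (SAEs) applied to activation space. It compares token-level representations of the same harmful request with and without jailbreak context to identify jailbreak-relevant sparse features, selects key features by statistical testing, and applies mean-shift steering in the SAE latent space at inference time. Compared with defenses that operate in dense activation space, CC-Delta achieves comparable or better safety-utility trade-offs across several instruction-tuned models and jailbreak attacks, with the clearest advantage on out-of-distribution attacks.

Link: https://arxiv.org/abs/2602.12418
Authors: Yannick Assogba,Jacopo Cortellazzi,Javier Abad,Pau Rodriguez,Xavier Suau,Arno Blaas
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 26 pages, 14 figures, 3 tables

Click to view abstract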

Abstract:Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, and particularly against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training.

[NLP-49] propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

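【Quick Read】: This paper addresses the limits of relying on a single quality score when curating pretraining data for large language models (LLMs): multiple quality dimensions are conflated, filtering is inflexible, and the score is not interpretable. The key to the solution is propella-1, a family of small multilingual language models (0.6B to 4B parameters) that annotate text documents across 18 properties in six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and output JSON annotations that conform to a predefined schema. This enables multi-dimensional, compositional analysis of data quality that a single score cannot capture.

Link: https://arxiv.org/abs/2602.12414
Authors: Maximilian Idahl,Benedikt Droste,Björn Plüster,Jan Philipp Harries
Affiliations: ellamind; Leibniz University Hannover
Categories: Computation and Language (cs.CL)
Comments: Release: this https URL

Click to view abstract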

Abstract:Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.

[NLP-50] Evolving Beyond Snapshots: Harmonizing Structure and Sequence via Entity State Tuning for Temporal Knowledge Graph Forecasting

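【Quick Read】: This paper addresses the decay of long-term dependencies and the fragmented memory that arise in Temporal Knowledge Graph (TKG) forecasting because most existing methods are stateless. They recompute entity representations at every timestamp from a limited query window, which leads to "episodic amnesia" and makes long-term evolution across snapshots hard to model. The key to the solution is Entity State Tuning (EST), an encoder-agnostic framework whose core mechanisms are: 1) a global state buffer that persistently stores entity states; 2) a closed-loop design that progressively aligns structural evidence with sequential signals; and 3) a dual-track evolution mechanism that keeps updating states while balancing plasticity against stability. Experiments show that EST consistently improves diverse backbones, confirming the importance of state persistence for long-horizon TKG forecasting.

Link: https://arxiv.org/abs/2602.12389
Authors: Siyuan Li,Yunjia Wu,Yiyong Xiao,Pingyang Huang,Peize Li,Ruitong Liu,Yan Wen,Te Sun,Fangyi Pei
Affiliations: Dalian University of Technology; Tencent Music; King’s College London; Peng Cheng Laboratory; Peking University; Beijing Institute of Technology; Shanghai Jiao Tong University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract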

Abstract:Temporal knowledge graph (TKG) forecasting requires predicting future facts by jointly modeling structural dependencies within each snapshot and temporal evolution across snapshots. However, most existing methods are stateless: they recompute entity representations at each timestamp from a limited query window, leading to episodic amnesia and rapid decay of long-term dependencies. To address this limitation, we propose Entity State Tuning (EST), an encoder-agnostic framework that endows TKG forecasters with persistent and continuously evolving entity states. EST maintains a global state buffer and progressively aligns structural evidence with sequential signals via a closed-loop design. Specifically, a topology-aware state perceiver first injects entity-state priors into structural encoding. Then, a unified temporal context module aggregates the state-enhanced events with a pluggable sequence backbone. Subsequently, a dual-track evolution mechanism writes the updated context back to the global entity state memory, balancing plasticity against stability. Experiments on multiple benchmarks show that EST consistently improves diverse backbones and achieves state-of-the-art performance, highlighting the importance of state persistence for long-horizon TKG forecasting. The code is published at this https URL

[NLP-51] Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática

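[Quick Read]: This paper addresses the practical deployment of Multimodal Large Language Models (MLLMs), including how to integrate information from different modalities such as vision and language and how to build scalable multimodal processing pipelines. The key to its approach is a systematic presentation of the fundamentals of MLLMs and representative models, together with practical preprocessing methods, prompt engineering strategies, and techniques for building multimodal pipelines with LangChain and LangGraph, giving researchers and developers actionable technical paths and tooling.

Link: https://arxiv.org/abs/2602.12302
Authors: Neemias da Silva, Júlio C. W. Scholz, John Harrison, Marina Borges, Paulo Ávila, Frances A Santos, Myriam Delgado, Rodrigo Minetto, Thiago H Silva
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: in Portuguese language. Accepted book chapter - Webmedia 2025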

Abstract:Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception skills in modalities such as image and audio, representing a key advancement in contemporary AI. This chapter presents the main fundamentals of MLLMs and emblematic models. Practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph are also explored. For further practical study, supplementary material is publicly available online: this https URL. Finally, the chapter discusses the challenges and highlights promising trends.

[NLP-52] Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction

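[Quick Read]: This paper addresses the tendency of end-to-end automatic speech recognition (ASR) systems to misrecognize domain-specific phrases such as named entities, errors that can be catastrophic for downstream tasks. The key to the solution is a named entity correction framework based on Retrieval-Augmented Generation (RAG) with two core components: first, a Rephrasing Language Model (RLM) performs named entity recognition, followed by candidate entity retrieval using a phonetic-level edit distance; second, a Self-Taught Reasoning Model with Adaptive Chain-of-Thought (A-STAR) dynamically adjusts its reasoning depth to task difficulty, making fuller use of the complex reasoning abilities of large language models (LLMs). Experiments show relative reductions in named entity character error rate of 17.96% on AISHELL-1 and 34.42% on the Homophone dataset.

Link: https://arxiv.org/abs/2602.12287
Authors: Junjie An, Jingguang Tian, Tianyi Wang, Yu Gao, Xiaofeng Mou, Yi Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: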

Abstract:End-to-end automatic speech recognition (ASR) systems frequently misrecognize domain-specific phrases like named entities, which can cause catastrophic failures in downstream tasks. A new family of named entity correction methods based on large language models (LLMs) has recently emerged. However, these approaches have yet to fully exploit the sophisticated reasoning capabilities inherent to LLMs. To bridge this gap, we propose a novel retrieval-augmented generation framework for correcting named entity errors in ASR. Our approach consists of two key components: (1) a rephrasing language model (RLM) for named entity recognition, followed by candidate retrieval using a phonetic-level edit distance; and (2) a novel self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts the depth of its reasoning based on task difficulty. Experiments on the AISHELL-1 and Homophone datasets demonstrate the effectiveness of our method, which achieves relative reductions in the named entity character error rate of 17.96% and 34.42%, respectively, compared to a strong baseline.
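
The candidate retrieval step in component (1) can be illustrated with a small sketch. It assumes entity mentions have already been converted to phoneme sequences and ranks a lexicon by Levenshtein distance over phonemes. The toy lexicon, the pinyin-like phoneme symbols, and the function names `edit_distance` and `retrieve_candidates` are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def retrieve_candidates(query: Sequence[str],
                        lexicon: Dict[str, List[str]],
                        top_k: int = 2) -> List[Tuple[str, int]]:
    """Return the top_k lexicon entries closest to the query phonemes."""
    scored = sorted((edit_distance(query, phones), name)
                    for name, phones in lexicon.items())
    return [(name, dist) for dist, name in scored[:top_k]]


# Hypothetical lexicon: entity name -> phoneme sequence.
LEXICON = {
    "Zhang San": ["zh", "ang", "s", "an"],
    "Li Si": ["l", "i", "s", "i"],
}

# Phonemes of a misrecognized mention ("sh" heard instead of "s").
asr_phonemes = ["zh", "ang", "sh", "an"]
candidates = retrieve_candidates(asr_phonemes, LEXICON)
```

In a pipeline of the kind the abstract describes, the retrieved candidates would then be passed to the reasoning model, which decides whether and how to correct the mention.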

[NLP-53] From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness AAAI2026

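[Quick Read]: This paper asks whether demographic-based persona assignments affect the behavior and decision reliability of large language models (LLMs) deployed as autonomous agents on real-world tasks, and in particular whether they degrade performance or introduce implicit bias. Through a systematic case study, it shows that task-irrelevant persona cues can substantially degrade LLM agent performance across strategic reasoning, planning, and technical operations, by up to 26.2%, and that the effect appears across task types and model architectures. This exposes an overlooked vulnerability in current LLM agent systems: persona conditioning and simple prompt injections can make agent behavior unreliable.

Link: https://arxiv.org/abs/2602.12285
Authors: Linbo Cao, Lihao Sun, Yang Yue
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the AAAI 2026 TrustAgent Workshop. 6 pages, 4 figures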

Abstract:Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real-world impacts beyond text generation. While persona-induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored, even though such effects pose more direct operational risks. In this work, we present the first systematic case study showing that demographic-based persona assignments can alter LLM agents’ behavior and degrade performance across diverse domains. Evaluating widely deployed models on agentic benchmarks spanning strategic reasoning, planning, and technical operations, we uncover substantial performance variations - up to 26.2% degradation, driven by task-irrelevant persona cues. These shifts appear across task types and model architectures, indicating that persona conditioning and simple prompt injections can distort an agent’s decision-making reliability. Our findings reveal an overlooked vulnerability in current LLM agentic systems: persona assignments can introduce implicit biases and increase behavioral volatility, raising concerns for the safe and robust deployment of LLM agents.

[NLP-54] A Lightweight LLM Framework for Disaster Humanitarian Information Classification

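[Quick Read]: This paper addresses the high computational cost and low efficiency of deploying large language models (LLMs) for disaster tweet classification in resource-constrained emergency settings. Its core solution is a lightweight, low-cost framework built on Parameter-Efficient Fine-Tuning (PEFT), specifically LoRA (Low-Rank Adaptation) and its memory-efficient variant QLoRA. LoRA reaches 79.62% accuracy on humanitarian information classification while training only about 2% of the parameters, a gain of 37.79% over zero-shot prompting. Using a unified dual-task benchmark built by integrating and normalizing the HumAID dataset, the paper systematically evaluates prompting strategies, LoRA, and Retrieval-Augmented Generation (RAG), and finds that RAG actually degrades performance because of label noise in the retrieved examples. The result is a reproducible path to reliable crisis intelligence systems under limited compute.

Link: https://arxiv.org/abs/2602.12284
Authors: Han Jinzhen, Kim Jisung, Yang Jong Soo, Yun Hong Sik
Affiliations: Sungkyunkwan University (成均馆大学); University of Leeds (利兹大学)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: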

Abstract:Timely classification of humanitarian information from social media is critical for effective disaster response. However, deploying large language models (LLMs) for this task faces challenges in resource-constrained emergency settings. This paper develops a lightweight, cost-effective framework for disaster tweet classification using parameter-efficient fine-tuning. We construct a unified experimental corpus by integrating and normalizing the HumAID dataset (76,484 tweets across 19 disaster events) into a dual-task benchmark: humanitarian information categorization and event type identification. Through systematic evaluation of prompting strategies, LoRA fine-tuning, and retrieval-augmented generation (RAG) on Llama 3.1 8B, we demonstrate that: (1) LoRA achieves 79.62% humanitarian classification accuracy (+37.79% over zero-shot) while training only ~2% of parameters; (2) QLoRA enables efficient deployment with 99.4% of LoRA performance at 50% memory cost; (3) contrary to common assumptions, RAG strategies degrade fine-tuned model performance due to label noise from retrieved examples. These findings establish a practical, reproducible pipeline for building reliable crisis intelligence systems with limited computational resources.
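
For readers unfamiliar with LoRA, the following is a minimal pure-Python sketch of the standard low-rank update, in which a frozen weight matrix W is augmented by a trainable product B A scaled by alpha / r. The matrix sizes, rank, and alpha below are illustrative assumptions and are not the configuration used in the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    r = len(A)                         # rank = number of rows of A
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank trainable path
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen 2x2 weight (toy values)
A = [[1.0, 1.0]]               # r x d_in with r = 1
B_zero = [[0.0], [0.0]]        # d_out x r, zero-initialized
x = [1.0, 1.0]

# With B = 0 the adapted layer reproduces the frozen layer exactly.
y_init = lora_forward(W, A, B_zero, x, alpha=2.0)
# After a (hypothetical) update to B, the output shifts along the low-rank direction.
y_tuned = lora_forward(W, A, [[1.0], [0.0]], x, alpha=2.0)
# Trainable fraction for one 4096 x 4096 matrix at rank 16 (illustrative sizes).
fraction = lora_trainable_params(4096, 4096, 16) / (4096 * 4096)
```

The fraction computed here is for a single matrix. The overall trainable share of a model, such as the roughly 2% reported in the abstract, depends on the rank and on which modules are adapted, neither of which is stated here.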

[NLP-55] Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR ICASSP2026

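[Quick Read]: This paper addresses the complexity and resource cost that come from relying on external speech encoders or pretrained large language models (LLMs) in automatic speech recognition (ASR), and proposes a decoder-only architecture that needs neither. Its core solution is a modality-aware sparse mixture of experts (MoE): speech and text are assigned to disjoint expert pools with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks that are bidirectional for speech and causal for text. Training combines a CTC loss with label-smoothed cross-entropy, improving recognition accuracy while remaining parameter-efficient; experiments on LibriSpeech and Common Voice show gains over strong baselines.

Link: https://arxiv.org/abs/2602.12546
Authors: Jaeyoung Lee, Masato Mimura
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to ICASSP 2026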

Abstract:We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLM). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M AED baseline on Librispeech (2.8% vs. 3.2% test-clean; 5.6% vs. 6.0% test-other). On Common Voice 16.1 with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR that surpasses strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment/adaptation modules.
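
The routing rule described above (disjoint pools, hard routing by modality, top-1 selection within the pool) can be sketched as follows. The expert ids, the toy expert functions, and the gate scores are illustrative assumptions; the paper's actual gating network and expert architecture are not specified in the abstract.

```python
from typing import Callable, Dict, List, Tuple

# Disjoint expert pools per modality (expert ids are illustrative).
POOLS: Dict[str, List[int]] = {"speech": [0, 1], "text": [2, 3]}


def route_token(modality: str, gate_scores: List[float]) -> int:
    """Hard-route by modality, then pick the top-1 expert within that pool."""
    pool = POOLS[modality]
    return max(pool, key=lambda e: gate_scores[e])


# Toy experts standing in for feed-forward blocks.
experts: List[Callable[[float], float]] = [
    lambda v: v * 2,    # speech expert 0
    lambda v: v + 10,   # speech expert 1
    lambda v: -v,       # text expert 2
    lambda v: v - 1,    # text expert 3
]

# Each token: (modality, value, gate scores over all four experts).
tokens: List[Tuple[str, float, List[float]]] = [
    ("speech", 1.0, [0.1, 0.3, 0.9, 0.2]),  # expert 2 scores highest but is text-only
    ("text", 2.0, [0.8, 0.1, 0.4, 0.6]),    # expert 0 scores highest but is speech-only
]

routes = [route_token(m, g) for m, _, g in tokens]
outputs = [experts[e](v) for e, (_, v, _) in zip(routes, tokens)]
```

Because routing is restricted to the token's own pool, a high gate score for an expert in the other modality's pool is ignored, and exactly one expert runs per token.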

[NLP-56] Alignment or Integration? Rethinking Multimodal Fusion in DNA-language Foundation Models

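[Quick Read]: This paper addresses how to fuse DNA sequences with natural language effectively. The core challenge is that existing methods mostly use late-stage fusion via embedding-level alignment, which compresses the fine-grained structure of genomic sequences into fixed representations and limits the model's ability to reason over token-level genomic features. The key to the solution is two new fusion mechanisms: SeqCLIP, which strengthens embedding-level alignment through sequence-level contrastive pre-training, and OneVocab, which integrates genomic k-mers directly into the language model's vocabulary. Experiments show that early vocabulary-level integration yields more expressive and effective joint DNA-language representations than embedding-level alignment.

Link: https://arxiv.org/abs/2602.12286
Authors: Yanan Li, Christina Yi Jin, Yuan Jin, Manli Luo, Tie Xu, Shuai Jiao, Wei He, Qing Zhang
Affiliations: Zhejiang Lab (浙江实验室)
Categories: Genomics (q-bio.GN); Computation and Language (cs.CL)
Comments: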

Abstract:Fusing DNA foundation models with large language models (LLMs) for DNA-language reasoning raises a fundamental question: at what level should genomic sequences and natural language interact? Most existing approaches encode DNA sequences and text separately and rely on embedding-level alignment to connect the two modalities. Such late-stage fusion compresses rich genomic sequences into fixed representations, limiting the model’s ability to reason over fine-grained, token-level genomic structure. In this work, we propose two new methods for DNA-language fusion, i.e., a semantic alignment method SeqCLIP and a vocabulary-level integration method OneVocab. SeqCLIP strengthens embedding-level alignment via sequence-level contrastive pre-training, and OneVocab directly integrates genomic k-mers into the language model’s existing vocabulary. Comprehensive experiments on classification and reasoning tasks show that, while various alignment strategies improve embedding-level fusion, early vocabulary-level integration yields more expressive and effective representations for DNA-language modeling.
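
As a rough illustration of vocabulary-level integration, the sketch below splits a DNA string into k-mers and appends unseen k-mers to an existing text vocabulary, so that text and DNA share one token id space. The non-overlapping split, the toy vocabulary, and the helper names are assumptions made for the example; the abstract does not state how OneVocab chunks sequences or assigns ids.

```python
from typing import Dict, List


def kmers(seq: str, k: int) -> List[str]:
    """Split a DNA string into non-overlapping k-mers (the tail may be shorter)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def extend_vocab(vocab: Dict[str, int], tokens: List[str]) -> Dict[str, int]:
    """Return a copy of vocab with unseen tokens appended under new ids.

    Assumes existing ids are contiguous from 0, so len(vocab) is the next free id.
    """
    out = dict(vocab)
    for tok in tokens:
        if tok not in out:
            out[tok] = len(out)
    return out


def encode(words: List[str], dna: str, k: int, vocab: Dict[str, int]) -> List[int]:
    """Encode text words followed by DNA k-mers with a single shared vocabulary."""
    return [vocab[t] for t in words + kmers(dna, k)]


text_vocab = {"the": 0, "promoter": 1, "is": 2}   # toy text vocabulary
dna = "ACGTACACG"
vocab = extend_vocab(text_vocab, kmers(dna, 3))
ids = encode(["the", "promoter", "is"], dna, 3, vocab)
```

With a shared vocabulary, each k-mer is an ordinary token in the language model's input, which is what allows token-level interaction between genomic and textual context.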

Information Retrieval

[IR-0] Fix Before Search: Benchmarking Agentic Query Visual Pre-processing in Multimodal Retrieval-augmented Generation

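[Quick Read]: This paper addresses the lack of Visual Query Pre-processing (V-QPP) in Multimodal Retrieval-Augmented Generation (MRAG) systems. Existing MRAG pipelines usually treat visual inputs as static and noise-free, but real-world visual queries are often "imperfect", with geometric distortions, quality degradation, or semantic ambiguity, which leads to retrieval failures and even the collapse of end-to-end performance. The key to the solution is V-QPP-Bench, the first comprehensive benchmark dedicated to visual query pre-processing, which formulates V-QPP as an agentic decision-making task in which multimodal large language models (MLLMs) must autonomously diagnose visual imperfections and invoke perceptual tools to refine the query. Experiments show that, with supervised fine-tuning, compact models can match or exceed larger proprietary models, demonstrating the benchmark's value for improving MRAG robustness.

Link: https://arxiv.org/abs/2602.13179
Authors: Jiankun Zhang, Shenglai Zeng, Kai Guo, Xinnan Dai, Hui Liu, Jiliang Tang, Yi Chang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR)
Comments: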

Abstract:Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a key paradigm for grounding MLLMs with external knowledge. While query pre-processing (e.g., rewriting) is standard in text-based RAG, existing MRAG pipelines predominantly treat visual inputs as static and immutable, implicitly assuming they are noise-free. However, real-world visual queries are often "imperfect", suffering from geometric distortions, quality degradation, or semantic ambiguity, leading to catastrophic retrieval failures. To address this gap, we propose V-QPP-Bench, the first comprehensive benchmark dedicated to Visual Query Pre-processing (V-QPP). We formulate V-QPP as an agentic decision-making task where MLLMs must autonomously diagnose imperfections and deploy perceptual tools to refine queries. Our extensive evaluation across 46,700 imperfect queries and diverse MRAG paradigms reveals three critical insights: (1) Vulnerability: visual imperfections severely degrade both retrieval recall and end-to-end MRAG performance; (2) Restoration Potential & Bottleneck: while oracle preprocessing recovers near-perfect performance, off-the-shelf MLLMs struggle with tool selection and parameter prediction without specialized training; and (3) Training Enhancement: supervised fine-tuning enables compact models to achieve comparable or superior performance to larger proprietary models, demonstrating the benchmark’s value for developing robust MRAG systems. The code is available at this https URL

[IR-1] Asynchronous Verified Semantic Caching for Tiered LLM Architectures

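[Quick Read]: This paper addresses the high inference cost and latency caused by frequent calls to Large Language Models (LLMs) in search, assistance, and agentic workflows, focusing on semantic caching policies that raise the cache hit rate while keeping responses accurate. Production caches typically use a tiered static-dynamic design, but both tiers share a single embedding similarity threshold: a conservative threshold misses safe reuse opportunities, while an aggressive one risks serving semantically incorrect responses. The key to the solution is Krites, an asynchronous, LLM-judged caching policy. When a query's nearest static cache entry falls just below the threshold, Krites asynchronously invokes an LLM judge to verify whether that entry is acceptable; if it is, the response is promoted into the dynamic cache, extending the effective coverage of the static cache without affecting critical-path latency. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers, either as direct static hits or as verified promotions, by up to 3.9 times.

Link: https://arxiv.org/abs/2602.13165
Authors: Asmit Kumar Singh, Haozhe Wang, Laxmi Naga Santosh Attaluri, Tak Chiam, Weihua Zhu
Affiliations: Apple (苹果)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: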

Abstract:Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When the nearest static neighbor of the prompt falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, allowing future repeats and paraphrases to reuse curated static answers and expanding static reach over time. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers (direct static hits plus verified promotions) by up to 3.9 times for conversational traffic and search-style queries relative to tuned baselines, with unchanged critical path latency.
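
The policy described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's implementation: the class name `TieredCacheSketch`, the `margin` parameter that defines "just below the threshold", the exact-match dynamic tier, the tier lookup order, and the synchronous `run_judge` method (standing in for an asynchronous worker and a real LLM judge) are all hypothetical.

```python
import math
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TieredCacheSketch:
    def __init__(self, static_entries: List[Tuple[List[float], str]],
                 tau: float, margin: float,
                 judge: Callable[[str, str], bool]) -> None:
        self.static_entries = static_entries   # curated (embedding, response) pairs
        self.dynamic: Dict[str, str] = {}      # simplification: exact-match keys
        self.tau = tau                         # static similarity threshold
        self.margin = margin                   # width of the "just below" band
        self.judge = judge                     # stand-in for the LLM judge
        self.pending: Deque[Tuple[str, str]] = deque()

    def serve(self, prompt: str, emb: List[float],
              backend: Callable[[str], str]) -> Tuple[str, str]:
        """Critical path: the same serving decisions as a plain threshold policy."""
        sim, response = max((cosine(emb, e), r) for e, r in self.static_entries)
        if sim >= self.tau:
            return response, "static"
        if prompt in self.dynamic:
            return self.dynamic[prompt], "dynamic"
        if sim >= self.tau - self.margin:
            # Near miss: queue for verification without delaying this request.
            self.pending.append((prompt, response))
        answer = backend(prompt)
        self.dynamic[prompt] = answer
        return answer, "backend"

    def run_judge(self) -> None:
        """Off the critical path: promote approved static answers to the dynamic tier."""
        while self.pending:
            prompt, response = self.pending.popleft()
            if self.judge(prompt, response):
                self.dynamic[prompt] = response


cache = TieredCacheSketch([([1.0, 0.0], "curated answer")],
                          tau=0.9, margin=0.2, judge=lambda p, r: True)
first = cache.serve("how do I reset my password?", [0.8, 0.6],
                    backend=lambda p: "generated answer")
cache.run_judge()
second = cache.serve("how do I reset my password?", [0.8, 0.6],
                     backend=lambda p: "generated answer")
```

The first request is a near miss (similarity about 0.8 against a threshold of 0.9), so it is answered by the backend exactly as a plain threshold policy would. Only after the judge approves the match does a repeat of the prompt receive the curated static answer from the dynamic tier.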

[IR-2] Awakening Dormant Users: Generative Recommendation with Counterfactual Functional Role Reasoning

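[Quick Read]: This paper addresses the challenge of activating "dormant users", users who remain engaged but rarely convert, on large-scale e-commerce platforms. Existing methods rely on single-step estimates of an item's intrinsic value (for example, immediate click probability) and overlook the instrumental role items play along a user's conversion path, where specific interactions act as triggers that shape latent intent and drive subsequent decisions. The key to the solution is the RoleGen framework, which combines an LLM-based Conversion Trajectory Reasoner with a Generative Behavioral Backbone. The Reasoner explicitly models the context-dependent Functional Role of items and uses counterfactual inference to simulate diverse conversion paths, mitigating interest collapse; the backbone is optimized through a closed-loop "Reasoning-Execution-Feedback-Reflection" strategy to keep execution grounded. On the Kuaishou e-commerce platform, RoleGen achieves a 6.2% gain in Recall@1 and a 7.3% increase in online order volume.

Link: https://arxiv.org/abs/2602.13134
Authors: Huishi Luo, Shuokai Li, Hanchen Yang, Zhongbo Sun, Haojie Ding, Boheng Zhang, Zijia Cai, Renliang Qian, Fan Yang, Tingting Gao, Chenyi Lei, Wenwu Ou, Fuzhen Zhuang
Affiliations: Beihang University (北京航空航天大学); Kuaishou Technology (快手科技)
Categories: Information Retrieval (cs.IR)
Comments: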

Abstract:Awakening dormant users, who remain engaged but exhibit low conversion, is a pivotal driver for incremental GMV growth in large-scale e-commerce platforms. However, existing approaches often yield suboptimal results since they typically rely on single-step estimation of an item’s intrinsic value (e.g., immediate click probability). This mechanism overlooks the instrumental effect of items, where specific interactions act as triggers to shape latent intent and drive subsequent decisions along a conversion trajectory. To bridge this gap, we propose RoleGen, a novel framework that synergizes a Conversion Trajectory Reasoner with a Generative Behavioral Backbone. Specifically, the LLM-based Reasoner explicitly models the context-dependent Functional Role of items to reconstruct intent evolution. It further employs counterfactual inference to simulate diverse conversion paths, effectively mitigating interest collapse. These reasoned candidate items are integrated into the generative backbone, which is optimized via a collaborative “Reasoning-Execution-Feedback-Reflection” closed-loop strategy to ensure grounded execution. Extensive offline experiments and online A/B testing on the Kuaishou e-commerce platform demonstrate that RoleGen achieves a 6.2% gain in Recall@1 and a 7.3% increase in online order volume, confirming its effectiveness in activating the dormant user base.

[IR-3] RGAlign-Rec: Ranking-Guided Alignment for Latent Query Reasoning in Recommendation Systems

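[Quick Read]: This paper addresses two core challenges of proactive intent prediction in modern e-commerce chatbots: the semantic gap between discrete user features and the semantic intents in the chatbot's knowledge base, and the objective misalignment between general-purpose large language model (LLM) outputs and task-specific ranking objectives. The authors propose RGAlign-Rec, a closed-loop alignment framework that combines an LLM-based semantic reasoner with a Query-Enhanced (QE) ranking model, and introduce Ranking-Guided Alignment (RGA), a multi-stage training paradigm that uses downstream ranking signals as feedback to refine the LLM's latent reasoning. This design synchronizes semantic reasoning with ranking objectives and improves both prediction accuracy and service quality in real-world proactive recommendation.

Link: https://arxiv.org/abs/2602.12968
Authors: Junhua Liu, Yang Jihao, Cheng Chang, Kunrong LI, Bin Fu, Kwan Hui Lim
Affiliations: Forth AI (福尔特人工智能); Shopee (虾皮); Singapore University of Technology and Design (新加坡科技设计大学)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: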

Abstract:Proactive intent prediction is a critical capability in modern e-commerce chatbots, enabling “zero-query” recommendations by anticipating user needs from behavioral and contextual signals. However, existing industrial systems face two fundamental challenges: (1) the semantic gap between discrete user features and the semantic intents within the chatbot’s Knowledge Base, and (2) the objective misalignment between general-purpose LLM outputs and task-specific ranking utilities. To address these issues, we propose RGAlign-Rec, a closed-loop alignment framework that integrates an LLM-based semantic reasoner with a Query-Enhanced (QE) ranking model. We also introduce Ranking-Guided Alignment (RGA), a multi-stage training paradigm that utilizes downstream ranking signals as feedback to refine the LLM’s latent reasoning. Extensive experiments on a large-scale industrial dataset from Shopee demonstrate that RGAlign-Rec achieves a 0.12% gain in GAUC, leading to a significant 3.52% relative reduction in error rate, and a 0.56% improvement in Recall@3. Online A/B testing further validates the cumulative effectiveness of our framework: the Query-Enhanced model (QE-Rec) initially yields a 0.98% improvement in CTR, while the subsequent Ranking-Guided Alignment stage contributes an additional 0.13% gain. These results indicate that ranking-aware alignment effectively synchronizes semantic reasoning with ranking objectives, significantly enhancing both prediction accuracy and service quality in real-world proactive recommendation systems.

[IR-4] JARVIS: An Evidence-Grounded Retrieval System for Interpretable Deceptive Reviews Adjudication

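[Quick Read]: This paper addresses the detection of deceptive reviews in e-commerce ecosystems, targeting the limited generalization and poor interpretability of existing detection methods. The key to the solution is the JARVIS framework, which retrieves semantically similar evidence through hybrid dense-sparse multimodal retrieval, expands relational signals through shared entities to build a heterogeneous evidence graph, and then has a large language model perform evidence-grounded adjudication, yielding accurate and interpretable risk assessments.

Link: https://arxiv.org/abs/2602.12941
Authors: Nan Lu, Leyang Li, Yurong Hu, Rui Lin, Shaoyi Xu
Affiliations: Beijing Jiaotong University (北京交通大学); JD.com (京东)
Categories: Information Retrieval (cs.IR)
Comments: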

Abstract:Deceptive reviews refer to fabricated feedback designed to artificially manipulate the perceived quality of products. Within modern e-commerce ecosystems, these reviews remain a critical governance challenge. Despite advances in review-level and graph-based detection methods, two pivotal limitations remain: inadequate generalization and lack of interpretability. To address these challenges, we propose JARVIS, a framework providing Judgment via Augmented Retrieval and eVIdence graph Structures. Starting from the review to be evaluated, it retrieves semantically similar evidence via hybrid dense-sparse multimodal retrieval, expands relational signals through shared entities, and constructs a heterogeneous evidence graph. A large language model then performs evidence-grounded adjudication to produce interpretable risk assessments. Offline experiments demonstrate that JARVIS enhances performance on our constructed review dataset, achieving a precision increase from 0.953 to 0.988 and a recall boost from 0.830 to 0.901. In the production environment, our framework achieves a 27% increase in the recall volume and reduces manual inspection time by 75%. Furthermore, the adoption rate of the model-generated analysis reaches 96.4%.
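
To put the reported offline precision and recall on a single scale, the short calculation below derives the corresponding F1 scores. These F1 values are computed here from the numbers in the abstract and are not reported by the authors; the labels "before" and "after" simply follow the "from ... to ..." wording.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_before = f1(0.953, 0.830)   # precision and recall before the reported improvement
f1_after = f1(0.988, 0.901)    # precision and recall reported for JARVIS
gain = f1_after - f1_before
```

Under this derivation, F1 rises from roughly 0.887 to roughly 0.942, with most of the gain coming from recall.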

[IR-5] WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech and Metadata WWW

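[Quick Read]: This paper addresses the difficulty of searching multimodal data (images, video, and audio) efficiently in practice, and in particular the lack of easy-to-use tools for users without machine learning expertise. The key to the solution is WISE, an open-source audiovisual search engine that integrates retrieval across several modalities (natural-language queries, reverse-image search, face search, audio event retrieval, and search over automatically transcribed speech), uses vector search to scale to millions of images or thousands of hours of video, and supports combined cross-modal queries for more precise results. Its modular architecture makes it easy to add new models, and it can be deployed locally for private or sensitive collections; it has been applied to a range of real-world use cases.

Link: https://arxiv.org/abs/2602.12819
Authors: Prasanna Sridhar, Horace Lee, David M. S. Pinto, Andrew Zisserman, Abhishek Dutta
Affiliations: University of Oxford (牛津大学)
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Software: this https URL, Online demos: this https URL, Example Queries: this https URL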

Abstract:In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. WISE supports natural-language and reverse-image queries at both the scene level (e.g. empty street) and object level (e.g. horse) across images and videos; face-based search for specific individuals; audio retrieval of acoustic events using text (e.g. wood creak) or an audio file; search over automatically transcribed speech; and filtering by user-provided metadata. Rich insights can be obtained by combining queries across modalities – for example, retrieving German trains from a historical archive by applying the object query “train” and the metadata query “Germany”, or searching for a face in a place. By employing vector search techniques, WISE can scale to support efficient retrieval over millions of images or thousands of hours of video. Its modular architecture facilitates the integration of new models. WISE can be deployed locally for private or sensitive collections, and has been applied to various real-world use cases. Our code is open-source and available at this https URL.
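
The combined query in the abstract (object query "train" plus metadata query "Germany") can be pictured as a metadata filter followed by vector ranking. The sketch below is a generic illustration, not WISE's API: it uses brute-force cosine similarity where a real system would use an approximate nearest-neighbour index, and the two-dimensional embeddings, item ids, and `country` field are made up for the example.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query_vec: List[float], items: List[dict],
           metadata: Optional[Dict[str, str]] = None, top_k: int = 3) -> List[str]:
    """Filter items by metadata, then rank the remainder by cosine similarity."""
    wanted = metadata or {}
    pool = [it for it in items
            if all(it["meta"].get(k) == v for k, v in wanted.items())]
    ranked = sorted(pool, key=lambda it: cosine(query_vec, it["vec"]), reverse=True)
    return [it["id"] for it in ranked[:top_k]]


# Hypothetical index: embeddings would come from a vision-language model.
ITEMS = [
    {"id": "train_de", "vec": [1.0, 0.0], "meta": {"country": "Germany"}},
    {"id": "train_fr", "vec": [0.9, 0.1], "meta": {"country": "France"}},
    {"id": "horse_de", "vec": [0.0, 1.0], "meta": {"country": "Germany"}},
]

train_query = [1.0, 0.0]   # stand-in for the embedding of the text query "train"
german_trains = search(train_query, ITEMS, {"country": "Germany"}, top_k=1)
all_trains = search(train_query, ITEMS, top_k=2)
```

Filtering before ranking keeps the similarity search restricted to items that satisfy the metadata constraint, so the top result is the German train rather than merely the most train-like item overall.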

[IR-6] SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

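[Quick Read]: This paper addresses the inability of existing spoken query retrieval datasets to test robustness under complex acoustic perturbations: current datasets are mostly limited to simple queries and constrained noise conditions, so they cannot fully assess system performance in realistic environments. The key to the solution is the SQuTR benchmark, which comprises a large-scale, diverse spoken query dataset (37,317 queries drawn from six commonly used English and Chinese text retrieval datasets) and a unified evaluation protocol. Speech is synthesized using voice profiles from 200 real speakers and mixed with 17 categories of real-world environmental noise at controlled signal-to-noise ratio (SNR) levels, enabling reproducible robustness tests from quiet to highly noisy conditions and providing a systematic basis for comparing and diagnosing end-to-end and cascaded retrieval systems.

Link: https://arxiv.org/abs/2602.12783
Authors: Yuejie Li, Ke Yang, Yueying Hua, Berlin Chen, Jianhao Nie, Yueping He, Caixin Kang
Affiliations: Huazhong University of Science and Technology (华中科技大学); The University of Hong Kong (香港大学); Soochow University (苏州大学); University of Science and Technology of China (中国科学技术大学); Wuhan University (武汉大学); Tsinghua University (清华大学); The University of Tokyo (东京大学)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: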

Abstract:Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex acoustic perturbations. To address this limitation, we present SQuTR, a robustness benchmark for spoken query retrieval that includes a large-scale dataset and a unified evaluation protocol. SQuTR aggregates 37,317 unique queries from six commonly used English and Chinese text retrieval datasets, spanning multiple domains and diverse query types. We synthesize speech using voice profiles from 200 real speakers and mix 17 categories of real-world environmental noise under controlled SNR levels, enabling reproducible robustness evaluation from quiet to highly noisy conditions. Under the unified protocol, we conduct large-scale evaluations on representative cascaded and end-to-end retrieval systems. Experimental results show that retrieval performance decreases as noise increases, with substantially different drops across systems. Even large-scale retrieval models struggle under extreme noise, indicating that robustness remains a critical bottleneck. Overall, SQuTR provides a reproducible testbed for benchmarking and diagnostic analysis, and facilitates future research on robustness in spoken query to text retrieval.
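
Mixing noise at a controlled SNR usually means scaling the noise so that the ratio of speech power to noise power hits a target value in decibels. The sketch below shows that standard calculation on toy sample lists; the function names and signals are illustrative, and the benchmark's actual mixing pipeline is not described in the abstract.

```python
import math
from typing import List


def power(x: List[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech: List[float], noise: List[float],
               target_snr_db: float) -> List[float]:
    """Scale noise so 10*log10(P_speech / P_scaled_noise) equals the target, then add."""
    p_noise = power(noise)
    if p_noise == 0.0:
        raise ValueError("noise must have non-zero power")
    scale = math.sqrt(power(speech) / (p_noise * 10 ** (target_snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech: List[float], mixed: List[float]) -> float:
    """Recover the SNR of a mixture given the clean speech."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(power(speech) / power(residual))


speech = [1.0, -1.0, 1.0, -1.0]    # toy "speech" with power 1.0
noise = [0.5, 0.5, -0.5, -0.5]     # toy noise with power 0.25

quiet = mix_at_snr(speech, noise, 20.0)   # noise 20 dB below speech
noisy = mix_at_snr(speech, noise, 0.0)    # noise as loud as speech
```

Sweeping `target_snr_db` from high to low values with the same clean utterance and noise clip gives a reproducible series from quiet to highly noisy conditions.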

[IR-7] Training Dense Retrievers with Multiple Positive Passages

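[Quick Read]: This paper addresses the false-negative noise and suboptimal supervision that sparse single-positive annotations introduce into retriever training, and asks how best to exploit dense multi-positive relevance labels now that they can be collected at scale. The key to the solution is a unified contrastive learning framework for systematically analyzing several optimization objectives (the Joint Likelihood, Summed Marginal Likelihood, and Log-Sum-Exp Pairwise losses), combining theory and experiments to characterize how the gradients of each loss allocate probability mass across positives. Experiments show that the LSEPair loss is the most robust and best-performing across realistic settings, while JointLH and SumMargLH are sensitive to the quality of positives, yielding practical design principles for training retrievers with LLM-augmented supervision.

Link: https://arxiv.org/abs/2602.12727
Authors: Benben Wang, Minghao Tang, Hengran Zhang, Jiafeng Guo, Keping Bi
Affiliations: Xidian University (西安电子科技大学); State Key Laboratory of AI Safety (人工智能安全国家重点实验室); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学)
Categories: Information Retrieval (cs.IR)
Comments: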

Abstract:Modern knowledge-intensive systems, such as retrieval-augmented generation (RAG), rely on effective retrievers to establish the performance ceiling for downstream modules. However, retriever training has been bottlenecked by sparse, single-positive annotations, which lead to false-negative noise and suboptimal supervision. While the advent of large language models (LLMs) makes it feasible to collect comprehensive multi-positive relevance labels at scale, the optimal strategy for incorporating these dense signals into training remains poorly understood. In this paper, we present a systematic study of multi-positive optimization objectives for retriever training. We unify representative objectives, including Joint Likelihood (JointLH), Summed Marginal Likelihood (SumMargLH), and Log-Sum-Exp Pairwise (LSEPair) loss, under a shared contrastive learning framework. Our theoretical analysis characterizes their distinct gradient behaviors, revealing how each allocates probability mass across positive document sets. Empirically, we conduct extensive evaluations on Natural Questions, MS MARCO, and the BEIR benchmark across two realistic regimes: homogeneous LLM-annotated data and heterogeneous mixtures of human and LLM labels. Our results show that LSEPair consistently achieves superior robustness and performance across settings, while JointLH and SumMargLH exhibit high sensitivity to the quality of positives. Furthermore, we find that the simple strategy of random sampling (Rand1LH) serves as a reliable baseline. By aligning theoretical insights with empirical findings, we provide practical design principles for leveraging dense, LLM-augmented supervision to enhance retriever effectiveness.
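
The abstract names three objectives without defining them. The sketch below uses forms that are common in the multi-positive contrastive learning literature, and these forms are assumptions about what the names denote: JointLH as the negative sum of log-softmax terms over all positives, SumMargLH as the negative log of the total softmax mass on positives, and LSEPair as log(1 + sum over positive-negative pairs of exp(s_neg - s_pos)). The paper's exact definitions may differ.

```python
import math
from typing import List


def logsumexp(xs: List[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def joint_lh(pos: List[float], neg: List[float]) -> float:
    """Assumed JointLH: every positive must receive high softmax probability."""
    z = logsumexp(pos + neg)
    return -sum(s - z for s in pos)


def sum_marg_lh(pos: List[float], neg: List[float]) -> float:
    """Assumed SumMargLH: only the total probability mass on positives matters."""
    return logsumexp(pos + neg) - logsumexp(pos)


def lse_pair(pos: List[float], neg: List[float]) -> float:
    """Assumed LSEPair: smooth penalty over all (positive, negative) score gaps."""
    return logsumexp([0.0] + [n - p for p in pos for n in neg])


# With a single positive, all three assumed forms reduce to the usual InfoNCE loss.
single = (joint_lh([2.0], [1.0, 0.0]),
          sum_marg_lh([2.0], [1.0, 0.0]),
          lse_pair([2.0], [1.0, 0.0]))

# One strong positive and one weakly scored positive (for example, a noisy label).
pos, neg = [5.0, -5.0], [0.0]
losses = {"JointLH": joint_lh(pos, neg),
          "SumMargLH": sum_marg_lh(pos, neg),
          "LSEPair": lse_pair(pos, neg)}
```

Under these assumed forms, the three losses differ in how they treat the weakly scored positive: the joint form is penalized heavily for it, while the summed-marginal form is nearly satisfied by the single strong positive. This is only a toy illustration of why the choice of objective matters, not a reproduction of the paper's analysis.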

[IR-8] Self-EvolveRec: Self-Evolving Recommender Systems with LLM-based Directional Feedback

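[Quick Read]: This paper addresses two limitations: traditional methods for automating recommender system design, such as Neural Architecture Search (NAS), are confined to a fixed search space defined by human priors and therefore struggle to innovate, while current Large Language Model (LLM)-driven code evolution frameworks rely only on scalar metrics (such as NDCG and Hit Ratio) that offer no qualitative insight or direction for improvement. The key to the solution is the Self-EvolveRec framework, which builds a directional feedback loop by combining a User Simulator for qualitative feedback with a Model Diagnosis Tool for quantitative internal verification. It also introduces a Diagnosis Tool - Model Co-Evolution strategy so that evaluation criteria adapt dynamically as the recommendation architecture evolves. Self-EvolveRec significantly outperforms existing NAS and LLM-driven baselines in both recommendation performance and user satisfaction.

Link: https://arxiv.org/abs/2602.12612
Authors: Sein Kim, Sangwu Park, Hongseok Kang, Wonjoong Kim, Jimin Seo, Yeonjun In, Kanghoon Yoon, Chanyoung Park
Affiliations: KAIST (韩国科学技术院)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: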

Abstract:Traditional methods for automating recommender system design, such as Neural Architecture Search (NAS), are often constrained by a fixed search space defined by human priors, limiting innovation to pre-defined operators. While recent LLM-driven code evolution frameworks shift the target from fixed search spaces to open-ended program spaces, they primarily rely on scalar metrics (e.g., NDCG, Hit Ratio) that fail to provide qualitative insights into model failures or directional guidance for improvement. To address this, we propose Self-EvolveRec, a novel framework that establishes a directional feedback loop by integrating a User Simulator for qualitative critiques and a Model Diagnosis Tool for quantitative internal verification. Furthermore, we introduce a Diagnosis Tool - Model Co-Evolution strategy to ensure that evaluation criteria dynamically adapt as the recommendation architecture evolves. Extensive experiments demonstrate that Self-EvolveRec significantly outperforms state-of-the-art NAS and LLM-driven code evolution baselines in both recommendation performance and user satisfaction. Our code is available at this https URL.

[IR-9] RQ-GMM: Residual Quantized Gaussian Mixture Model for Multimodal Semantic Discretization in CTR Prediction

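[Quick Read]: This paper addresses the poor results obtained when continuous embeddings from pre-trained models are fed directly into click-through rate (CTR) prediction models for multimodal content, which stem from misaligned optimization objectives and inconsistent convergence speeds during joint training. Existing discretization methods help but suffer from low codebook utilization, poor reconstruction accuracy, and weak semantic discriminability. The key to the solution is RQ-GMM (Residual Quantized Gaussian Mixture Model), which introduces probabilistic modeling to better capture the statistical structure of multimodal embedding spaces; by combining Gaussian Mixture Models (GMMs) with residual quantization, it improves codebook utilization and reconstruction accuracy and in turn CTR prediction performance.

Link: https://arxiv.org/abs/2602.12593
Authors: Ziye Tong, Jiahao Liu, Weimin Zhang, Hongji Ruan, Derick Tang, Zhanpeng Zeng, Qinsong Zeng, Peng Zhang, Tun Lu, Ning Gu
Affiliations: Tencent (腾讯); Fudan University (复旦大学); Beijing Jiaotong University (北京交通大学)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Under review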

Abstract:Multimodal content is crucial for click-through rate (CTR) prediction. However, directly incorporating continuous embeddings from pre-trained models into CTR models yields suboptimal results due to misaligned optimization objectives and convergence speed inconsistency during joint training. Discretizing embeddings into semantic IDs before feeding them into CTR models offers a more effective solution, yet existing methods suffer from limited codebook utilization, reconstruction accuracy, and semantic discriminability. We propose RQ-GMM (Residual Quantized Gaussian Mixture Model), which introduces probabilistic modeling to better capture the statistical structure of multimodal embedding spaces. Through Gaussian Mixture Models combined with residual quantization, RQ-GMM achieves superior codebook utilization and reconstruction accuracy. Experiments on public datasets and online A/B tests on a large-scale short-video platform serving hundreds of millions of users demonstrate substantial improvements: RQ-GMM yields a 1.502% gain in Advertiser Value over strong baselines. The method has been fully deployed, serving daily recommendations for hundreds of millions of users.
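
The idea of turning an embedding into a semantic ID via residual quantization over Gaussian components can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact formulation, so the isotropic components, the hard maximum-posterior assignment, and the names `encode` and `decode` are all hypothetical, and the codebooks are given rather than fitted.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I) at x."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq_dist / var)

def encode(x, levels):
    """Residual quantization: each level is a list of (weight, mean, var) components.

    At every level, pick the component with the highest posterior for the
    current residual, emit its index, and subtract its mean.
    """
    residual = list(x)
    codes = []
    for components in levels:
        scores = [math.log(w) + log_gauss(residual, m, v) for (w, m, v) in components]
        idx = max(range(len(scores)), key=scores.__getitem__)
        codes.append(idx)
        residual = [r - m for r, m in zip(residual, components[idx][1])]
    return codes, residual

def decode(codes, levels):
    """Reconstruct the embedding as the sum of the selected component means."""
    out = [0.0] * len(levels[0][0][1])
    for idx, components in zip(codes, levels):
        out = [o + m for o, m in zip(out, components[idx][1])]
    return out

# Two levels with two components each (toy, hand-picked values).
levels = [
    [(0.5, [0.0, 0.0], 1.0), (0.5, [4.0, 4.0], 1.0)],
    [(0.5, [-0.5, -0.5], 0.25), (0.5, [0.5, 0.5], 0.25)],
]
codes, residual = encode([4.4, 4.6], levels)
print(codes, decode(codes, levels))
```

The list of indices in `codes` plays the role of the semantic ID that would be fed to a CTR model, while the final residual measures reconstruction error.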

[IR-10] CAPTS: Channel-Aware Preference-Aligned Trigger Selection for Multi-Channel Item-to-Item Retrieval

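[Quick read]: This paper targets two core problems in multi-channel retrieval for large-scale industrial recommender systems. The first is biased value attribution: conventional methods value a trigger by its own immediate feedback rather than by its actual utility as a seed for downstream item-to-item (I2I) retrieval. The second is uncoordinated multi-channel routing: channels select triggers independently under a shared trigger quota, which leads to overlapping coverage and wasted resources. The key to the solution is CAPTS (Channel-Aware, Preference-Aligned Trigger Selection), a unified and flexible framework with two main components: 1) a Value Attribution Module (VAM) that uses look-ahead supervision to credit each trigger with the subsequent user engagement it drives on each I2I channel, giving a more accurate valuation of triggers; and 2) a Channel-Adaptive Trigger Routing (CATR) module that coordinates the assignment of triggers to channels to maximize the overall value of multi-channel retrieval. Offline experiments and online A/B tests on the Kwai platform validate the approach, which improves multi-channel recall and delivers a +0.351% lift in average time spent per device.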

Link: https://arxiv.org/abs/2602.12564
Authors: Xiaoyou Zhou,Yuqi Liu,Zhao Liu,Xiao Lv,Bo Chen,Ruiming Tang,Guorui Zhou
Affiliations: Kuaishou Technology
Categories: Information Retrieval (cs.IR)
Comments: 10 pages, 6 figures

Abstract:Large-scale industrial recommender systems commonly adopt multi-channel retrieval for candidate generation, combining direct user-to-item (U2I) retrieval with two-hop user-to-item-to-item (U2I2I) pipelines. In U2I2I, the system selects a small set of historical interactions as triggers to seed downstream item-to-item (I2I) retrieval across multiple channels. In production, triggers are often selected using rule-based policies or learned scorers and tuned in a channel-by-channel manner. However, these practices face two persistent challenges: biased value attribution that values triggers by on-trigger feedback rather than their downstream utility as retrieval seeds, and uncoordinated multi-channel routing where channels select triggers independently under a shared quota, increasing cross-channel overlap. To address these challenges, we propose Channel-Aware, Preference-Aligned Trigger Selection (CAPTS), a unified and flexible framework that treats multi-channel trigger selection as a learnable routing problem. CAPTS introduces a Value Attribution Module (VAM) that provides look-ahead supervision by crediting each trigger with the subsequent engagement generated by items retrieved from it on each I2I channel, and a Channel-Adaptive Trigger Routing (CATR) module that coordinates trigger-to-channel assignment to maximize the overall value of multi-channel retrieval. Extensive offline experiments and large-scale online A/B tests on Kwai, Kuaishou’s international short-video platform, show that CAPTS consistently improves multi-channel recall offline and delivers a +0.351% lift in average time spent per device online.
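
To make the routing problem concrete, the sketch below treats trigger selection as a coordinated assignment under a shared quota. It is a simple greedy heuristic, not the learned CATR module from the paper; the value table, the per-channel cap, and the function name `route_triggers` are assumptions introduced for illustration.

```python
def route_triggers(values, quota, channel_cap):
    """Greedy trigger-to-channel assignment under a shared quota.

    values: {trigger: {channel: attributed_value}} (hypothetical VAM-style scores).
    quota: total number of triggers that may be selected across all channels.
    channel_cap: maximum number of triggers any single channel may receive.
    Each trigger is routed to at most one channel, which avoids cross-channel overlap.
    """
    pairs = sorted(
        ((v, t, c) for t, per_channel in values.items() for c, v in per_channel.items()),
        key=lambda p: (-p[0], p[1], p[2]),  # highest value first, ties by name
    )
    assignment = {}
    load = {}
    for value, trigger, channel in pairs:
        if len(assignment) >= quota:
            break
        if trigger in assignment or load.get(channel, 0) >= channel_cap:
            continue
        assignment[trigger] = channel
        load[channel] = load.get(channel, 0) + 1
    return assignment

values = {
    "t1": {"cf": 0.9, "emb": 0.8},
    "t2": {"cf": 0.85, "emb": 0.2},
    "t3": {"cf": 0.1, "emb": 0.7},
}
print(route_triggers(values, quota=2, channel_cap=1))
```

With independent per-channel selection, both channels would pick `t1`; the coordinated version spends the second quota slot on `t3` instead.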

[IR-11] Reasoning to Rank: An End-to-End Solution for Exploiting Large Language Models for Recommendation

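[Quick read]: This paper addresses how to optimize a model for recommendation utility, in particular how to optimize the reasoning process end to end when large language models (LLMs) are used for recommendation. The key to the solution is an end-to-end training framework called "Reasoning to Rank", which internalizes recommendation utility optimization into the learning of step-by-step LLM reasoning. It reasons at the user-item level to avoid position bias and uses reinforcement learning to optimize the whole reasoning process directly, significantly improving recommendation quality.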

Link: https://arxiv.org/abs/2602.12530
Authors: Kehan Zheng,Deyao Hong,Qian Li,Jun Zhang,Huan Yu,Jie Jiang,Hongning Wang
Affiliations: Tsinghua University; Tencent Inc.
Categories: Information Retrieval (cs.IR)
Comments:

Abstract: Recommender systems are tasked to infer users’ evolving preferences and rank items aligned with their intents, which calls for in-depth reasoning beyond pattern-based scoring. Recent efforts have started to leverage large language models (LLMs) for recommendation, but how to effectively optimize the model for improved recommendation utility is still underexplored. In this work, we propose Reasoning to Rank, an end-to-end training framework that internalizes recommendation utility optimization into the learning of step-by-step reasoning in LLMs. To avoid position bias in LLM reasoning and enable direct optimization of the reasoning process, our framework performs reasoning at the user-item level and employs reinforcement learning for end-to-end training of the LLM. Experiments on three Amazon datasets and a large-scale industrial dataset showed consistent gains over strong conventional and LLM-based solutions. Extensive in-depth analyses validate the necessity of the key components in the proposed framework and shed light on the future developments of this line of work.

[IR-12] DiffuRank: Effective Document Reranking with Diffusion Language Models

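[Quick read]: This paper addresses the inefficiency and inflexibility of current document rerankers based on large language models (LLMs), which stem from their reliance on autoregressive generation. Token-by-token decoding incurs high latency, and the fixed left-to-right generation order lets early prediction errors propagate and makes them hard to revise. To overcome these limits, the paper proposes DiffuRank, whose core idea is to use diffusion language models (dLLMs) and exploit their non-autoregressive, parallel decoding for more efficient and controllable reranking. The key is the design of three dLLM-based reranking methods: pointwise relevance estimation, logit-based listwise joint assessment, and permutation-based listwise decoding, each paired with a training scheme that makes full use of dLLMs. Experiments show that, in both zero-shot and fine-tuned settings, the approach matches and in some cases exceeds autoregressive LLMs, supporting diffusion as a viable new paradigm for document reranking.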

Link: https://arxiv.org/abs/2602.12528
Authors: Qi Liu,Kun Ai,Jiaxin Mao,Yanzhao Zhang,Mingxin Li,Dingkun Long,Pengjun Xie,Fengbin Zhu,Ji-Rong Wen
Affiliations: Renmin University of China; Alibaba Group; National University of Singapore
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: The code is available at this https URL

Abstract:Recent advances in large language models (LLMs) have inspired new paradigms for document reranking. While this paradigm better exploits the reasoning and contextual understanding capabilities of LLMs, most existing LLM-based rerankers rely on autoregressive generation, which limits their efficiency and flexibility. In particular, token-by-token decoding incurs high latency, while the fixed left-to-right generation order causes early prediction errors to propagate and is difficult to revise. To address these limitations, we explore the use of diffusion language models (dLLMs) for document reranking and propose DiffuRank, a reranking framework built upon dLLMs. Unlike autoregressive models, dLLMs support more flexible decoding and generation processes that are not constrained to a left-to-right order, and enable parallel decoding, which may lead to improved efficiency and controllability. Specifically, we investigate three reranking strategies based on dLLMs: (1) a pointwise approach that uses dLLMs to estimate the relevance of each query-document pair; (2) a logit-based listwise approach that prompts dLLMs to jointly assess the relevance of multiple documents and derives ranking lists directly from model logits; and (3) a permutation-based listwise approach that adapts the canonical decoding process of dLLMs to the reranking tasks. For each approach, we design corresponding training methods to fully exploit the advantages of dLLMs. We evaluate both zero-shot and fine-tuned reranking performance on multiple benchmarks. Experimental results show that dLLMs achieve performance comparable to, and in some cases exceeding, that of autoregressive LLMs with similar model sizes. These findings demonstrate the promise of diffusion-based language models as a compelling alternative to autoregressive architectures for document reranking.

[IR-13] Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search SIGIR2026

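[Quick read]: This paper addresses the high indexing and search cost of multi-vector visual retrieval systems, where each document page yields thousands of vectors and scalability suffers as a result. The key to the solution is a training-free, model-aware static spatial pooling strategy: patch embeddings are compressed with methods such as lightweight sliding-window averaging to produce compact tile-level and global representations for fast candidate generation, followed by exact MaxSim reranking with the full multi-vector embeddings. This design cuts the number of stored vectors per page from thousands to dozens, gives a quadratic reduction in vector-to-vector comparisons, and substantially raises throughput (about 4x QPS), while keeping NDCG and Recall@5/10 on the ViDoRe v2 benchmark almost unchanged and requiring no post-training, adapters, or distillation.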

Link: https://arxiv.org/abs/2602.12510
Authors: Ara Yeroyan
Affiliations: Independent Researcher
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures. Submitted to SIGIR 2026 Demonstrations Track. Project website: this https URL

Abstract: Multi-vector visual retrievers (e.g., ColPali-style late interaction models) deliver strong accuracy, but scale poorly because each page yields thousands of vectors, making indexing and search increasingly expensive. We present Visual RAG Toolkit, a practical system for scaling visual multi-vector retrieval with training-free, model-aware pooling and multi-stage retrieval. Motivated by Matryoshka Embeddings, our method performs static spatial pooling - including a lightweight sliding-window averaging variant - over patch embeddings to produce compact tile-level and global representations for fast candidate generation, followed by exact MaxSim reranking using full multi-vector embeddings. Our design yields a quadratic reduction in vector-to-vector comparisons by reducing stored vectors per page from thousands to dozens, notably without requiring post-training, adapters, or distillation. Across experiments with interaction-style models such as ColPali and ColSmol-500M, we observe that over the limited ViDoRe v2 benchmark corpus 2-stage retrieval typically preserves NDCG and Recall@5/10 with minimal degradation, while substantially improving throughput (approximately 4x QPS), with sensitivity mainly at very large k. The toolkit additionally provides robust preprocessing - high resolution PDF to image conversion, optional margin/empty-region cropping and token hygiene (indexing only visual tokens) - and a reproducible evaluation pipeline, enabling rapid exploration of two-, three-, and cascaded retrieval variants. By emphasizing efficiency at common cutoffs (e.g., k = 10), the toolkit lowers hardware barriers and makes state-of-the-art visual retrieval more accessible in practice.
ACM classes: H.3.3
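
The pool-then-rerank pipeline described above can be sketched in a few lines. This is a simplified illustration, not the toolkit's code: pooling here runs over a flattened 1-D patch sequence rather than a 2-D grid, and the function names (`mean_pool_windows`, `two_stage_search`) and toy vectors are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector take its best match, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def mean_pool_windows(patches, window, stride):
    """Sliding-window average over a sequence of patch embeddings."""
    pooled = []
    for start in range(0, max(len(patches) - window, 0) + 1, stride):
        chunk = patches[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(v[i] for v in chunk) / len(chunk) for i in range(dim)])
    return pooled

def two_stage_search(query_vecs, pages, window, stride, n_candidates, k):
    """Stage 1: rank pages by MaxSim over pooled vectors. Stage 2: exact MaxSim rerank."""
    coarse = sorted(
        pages,
        key=lambda pid: maxsim(query_vecs, mean_pool_windows(pages[pid], window, stride)),
        reverse=True,
    )[:n_candidates]
    return sorted(coarse, key=lambda pid: maxsim(query_vecs, pages[pid]), reverse=True)[:k]

pages = {
    "p1": [[1, 0], [1, 0], [0, 1], [0, 1]],
    "p2": [[0, 1], [0, 1], [0, 1], [0, 1]],
    "p3": [[-1, 0], [-1, 0], [-1, 0], [-1, 0]],
}
print(two_stage_search([[1, 0]], pages, window=2, stride=2, n_candidates=2, k=1))
```

Stage 1 compares the query against far fewer vectors per page, so the expensive exact MaxSim only runs on the shortlisted candidates.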

[IR-14] Latent Customer Segmentation and Value-Based Recommendation Leveraging a Two-Stage Model with Missing Labels

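[Quick read]: This paper addresses low customer conversion efficiency in marketing. The core challenge is to identify the customers who are truly driven by a campaign, and to avoid mistaking highly engaged organic users for ideal targets, which wastes resources and dilutes brand value. The key to the solution is a two-stage multi-model architecture. The first stage uses a multi-class neural network to distinguish campaign-influenced, organically engaged, and low-engagement customers. The second stage applies a binary label correction model built on a missing-label framework to identify true campaign-driven intent, yielding intent-aware segmentation of user behavior. By separating prompted engagement from organic behavior, the method improves targeting precision and conversion efficiency, and A/B testing shows an improvement of more than 100 basis points in key metrics.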

Link: https://arxiv.org/abs/2602.12485
Authors: Keerthi Gopalakrishnan,Tianning Dong,Chia-Yen Ho,Yokila Arora,Topojoy Biswas,Jason Cho,Sushant Kumar,Kannan Achan
Affiliations: Walmart Global Tech
Categories: Information Retrieval (cs.IR)
Comments:

Abstract: The success of businesses depends on their ability to convert consumers into loyal customers. A customer’s value proposition is a primary determinant in this process, requiring a balance between affordability and long-term brand equity. Broad marketing campaigns can erode perceived brand value and reduce return on investment, while existing economic algorithms often misidentify highly engaged customers as ideal targets, leading to inefficient engagement and conversion outcomes. This work introduces a two-stage multi-model architecture employing Self-Paced Loss to improve customer categorization. The first stage uses a multi-class neural network to distinguish customers influenced by campaigns, organically engaged customers, and low-engagement customers. The second stage applies a binary label correction model to identify true campaign-driven intent using a missing-label framework, refining customer segmentation during training. By separating prompted engagement from organic behavior, the system enables more precise campaign targeting, reduces exposure costs, and improves conversion efficiency. A/B testing demonstrates over 100 basis points improvement in key success metrics, highlighting the effectiveness of intent-aware segmentation for value-driven marketing strategies.
Journal reference: Companion Proceedings of the ACM Web Conference 2025 (WWW Companion 25), ACM, 2025. Related DOI: https://doi.org/10.1145/3701716.3715243

[IR-15] An Industrial-Scale Sequential Recommender for LinkedIn Feed Ranking

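[Quick read]: This paper addresses content ranking for the LinkedIn Feed, namely how to raise member engagement and recommendation quality under strict production constraints. The existing DCNv2-based ranker has hit a performance bottleneck and struggles to support large-scale real-time recommendation. The key to the solution is Feed Sequential Recommender (Feed-SR), a transformer-based sequential ranking model. Through careful modeling choices, efficient training techniques, and serving optimizations, it keeps latency low and throughput high while significantly increasing time spent (+2.10% in online A/B tests), and it is now the primary ranking model for the LinkedIn Feed.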

Link: https://arxiv.org/abs/2602.12354
Authors: Lars Hertel,Gaurav Srivastava,Syed Ali Naqvi,Satyam Kumar,Yue Zhang,Borja Ocejo,Benjamin Zelditch,Adrian Englhardt,Hailing Cheng,Andy Hu,Antonio Alonso,Daming Li,Siddharth Dangi,Chen Zhu,Mingzhou Zhou,Wanning Li,Tao Huang,Fedor Borisyuk,Ganesh Parameswaran,Birjodh Singh Tiwana,Sriram Sankar,Qing Lan,Julie Choi,Souvik Ghosh
Affiliations: LinkedIn Inc.
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:LinkedIn Feed enables professionals worldwide to discover relevant content, build connections, and share knowledge at scale. We present Feed Sequential Recommender (Feed-SR), a transformer-based sequential ranking model for LinkedIn Feed that replaces a DCNv2-based ranker and meets strict production constraints. We detail the modeling choices, training techniques, and serving optimizations that enable deployment at LinkedIn scale. Feed-SR is currently the primary member experience on LinkedIn’s Feed and shows significant improvements in member engagement (+2.10% time spent) in online A/B tests compared to the existing production model. We also describe our deployment experience with alternative sequential and LLM-based ranking architectures and why Feed-SR provided the best combination of online metrics and production efficiency.

[IR-16] AgenticShop: Benchmarking Agentic Product Curation for Personalized Web Shopping WWW2026

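[Quick read]: This paper addresses the inadequate evaluation of agentic systems for personalized product recommendation in open-web settings, in particular the limited coverage of shopping scenarios and the lack of attention to user preference adaptation in existing benchmarks. The key to the solution is AgenticShop, the first benchmark for personalized product curation in open-web environments. Its core contributions are realistic shopping scenarios, diverse user profiles, and a verifiable, checklist-driven personalization evaluation framework, which together measure how well agentic systems adapt and recommend in complex, varied online shopping environments.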

Link: https://arxiv.org/abs/2602.12315
Authors: Sunghwan Kim,Ryang Heo,Yongsik Seo,Jinyoung Yeo,Dongha Lee
Affiliations: Yonsei University; ParamitaAI
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted at WWW 2026

Abstract: The proliferation of e-commerce has made web shopping platforms key gateways for customers navigating the vast digital marketplace. Yet this rapid expansion has led to a noisy and fragmented information environment, increasing cognitive burden as shoppers explore and purchase products online. With promising potential to alleviate this challenge, agentic systems have garnered growing attention for automating user-side tasks in web shopping. Despite significant advancements, existing benchmarks fail to comprehensively evaluate how well agentic systems can curate products in open-web settings. Specifically, they have limited coverage of shopping scenarios, focusing only on simplified single-platform lookups rather than exploratory search. Moreover, they overlook personalization in evaluation, leaving unclear whether agents can adapt to diverse user preferences in realistic shopping contexts. To address this gap, we present AgenticShop, the first benchmark for evaluating agentic systems on personalized product curation in open-web environments. Crucially, our approach features realistic shopping scenarios, diverse user profiles, and a verifiable, checklist-driven personalization evaluation framework. Through extensive experiments, we demonstrate that current agentic systems remain largely insufficient, emphasizing the need for user-side systems that effectively curate tailored products across the modern web.

[IR-17] Beyond Musical Descriptors: Extracting Preference-Bearing Intent in Music Queries

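[Quick read]: This paper addresses the neglect of user intent in current music recommendation: existing annotated music descriptor datasets focus on surface features and do not capture the preference-bearing roles that descriptors play in a user's request. The key to the solution is MusicRecoIntent, a manually annotated corpus of 2,291 Reddit music requests in which musical descriptors across seven categories are labeled with positive, negative, or referential preference-bearing roles, enabling fine-grained modeling of user intent. The paper also evaluates how well large language models (LLMs) extract these descriptors, finding that they handle explicit descriptors well but struggle with context-dependent ones, which provides a benchmark and a direction for improving LLM-based music understanding systems.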

Link: https://arxiv.org/abs/2602.12301
Authors: Marion Baranes,Romain Hennequin,Elena V. Epure
Affiliations: Deezer Research; Idiap Research Institute
Categories: Sound (cs.SD); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at NLP4MusA 2026 (4th Workshop on NLP for Music and Audio)

Abstract:Although annotated music descriptor datasets for user queries are increasingly common, few consider the user’s intent behind these descriptors, which is essential for effectively meeting their needs. We introduce MusicRecoIntent, a manually annotated corpus of 2,291 Reddit music requests, labeling musical descriptors across seven categories with positive, negative, or referential preference-bearing roles. We then investigate how reliably large language models (LLMs) can extract these music descriptors, finding that they do capture explicit descriptors but struggle with context-dependent ones. This work can further serve as a benchmark for fine-grained modeling of user intent and for gaining insights into improving LLM-based music understanding systems.

[IR-18] Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation

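[Quick read]: This paper addresses two core problems in multi-modal recommendation. First, existing semantic-ID discretization methods (e.g., RQ-VAE) do not disentangle shared cross-modal semantics from modality-specific details, causing redundancy or collapse. Second, standard Transformers treat semantic IDs as flat sequences and ignore the hierarchy of user interactions, items, and tokens, which amplifies sequence length and noise and biases attention toward local details over holistic semantics. The key to the solution is the Hi-SAM framework, with two designs: (1) a Disentangled Semantic Tokenizer (DST) that unifies modalities through geometry-aware alignment and a coarse-to-fine quantization strategy, where shared codebooks distill consensus, modality-specific codebooks recover details from residuals, and mutual information minimization enforces disentanglement; and (2) a Hierarchical Memory-Anchor Transformer (HMAT) that uses Hierarchical RoPE to split positional encoding into inter-item and intra-item subspaces and inserts Anchor Tokens to condense items into compact memory, retaining details of the current item while accessing history only through compressed summaries. This restores the hierarchy and makes long-sequence modeling more efficient.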

Link: https://arxiv.org/abs/2602.11799
Authors: Pingjun Pan,Tingting Zhou,Peiyao Lu,Tingting Fei,Hongxiang Chen,Chuanjiang Luo
Affiliations: NetEase Cloud Music
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data Mismatch: vanilla Transformers treat semantic IDs as flat streams, ignoring the hierarchy of user interactions, items, and tokens. Expanding items into multiple tokens amplifies length and noise, biasing attention toward local details over holistic semantics. We propose Hi-SAM, a Hierarchical Structure-Aware Multi-modal framework with two designs: (1) Disentangled Semantic Tokenizer (DST): unifies modalities via geometry-aware alignment and quantizes them via a coarse-to-fine strategy. Shared codebooks distill consensus while modality-specific ones recover nuances from residuals, enforced by mutual information minimization; (2) Hierarchical Memory-Anchor Transformer (HMAT): splits positional encoding into inter- and intra-item subspaces via Hierarchical RoPE to restore hierarchy. It inserts Anchor Tokens to condense items into compact memory, retaining details for the current item while accessing history only through compressed summaries. Experiments on real-world datasets show consistent improvements over SOTA baselines, especially in cold-start scenarios. Deployed on a large-scale social platform serving millions of users, Hi-SAM achieved a 6.55% gain in the core online metric.
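
The hierarchy that HMAT is said to restore can be illustrated with a small sketch of two-level position indices and an anchor-based visibility rule. This is a guess at the general shape, assuming one anchor token appended after each item's tokens and a causal mask. The layout, the names `build_layout` and `can_attend`, and the exact mask rule are hypothetical, not taken from the paper.

```python
def build_layout(tokens_per_item):
    """Return one (item_index, intra_item_index, is_anchor) triple per sequence position.

    item_index would drive the inter-item positional subspace and
    intra_item_index the intra-item subspace in a hierarchical RoPE.
    Each item's tokens are followed by a single anchor token (assumed layout).
    """
    layout = []
    for item_idx, n_tokens in enumerate(tokens_per_item):
        for tok_idx in range(n_tokens):
            layout.append((item_idx, tok_idx, False))
        layout.append((item_idx, n_tokens, True))  # anchor summarising the item
    return layout

def can_attend(layout, i, j):
    """Causal rule: position i sees j if j is not in the future and either
    belongs to the same item or is the anchor of an earlier item."""
    if j > i:
        return False
    item_i, _, _ = layout[i]
    item_j, _, is_anchor_j = layout[j]
    return item_i == item_j or is_anchor_j

layout = build_layout([2, 2])
for i in range(len(layout)):
    print("".join("x" if can_attend(layout, i, j) else "." for j in range(len(layout))))
```

Under this rule, tokens of the second item see the first item only through its anchor, so the attended context grows with the number of items rather than the number of tokens.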

[IR-19] Nationwide Hourly Population Estimating at the Neighborhood Scale in the United States Using Stable-Attendance Anchor Calibration

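[Quick read]: This paper addresses two problems: traditional population data are static and cannot capture the spatiotemporal dynamics of human presence driven by daily mobility, and opportunistic smartphone mobility data, with incomplete sensing, spatially heterogeneous device penetration, and unstable observation processes, are hard to convert into accurate population estimates. The key to the solution is the Stable-Attendance Anchor Calibration (SAAC) framework, which models population estimation as a balance-based accounting problem that combines residential population with time-varying inbound and outbound flows inferred from device events. It uses locations with highly stable attendance (high schools in this study) as calibration anchors to estimate observation scaling factors that correct for under-recorded mobility events, enabling a stable conversion from observed device events to fine-grained hourly population presence.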

Link: https://arxiv.org/abs/2602.12291
Authors: Huan Ning,Zhenlong Li,Manzhu Yu,Xiao Huang,Shiyan Zhang,Shan Qiao
Affiliations: Unknown
Categories: Applications (stat.AP); Information Retrieval (cs.IR)
Comments:

Abstract:Traditional population datasets are largely static and therefore unable to capture the strong temporal dynamics of human presence driven by daily mobility. Recent smartphone-based mobility data offer unprecedented spatiotemporal coverage, yet translating these opportunistic observations into accurate population estimates remains challenging due to incomplete sensing, spatially heterogeneous device penetration, and unstable observation processes. We propose a Stable-Attendance Anchor Calibration (SAAC) framework to reconstruct hourly population presence at the Census block group level across the United States. SAAC formulates population estimation as a balance-based population accounting problem, combining residential population with time-varying inbound and outbound mobility inferred from device-event observations. To address observation bias and identifiability limitations, the framework leverages locations with highly regular attendance as calibration anchors, using high schools in this study. These anchors enable estimation of observation scaling factors that correct for under-recorded mobility events. By integrating anchor-based calibration with an explicit sampling model, SAAC enables consistent conversion from observed device events to population presence at fine temporal resolution. The inferred population patterns are consistent with established empirical findings in prior mobility and urban population studies. SAAC provides a generalizable framework for transforming large-scale, biased digital trace data into interpretable dynamic population products, with implications for urban science, public health, and human mobility research. The hourly population estimates can be accessed at: this https URL.
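
The balance-based accounting idea can be written out as a short worked example. The sketch assumes a single scaling factor derived from one anchor and a running balance of scaled inbound minus outbound events; the paper's explicit sampling model is not described in the abstract, so the formulas and names below are illustrative assumptions only.

```python
def scaling_factor(known_attendance, observed_devices):
    """People represented by each observed device event, estimated at an anchor
    location where true attendance is known (e.g. a high school's enrollment)."""
    return known_attendance / observed_devices

def hourly_presence(residents, inbound_events, outbound_events, scale):
    """Running population balance: residents plus scaled net inflow, hour by hour."""
    presence = []
    current = float(residents)
    for inbound, outbound in zip(inbound_events, outbound_events):
        current += scale * (inbound - outbound)
        presence.append(max(current, 0.0))  # a population cannot be negative
    return presence

# Anchor: 1200 students known to attend, 60 devices observed arriving.
scale = scaling_factor(1200, 60)
print(hourly_presence(1000, [5, 10, 0], [20, 5, 10], scale))
```

With `scale = 20.0`, each observed device event stands for about 20 people, so an hour with 5 arrivals and 20 departures lowers the estimate by 300.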

Human-Computer Interaction

[HC-0] The Fuzzy Front Ends: Reflections on the Never-Ending Story of Visualization Co-Design

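[Quick read]: This paper addresses the lack of practical guidance on applying co-design in visualization research, especially in community-engaged visualization projects. The key to the approach is a two-and-a-half-year, iterative co-design practice with a local arts community, which produced visualization prototypes tailored to real needs, and through which the authors identified and reflected on three "fuzzy front end" stages, that is, highly uncertain phases such as initial exploration, understanding disagreements, and reframing goals. The authors present this evolving process through comic-style visual storytelling, offering experience and a reflective frame that others can draw on for community-engaged design in visualization.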

Link: https://arxiv.org/abs/2602.13182
Authors: Wei Wei,Foroozan Daneshzand,Zezhong Wang,Erica Mattson,Charles Perin,Sheelagh Carpendale
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Co-design is an increasingly popular approach in HCI and visualization, yet there is little guidance on how to effectively apply this method in visualization contexts. In this paper, we visually present our experience of a two-and-a-half-year co-design project with the local arts community. Focusing on facilitating community exploration and sense-making around arts funding distribution, the project involved a series of co-design sessions between visualization researchers and members of the arts community. Through these iterative sessions, we built shared understanding and developed visualization prototypes tailored to community needs. However, the practice is far from complete, and we found ourselves continually returning to the “fuzzy front end” of the co-design process. We share this ongoing story through comic-style visuals and reflect on three fuzzy front ends that we encountered during the project. By sharing these experiences with the visualization community, we hope to offer insights that others can draw on in their own community-engaged co-design work.

[HC-1] Preference-Guided Prompt Optimization for Text-to-Image Generation

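[Quick read]: This paper addresses the unpredictable output and inefficient optimization that arise in practical use of generative AI because user prompts are hard to control precisely. Existing methods either depend on substantial human effort or optimize numerical functions, which does not suit user-centered tasks where feedback is mostly binary preference, and they offer no guarantee of converging within few iterations. The key to the solution is APPO (Preference-guided Prompt Optimization), an algorithm that needs only binary preference feedback from the user and adaptively balances exploiting that feedback against exploring new directions. It reaches high-quality results in fewer iterations, lowering cognitive load and improving human-AI collaboration.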

Link: https://arxiv.org/abs/2602.13131
Authors: Zhipeng Li,Yi-Chi Liao,Christian Holz
Affiliations: ETH Zürich
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Generative models are increasingly powerful, yet users struggle to guide them through prompts. The generative process is difficult to control and unpredictable, and user instructions may be ambiguous or under-specified. Prior prompt refinement tools heavily rely on human effort, while prompt optimization methods focus on numerical functions and are not designed for human-centered generative tasks, where feedback is better expressed as binary preferences and demands convergence within few iterations. We present APPO, a preference-guided prompt optimization algorithm. Instead of iterating prompts, users only provide binary preferential feedback. APPO adaptively balances its strategies between exploiting user feedback and exploring new directions, yielding effective and efficient optimization. We evaluate APPO on image generation, and the results show APPO enables achieving satisfactory outcomes in fewer iterations with lower cognitive load than manual prompt editing. We anticipate APPO will advance human-AI collaboration in generative tasks by leveraging user preferences to guide complex content creation.

[HC-2] Automating UI Optimization through Multi-Agentic Reasoning

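[Quick read]: This paper addresses the reliance of adaptive user interface (UI) optimization on manual inspection of layouts and on population-average parameters for defining objectives. Traditional methods often fail to respond precisely to individual needs and lack a dynamic way to model the multi-objective problem. The key to the solution is AutoOptimization, a multi-objective optimization framework whose core idea is a prioritization-based Pareto frontier search: it turns a user's verbal preferences into executable objective functions and automatically parameterizes them to match the user's instructions. The framework also chains several agents in sequence to interpret user intent, configure the optimization problem, and validate the results, giving an end-to-end automated path from user input to the final UI layout.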

Link: https://arxiv.org/abs/2602.13126
Authors: Zhipeng Li,Christoph Gebhardt,Yi-Chi Liao,Christian Holz
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:We present AutoOptimization, a novel multi-objective optimization framework for adapting user interfaces. From a user’s verbal preferences for changing a UI, our framework guides a prioritization-based Pareto frontier search over candidate layouts. It selects suitable objective functions for UI placement while simultaneously parameterizing them according to the user’s instructions to define the optimization problem. A solver then generates a series of optimal UI layouts, which our framework validates against the user’s instructions to adapt the UI with the final solution. Our approach thus overcomes the previous need for manual inspection of layouts and the use of population averages for objective parameters. We integrate multiple agents sequentially within our framework, enabling the system to leverage their reasoning capabilities to interpret user preferences, configure the optimization problem, and validate optimization outcomes.
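
A Pareto frontier search with priorities can be sketched as a non-dominated filter followed by a weighted pick. This is a generic illustration rather than the paper's solver: the two cost objectives, the linear priority weights, and the names `pareto_front` and `pick_by_priority` are assumptions.

```python
def dominates(a, b):
    """a dominates b if it is no worse on every cost and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep only layouts that no other layout dominates (all costs are minimised)."""
    return [
        c for c in candidates
        if not any(dominates(other["costs"], c["costs"]) for other in candidates)
    ]

def pick_by_priority(front, weights):
    """Choose the frontier layout with the lowest priority-weighted cost."""
    return min(front, key=lambda c: sum(w * x for w, x in zip(weights, c["costs"])))

# Hypothetical costs per layout: (reach effort, occlusion of the scene).
layouts = [
    {"name": "A", "costs": (1.0, 4.0)},
    {"name": "B", "costs": (2.0, 2.0)},
    {"name": "C", "costs": (3.0, 3.0)},
    {"name": "D", "costs": (4.0, 1.0)},
]
front = pareto_front(layouts)
print([c["name"] for c in front], pick_by_priority(front, (0.8, 0.2))["name"])
```

In the framework described above, the weights would come from interpreting the user's verbal preferences instead of being fixed by hand.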

[HC-3] "It's More of a Lifestyle": Design Considerations for Supporting Everyday Practices in Community-Based Farming

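[Quick read]: This paper addresses the challenges faced by community-based small farms, in particular gaps in knowledge transfer and limited access to resources caused by limited formal education and underdeveloped infrastructure. The key to the approach is to surface and strengthen the social capital that already exists in the community, especially "bonding" social capital rooted in close family and neighborhood ties. Supporting informal knowledge exchange helps farmers build "bridging" and "linking" capital with broader networks, resources, and institutions, which in turn informs the design of supporting tools that fit local cultural values and practices.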

Link: https://arxiv.org/abs/2602.13119
Authors: Minghe Lu,Zhanming Chen,May Sunmin Hwang,Ji Youn Shin
Affiliations: University of Minnesota
Categories: Human-Computer Interaction (cs.HC)
Comments: 31 pages, 8 figures, conference

Abstract:Farming plays a significant role in the economy by supporting related industries such as food, retail, and local services. Community-based small farms, while offering unique social and cultural benefits, face persistent challenges, including limited access to formal education and underdeveloped infrastructure, which have been discussed in prior research. This study focuses on community-driven factors, such as workarounds for recording critical information and practices for passing down farming knowledge across generations. Through 11 semi-structured interviews with farmers from a small ethnic community, the Hmong, we explore how bonding social capital, rooted in close family and community ties, supports informal knowledge exchange and creates pathways to bridging and linking capital. These relationships help farmers connect to broader networks, resources, and institutions. Our findings highlight opportunities for designing technologies that support and strengthen existing support systems. We discuss how technologies should be designed to reflect the cultural values, unique practices, and intergenerational relationships embedded in community-based farms.

[HC-4] Resource-Efficient Gesture Recognition through Convexified Attention

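[Quick read]: This paper addresses the difficulty of deploying gesture recognition in wearable e-textile interfaces, where low power budgets, small compute capacity, and compact form factors make conventional deep learning impractical. The key to the solution is a convexified attention mechanism that replaces the usual non-convex softmax with a nonexpansive simplex projection and convex loss functions (such as multi-class hinge loss). This preserves global convergence guarantees while greatly reducing model complexity: only 120–360 parameters (a 97% reduction compared with conventional approaches), sub-millisecond inference (290–296 μs), and very small storage (<7KB), so accurate gesture recognition runs directly on the textile sensor without an external processing unit.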

Link: https://arxiv.org/abs/2602.13030
Authors: Daniel Schwartz,Dario Salvucci,Yusuf Osmanlioglu,Richard Vallett,Genevieve Dion,Ali Shokoufandeh
Affiliations: Drexel University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 22 pages, 3 figures, EICS 2026

Abstract: Wearable e-textile interfaces require gesture recognition capabilities but face severe constraints in power consumption, computational capacity, and form factor that make traditional deep learning impractical. While lightweight architectures like MobileNet improve efficiency, they still demand thousands of parameters, limiting deployment on textile-integrated platforms. We introduce a convexified attention mechanism for wearable applications that dynamically weights features while preserving convexity through nonexpansive simplex projection and convex loss functions. Unlike conventional attention mechanisms using non-convex softmax operations, our approach employs Euclidean projection onto the probability simplex combined with multi-class hinge loss, ensuring global convergence guarantees. Implemented on a textile-based capacitive sensor with four connection points, our approach achieves 100.00% accuracy on tap gestures and 100.00% on swipe gestures (consistent across 10-fold cross-validation and held-out test evaluation) while requiring only 120–360 parameters, a 97% reduction compared to conventional approaches. With sub-millisecond inference times (290–296 μs) and minimal storage requirements (<7KB), our method enables gesture interfaces directly within e-textiles without external processing. Our evaluation, conducted in controlled laboratory conditions with a single-user dataset, demonstrates feasibility for basic gesture interactions. Real-world deployment would require validation across multiple users, environmental conditions, and more complex gesture vocabularies. These results demonstrate how convex optimization can enable efficient on-device machine learning for textile interfaces.
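
The two building blocks named in the abstract, Euclidean projection onto the probability simplex and multi-class hinge loss, are standard and can be sketched directly. The projection below uses the well-known sort-based algorithm, and the hinge loss is the Crammer-Singer form. How the paper wires them into its attention layer is not specified in the abstract, so the `attend` helper is a hypothetical illustration.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:  # the condition holds for a prefix; keep the last threshold
            theta = t
    return [max(x - theta, 0.0) for x in v]

def attend(features, raw_scores):
    """Weight scalar features by simplex-projected scores (stand-in for softmax)."""
    weights = project_to_simplex(raw_scores)
    return sum(w * f for w, f in zip(weights, features))

def multiclass_hinge(scores, label):
    """Crammer-Singer hinge loss: penalise if the true class does not win by margin 1."""
    best_other = max(s for j, s in enumerate(scores) if j != label)
    return max(0.0, 1.0 + best_other - scores[label])

print(project_to_simplex([2.0, 0.0]), multiclass_hinge([2.0, 0.5, 0.0], 1))
```

Unlike softmax, the projection can assign exactly zero weight to low-scoring features, which is one reason it is attractive for very small models.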

[HC-5] GroundLink: Exploring How Contextual Meeting Snippets Can Close Common Ground Gaps in Editing 3D Scenes for Virtual Production

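[Quick read]: This paper addresses the difficulty Virtual Production (VP) professionals have in accessing tacit knowledge and creative intent during teamwork, which hinders the formation of common ground among team members and in turn hurts the efficiency and quality of collaboration. The key to the solution is GroundLink, a Unity add-on with three core features: (1) a meeting knowledge dashboard for capturing and reviewing decisions and comments; (2) constraint-aware feedforward that proactively passes relevant information to the editing environment; and (3) cross-modal synchronization that provides referential links between the dashboard and the editor. In the comparative study, this design helped team members establish common ground while editing 3D scenes and improved perceived confidence and ease of editing.

Link: https://arxiv.org/abs/2602.12987
Authors: Gun Woo (Warren) Park, Frederik Brudy, George Fitzmaurice, Fraser Anderson
Affiliations: Autodesk Research; University of Toronto
Categories: Human-Computer Interaction (cs.HC)
Comments: 27 pages, 13 figures, to appear at CHI 2026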

Abstract:Virtual Production (VP) professionals often face challenges accessing tacit knowledge and creative intent, which are important in forming common ground with collaborators and in contributing more effectively and efficiently to the team. From our formative study (N=23) with a follow-up interview (N=6), we identified the significance and prevalence of this challenge. To help professionals access knowledge, we present GroundLink, a Unity add-on that surfaces meeting-derived knowledge directly in the editor to support establishing common ground. It features a meeting knowledge dashboard for capturing and reviewing decisions and comments, constraint-aware feedforward that proactively informs the editor environment, and cross-modal synchronization that provides referential links between the dashboard and the editor. A comparative study (N=12) suggested that GroundLink help users build common ground with their team while improving perceived confidence and ease of editing the 3D scene. An expert evaluation with VP professionals (N=5) indicated strong potential for GroundLink in real-world workflows.

[HC-6] Human Tool: An MCP-Style Framework for Human-Agent Collaboration

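[Quick read]: This paper addresses the imbalance of responsibilities that arises in human-AI collaboration as AI systems outperform humans on complex tasks: humans still carry key roles such as orchestration, validation, and decision oversight, yet struggle to participate effectively in AI-led workflows. The key to the solution is Human Tool, an MCP-style (Model Context Protocol) interface abstraction that models humans as tools an AI agent can invoke dynamically. Structured tool schemas (capabilities, information, and authority) let agents identify when human input is needed and reintegrate it efficiently, which improves collaboration efficiency while preserving human authority and makes the collaboration more balanced.

Link: https://arxiv.org/abs/2602.12953
Authors: Yuanrong Tang, Huiling Peng, Bingxi Zhao, Hengyang Ding, Hanchao Song, Tianhong Wang, Chen Zhong, Jiangtao Gong
Affiliations: Tsinghua University; Zhejiang University; Beijing Jiaotong University; The Hong Kong University of Science and Technology; Anhui University
Categories: Human-Computer Interaction (cs.HC)
Comments: 9 pages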

Abstract:Human-AI collaboration faces growing challenges as AI systems increasingly outperform humans on complex tasks, while humans remain responsible for orchestration, validation, and decision oversight. To address this imbalance, we introduce Human Tool, an MCP-style interface abstraction, building on recent Model Context Protocol designs, that exposes humans as callable tools within AI-led, proactive workflows. Here, “tool” denotes a coordination abstraction, not a reduction of human authority or responsibility. Building on LLM-based agent architectures, we operationalize Human Tool by modeling human contributions through structured tool schemas of capabilities, information, and authority. These schemas enable agents to dynamically invoke human input based on relative strengths and reintegrate it through efficient, natural interaction protocols. We validate the framework through controlled studies in both decision-making and creative tasks, demonstrating improved task performance, reduced human workload, and more balanced collaboration dynamics compared to baseline systems. Finally, we discuss implications for human-centered AI design, highlighting how MCP-style human tools enable strong AI leadership while amplifying uniquely human strengths.
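
The abstract describes tool schemas with three facets: capabilities, information, and authority. A rough sketch of what such a schema and a lookup over it might look like follows. Every name here (`HumanTool`, `pick_human`, the sample people and tags) is hypothetical; the paper's actual schema format is not given in the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HumanTool:
    """Hypothetical schema describing one human collaborator as a callable tool."""
    name: str
    capabilities: set[str] = field(default_factory=set)  # what the person is good at
    information: set[str] = field(default_factory=set)   # what only they know
    authority: set[str] = field(default_factory=set)     # what they may approve


def pick_human(tools, capability=None, information=None, authority=None):
    """Return the first human whose schema covers every stated need, else None."""
    for tool in tools:
        if capability is not None and capability not in tool.capabilities:
            continue
        if information is not None and information not in tool.information:
            continue
        if authority is not None and authority not in tool.authority:
            continue
        return tool
    return None


if __name__ == "__main__":
    team = [
        HumanTool("designer", capabilities={"visual critique"}),
        HumanTool("manager", information={"budget"}, authority={"approve spend"}),
    ]
    # An agent that needs sign-off on spending would route the request here.
    chosen = pick_human(team, authority="approve spend")
    print(chosen.name if chosen else "no human available")
```

Keeping authority as its own field reflects the abstract's point that "tool" is a coordination abstraction: the agent can request an approval but cannot grant it to itself.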

[HC-7] Never say never: Exploring the effects of available knowledge on agent persuasiveness in controlled physiotherapy motivation dialogues

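[Quick read]: This paper examines how Generative Social Agents (GSAs) can persuade effectively and responsibly in human-machine interaction when their output is regulated, focusing on physiotherapy motivation. The key idea is to control what background information about the user (such as age and past profession) the large language model (LLM) can access, which shifts the agent's expressiveness and assertiveness and thereby significantly raises perceived persuasiveness. Explicitly supplying individual patient characteristics improved the agent's behavioral fit, whereas contextual knowledge about physiotherapy benefits had no significant effect, possibly because the LLM already holds that knowledge. This suggests that matching the supplied information to what the model already knows is central to responsible communication.

Link: https://arxiv.org/abs/2602.12924
Authors: Stephan Vonschallen, Rahel Häusler, Theresa Schmiedel, Friederike Eyssel
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: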

Abstract:Generative Social Agents (GSAs) are increasingly impacting human users through persuasive means. On the one hand, they might motivate users to pursue personal goals, such as healthier lifestyles. On the other hand, they are associated with potential risks like manipulation and deception, which are induced by limited control over probabilistic agent outputs. However, as GSAs manifest communicative patterns based on available knowledge, their behavior may be regulated through their access to such knowledge. Following this approach, we explored persuasive ChatGPT-generated messages in the context of human-robot physiotherapy motivation. We did so by comparing ChatGPT-generated responses to predefined inputs from a hypothetical physiotherapy patient. In Study 1, we qualitatively analyzed 13 ChatGPT-generated dialogue scripts with varying knowledge configurations regarding persuasive message characteristics. In Study 2, third-party observers (N = 27) rated a selection of these dialogues in terms of the agent’s expressiveness, assertiveness, and persuasiveness. Our findings indicate that LLM-based GSAs can adapt assertive and expressive personality traits – significantly enhancing perceived persuasiveness. Moreover, persuasiveness significantly benefited from the availability of information about the patients’ age and past profession, mediated by perceived assertiveness and expressiveness. Contextual knowledge about physiotherapy benefits did not significantly impact persuasiveness, possibly because the LLM had inherent knowledge about such benefits even without explicit prompting. Overall, the study highlights the importance of empirically studying behavioral patterns of GSAs, specifically in terms of what information generative AI systems require for consistent and responsible communication.

[HC-8] Comparative Study of Ultrasound Shape Completion and CBCT-Based AR Workflows for Spinal Needle Interventions

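[Quick read]: This paper asks how augmented reality (AR)-guided imaging workflows can improve the efficiency, precision, and operator trust of planning and executing spinal needle interventions such as lumbar puncture and facet joint insertion. It compares two AR-guided workflows built on different imaging modalities: one based on ultrasound shape completion and one based on cone-beam computed tomography (CBCT). Both are integrated into an AR framework: CBCT supplies accurate global anatomy for planning and navigation, while ultrasound provides radiation-free, adaptive updates. CBCT guidance performed better on insertion time, placement error, and user trust; ultrasound is adaptive but limited by the accuracy of shape completion in deeper regions. The authors therefore propose hybrid AR guidance that combines global precision with intraoperative flexibility.

Link: https://arxiv.org/abs/2602.12920
Authors: Tianyu Song, Feng Li, Felix Pabst, Miruna-Alexandra Gafencu, Yuan Bi, Ulrich Eck, Nassir Navab
Affiliations: Technical University of Munich
Categories: Human-Computer Interaction (cs.HC)
Comments: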

Abstract:Purpose: This study compares two augmented reality (AR)-guided imaging workflows, one based on ultrasound shape completion and the other on cone-beam computed tomography (CBCT), for planning and executing lumbar needle interventions. The aim is to assess how imaging modality influences user performance, usability, and trust during AR-assisted spinal procedures. Methods: Both imaging systems were integrated into an AR framework, enabling in situ visualization and trajectory guidance. The ultrasound-based workflow combined AR-guided robotic scanning, probabilistic shape completion, and AR visualization. The CBCT-based workflow used AR-assisted scan volume planning, CBCT acquisition, and AR visualization. A between-subject user study was conducted and evaluated in two phases: (1) planning and image acquisition, and (2) needle insertion. Results: Planning time was significantly shorter with the CBCT-based workflow, while SUS, SEQ, and NASA-TLX were comparable between modalities. In the needle insertion phase, the CBCT-based workflow yielded marginally faster insertion times, lower placement error, and better subjective ratings with higher Trust. The ultrasound-based workflow achieved adequate accuracy for facet joint insertion, but showed larger errors for lumbar puncture, where reconstructions depended more heavily on shape completion. Conclusion: The findings indicate that both AR-guided imaging pipelines are viable for spinal intervention support. CBCT-based AR offers advantages in efficiency, precision, usability, and user confidence during insertion, whereas ultrasound-based AR provides adaptive, radiation-free imaging but is limited by shape completion in deeper spinal regions. These complementary characteristics motivate hybrid AR guidance that uses CBCT for global anatomical context and planning, augmented by ultrasound for adaptive intraoperative updates. 

[HC-9] Reflection at Design Actualization (RDA): A Tool and Process For Research Through Game Design

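[Quick read]: This paper targets two problems in game design research: the difficulty of capturing small design decisions, where rich tacit knowledge sits, and the lack of tools for visually tracking a project's growth and evolution. The core of the solution is Reflection at Design Actualization (RDA), an open-source tool and process that collects fine-grained reflections at playtesting moments and automatically records the playtests. This brings reflection and data collection closer to the point where design decisions become concrete and makes the design process more traceable.

Link: https://arxiv.org/abs/2602.12887
Authors: Prabhav Bhatnagar, Jianheng He, Shamit Ahmed, Andrés Lucero, Perttu Hämäläinen
Affiliations: Aalto University; University College London
Categories: Human-Computer Interaction (cs.HC)
Comments: 24 pages, 7 figures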

Abstract:There is a growing interest in researching game design processes, artifacts and culture through active game design. Tools and processes to support these attempts are limited, especially in terms of a) capturing smaller design decisions where rich tacit information is often situated, and b) visually tracking the project’s growth and evolution. To address this gap, we present Reflection at Design Actualization (RDA), an open source tool and process for collecting granular reflections at playtesting moments and automatically recording the playtests, bringing reflection and data collection closer to the point where design decisions concretize. Three researchers engaged with and evaluated RDA in three varied game development projects, adhering to the principles of autobiographical design. We illustrate the designer experience with RDA through three themes, namely, designer-routine compromise, designer-researcher persona consolidation, and mirror effect of RDA. We further discuss the tool’s challenges and share each designer’s personal experience as case studies.

[HC-10] Knowledge-Based Design Requirements for Generative Social Robots in Higher Education

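[Quick read]: This paper addresses the unreliable behavior (for example hallucinations, overreliance, and privacy violations) that generative social robots (GSRs) can show in conversational tutoring in higher education when they lack a clearly specified knowledge base. Existing frameworks for educational technology and responsible AI mostly define desired behaviors without stating the knowledge prerequisites behind them. Taking a knowledge-based design perspective, the authors ran twelve semi-structured interviews with university students and lecturers and identified three types of knowledge requirements: self-knowledge (such as a customizable role and personality), user-knowledge (personalized information such as learning goals and emotional state), and context-knowledge (such as learning materials, course information, and the physical environment). This framework gives a structured foundation for designing and evaluating tutoring GSRs and aligns their capabilities with pedagogical and ethical expectations.

Link: https://arxiv.org/abs/2602.12873
Authors: Stephan Vonschallen, Dominique Oberle, Theresa Schmiedel, Friederike Eyssel
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: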

Abstract: Generative social robots (GSRs) powered by large language models enable adaptive, conversational tutoring but also introduce risks such as hallucinations, overreliance, and privacy violations. Existing frameworks for educational technologies and responsible AI primarily define desired behaviors, yet they rarely specify the knowledge prerequisites that enable generative systems to express these behaviors reliably. To address this gap, we adopt a knowledge-based design perspective and investigate what information tutoring-oriented GSRs require to function responsibly and effectively in higher education. Based on twelve semi-structured interviews with university students and lecturers, we identify twelve design requirements across three knowledge types: self-knowledge (assertive, conscientious and friendly personality with customizable role), user-knowledge (personalized information about student learning goals, learning progress, motivation type, emotional state and background), and context-knowledge (learning materials, educational strategies, course-related information, and physical learning environment). By identifying these knowledge requirements, this work provides a structured foundation for the design of tutoring GSRs and future evaluations, aligning generative system capabilities with pedagogical and ethical expectations.

[HC-11] Media Framing Moderates Risk-Benefit Perceptions and Value Tradeoffs in Human-Robot Collaboration

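[Quick read]: This paper addresses limited public acceptance of industrial human-robot collaboration (HRC) by examining how media framing moderates the way employees' perceived risks and benefits feed into the overall value they attribute to HRC. The key empirical findings are as follows. Under positive framing, risks and benefits predict value independently and additively (risks β = -0.52, benefits β = 0.45). Under negative framing, both effects are stronger (risks β = -0.69, benefits β = 0.63) and there is a significant negative interaction (β = -0.32), meaning that higher perceived risk weakens the positive effect of perceived benefits. Model fit was higher for the positive frame (R² = 0.715 vs. 0.583). Strategic message framing therefore shapes how risks and benefits are cognitively integrated and can influence acceptance of HRC.

Link: https://arxiv.org/abs/2602.12785
Authors: Philipp Brauner, Felix Glawe, Luisa Vervier, Martina Ziefle
Affiliations: RWTH Aachen University
Categories: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: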

Abstract:Public acceptance of industrial human-robot collaboration (HRC) is shaped by how risks and benefits are perceived by affected employees. Positive or negative media framing may shape and shift how individuals evaluate HRC. This study examines how message framing moderates the effects of perceived risks and perceived benefits on overall attributed value. In a pre-registered study, participants (N = 1150) were randomly assigned to read either a positively or negatively framed newspaper article in one of three industrial contexts (autonomy, employment, safety) about HRC in production. Subsequently, perceived risks, benefits, and value were measured using reliable and publicly available psychometric scales. Two multiple regressions (one per framing condition) tested for main and interaction effects. Framing influenced absolute evaluations of risk, benefits, and value. In both frames, risks and benefits significantly predicted attributed value. Under positive framing, only main effects were observed (risks: beta = -0.52; benefits: beta = 0.45). Under negative framing, both predictors had stronger main effects (risks: beta = -0.69; benefits: beta = 0.63) along with a significant negative interaction (beta = -0.32), indicating that higher perceived risk diminishes the positive effect of perceived benefits. Model fit was higher for the positive frame (R^2 = 0.715) than for the negative frame (R^2 = 0.583), indicating greater explained variance in value attributions. Framing shapes the absolute evaluation of HRC and how risks and benefits are cognitively integrated in trade-offs. Negative framing produces stronger but interdependent effects, whereas positive framing supports additive evaluations. These findings highlight the role of strategic communication in fostering acceptance of HRC and underscore the need to consider framing in future HRC research.
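
To see what the negative interaction means in practice, the reported coefficients can be plugged into a simple linear model. This is an illustrative sketch only: it assumes the betas are standardized coefficients on mean-centered predictors, that the interaction term is the product of risk and benefit, and it omits the intercept. None of this is stated in the abstract, and the function names are made up for the example.

```python
def predicted_value(risk, benefit, b_risk, b_benefit, b_interaction=0.0):
    """Value predicted by main effects plus an optional risk x benefit interaction."""
    return b_risk * risk + b_benefit * benefit + b_interaction * risk * benefit


def benefit_slope(risk, b_benefit, b_interaction=0.0):
    """Change in predicted value per unit of benefit at a given risk level."""
    return b_benefit + b_interaction * risk


# Coefficients reported in the abstract, one set per framing condition.
NEGATIVE = {"b_risk": -0.69, "b_benefit": 0.63, "b_interaction": -0.32}
POSITIVE = {"b_risk": -0.52, "b_benefit": 0.45}

if __name__ == "__main__":
    for risk in (-1.0, 0.0, 1.0):  # one SD below the mean, at the mean, one SD above
        neg = benefit_slope(risk, NEGATIVE["b_benefit"], NEGATIVE["b_interaction"])
        pos = benefit_slope(risk, POSITIVE["b_benefit"])
        print(f"risk={risk:+.0f} SD  negative frame: {neg:.2f}  positive frame: {pos:.2f}")
```

Under these assumptions, the benefit slope in the negative frame falls from about 0.95 at low perceived risk to about 0.31 at high perceived risk, while in the positive frame it stays at 0.45 regardless of risk, which is the additive pattern the abstract describes.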

[HC-12] iRULER: Intelligible Rubric-Based User-Defined LLM Evaluation for Revision

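[Quick read]: This paper addresses the problem that feedback from Large Language Models (LLMs) on writing is often unintelligible, generic, and not specific to the user's criteria. The key to the solution is iRULER, an interactive tool built on structured rubrics and designed around explicit guidelines: it scaffolds the review process with specific criteria, justifies each score selection, and offers actionable revisions that target different quality levels. iRULER also supports user-defined criteria by applying itself recursively with a rubric-of-rubrics to iteratively refine the rubric. In controlled experiments, iRULER improved LLM-judged revision scores and perceived helpfulness more than text-only LLM feedback or a read-only rubric.

Link: https://arxiv.org/abs/2602.12779
Authors: Jingwen Bai, Wei Soon Cheong, Philippe Muller, Brian Y Lim
Affiliations: National University of Singapore; IRIT, University of Toulouse
Categories: Human-Computer Interaction (cs.HC)
Comments: To Appear at CHI 2026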

Abstract: Large Language Models (LLMs) have become indispensable for evaluating writing. However, text feedback they provide is often unintelligible, generic, and not specific to user criteria. Inspired by structured rubrics in education and intelligible AI explanations, we propose iRULER following identified design guidelines to scaffold the review process by specific criteria, providing justification for score selection, and offering actionable revisions to target different quality levels. To qualify user-defined criteria, we recursively used iRULER with a rubric-of-rubrics to iteratively refine rubrics. In controlled experiments on writing revision and rubric creation, iRULER most improved validated LLM-judged review scores and was perceived as most helpful and aligned compared to read-only rubric and text-based LLM feedback. Qualitative findings further support how iRULER satisfies the design guidelines for user-defined feedback. This work contributes interactive rubric tools for intelligible LLM-based review and revision of writing, and user-defined rubric creation.

[HC-13] Usage Matters: The Role of Frequency, Duration, and Experience in Presence Formation in Social Virtual Reality

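[Quick read]: This paper asks how everyday usage behaviors (frequency of use, session duration, and years of VR experience) shape presence in social virtual reality (social VR). Based on a survey of 295 users that measured overall, social, spatial, and self-presence with validated scales, it finds that frequency and session duration consistently predict presence on every dimension and interact synergistically: frequent and long sessions together amplify the experience of "being there." The findings give a behavioral account of how presence forms outside the laboratory and offer empirical support for building more inclusive social VR environments.

Link: https://arxiv.org/abs/2602.12775
Authors: Qijia Chen, Andrea Bellucci, Giulio Jacucci
Affiliations: University of Helsinki; Universidad Carlos III de Madrid
Categories: Human-Computer Interaction (cs.HC)
Comments: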

Abstract:The sense of presence is central to immersive experiences in Virtual Reality (VR), and particularly salient in socially rich platforms like social VR. While prior studies have explored various aspects related to presence, less is known about how ongoing usage behaviors shape presence in everyday engagement. To address this gap, we examine whether usage intensity, captured through frequency of use, session duration, and years of VR experience, predicts presence in social VR. A survey of 295 users assessed overall, social, spatial, and self-presence using validated scales. Results show that both frequency and duration consistently predict higher presence across all dimensions, with interaction effects indicating that frequent and extended sessions synergistically amplify the experience of “being there.” These effects were stable across age and gender. Our findings extend presence research beyond the laboratory by identifying behavioral predictors in social VR and offer insights for building inclusive environments that reliably foster presence.

[HC-14] The Configuration of Space: Probing the Way Social Interaction and Perception are Affected by Task-Specific Spatial Representations in Online Video Communication

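[Quick read]: This paper asks how the spatial configuration of people in a 2D video chat interface affects attention allocation, social perception, and behavior in online communication, particularly in social support and topic-based discussion. It compares a conventional Gallery interface with a scene-based Room interface that places participants in a circular arrangement. Spatial layout changed participants' mental state and interaction strategies: in the Room format they attended more to the group as a whole and showed heightened self-awareness, and in the team face-off discussion they used spatial references to orient their allegiances, engaging more with those farther away and empathizing more with those closer. The work highlights the need to consider spatial configuration when designing 2D collaborative communication systems so that they fit the psychological needs of particular tasks.

Link: https://arxiv.org/abs/2602.12771
Authors: Yihuan Chen, Kexue Fu, Qianyi Chen, Zhicong Lu, Ray LC
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: vol 15805, Springer, Cham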

Abstract:Humans live and act in 3D space, but often work and communicate on 2D surfaces. The prevalence of online communication on 2D screens raises the issue of whether human spatial configuration affects our capabilities, social perception, and behaviors when interacting with others in 2D video chat. How do factors like location, setting, and context subtly shape our online communication, particularly in scenarios such as social support and topic-based discussions? Using this http URL as a platform, we compared a normal gallery interface with a scene-based Room-type interface where participants are located in circular arrangement on screen in a social support task, and found that participants allocated attention to the group as a whole, and had pronounced self-awareness in the Room format. We then chose a two-sided topic for discussion in the Gallery interface and the Room interface where participants on each team face-off against each other, and found that they utilized spatial references to orient their allegiances, expressing greater engagement with those farther away in digital space and greater empathy with those closer, in the Room over the Gallery format. We found spatial effects in the way participants hide from the spotlight, in perspective-taking, and in their use of expressive gestures in time on the screen. This work highlights the need for considering spatial configuration in 2D in the design of collaborative communication systems to optimize for psychological needs for particular tasks.

[HC-15] Social, Spatial, and Self-Presence as Predictors of Basic Psychological Need Satisfaction in Social Virtual Reality

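[Quick read]: This paper asks how distinct dimensions of presence in virtual reality (VR) map onto the basic psychological needs of Self-Determination Theory (autonomy, competence, and relatedness) and whether demographic factors such as gender and age moderate those relationships. Using Structural Equation Modeling (SEM) on data from 301 social VR users, it finds that social presence predicts all three needs, self-presence predicts competence and relatedness, and spatial presence shows no direct or moderating effects. Gender and age moderate these relationships, so presence acts as a motivational mechanism that differs across groups. The findings offer theoretical grounding and practical guidance for designing inclusive, need-supportive multiuser VR environments.

Link: https://arxiv.org/abs/2602.12764
Authors: Qijia Chen, Andrea Bellucci, Giulio Jacucci
Affiliations: University of Helsinki; Universidad Carlos III de Madrid
Categories: Human-Computer Interaction (cs.HC)
Comments: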

Abstract:Extensive research has examined presence and basic psychological needs (drawing on Self-Determination Theory) in digital media. While prior work offers hints of potential connections, we lack a systematic account of whether and how distinct presence dimensions map onto the basic needs of autonomy, competence, and relatedness. We surveyed 301 social VR users and analyzed using Structural Equation Modeling. Results show that social presence predicts all three needs, while self-presence predicts competence and relatedness, and spatial presence shows no direct or moderating effects. Gender and age moderated these relationships: women benefited more from social presence for autonomy and relatedness, men from self- and spatial presence for competence and autonomy, and younger users showed stronger associations between social presence and relatedness, and between self-presence and autonomy. These findings position presence as a motivational mechanism shaped by demographic factors. The results offer theoretical insights and practical implications for designing inclusive, need-supportive multiuser VR environments.

[HC-16] “Not Human, Funnier”: How Machine Identity Shapes Humor Perception in Online AI Stand-up Comedy

[Quick read]: This paper asks how AI can be made funnier in comedy, given that human performers draw on their identity (ethnicity, gender, community background, and so on) to make jokes. The key to the solution is a machine-identity-based agent that uses its own status as an AI in an online performance format rather than imitating a human identity. In studies with human audiences (N=32), this agent was rated funnier than a baseline GPT agent, which supports designing AI as its own identity rather than as a stand-in for humans.

Link: https://arxiv.org/abs/2602.12763
Authors: Xuehan Huang, Canwen Wang, Yifei Hao, Daijin Yang, Ray LC
Affiliations: The University of Hong Kong; Carnegie Mellon University; East China Normal University; Northeastern University; City University of Hong Kong
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 27 pages, 5 figures. Conditionally Accepted to CHI '26

Abstract:Chatbots are increasingly applied to domains previously reserved for human actors. One such domain is comedy, whereby both the general public working with ChatGPT and research-based LLM-systems have tried their hands on making humor. In formative interviews with professional comedians and video analyses of stand-up comedy in humans, we found that human performers often use their ethnic, gender, community, and demographic-based identity to enable joke-making. This suggests whether the identity of AI itself can empower AI humor generation for human audiences. We designed a machine-identity-based agent that uses its own status as AI to tell jokes in online performance format. Studies with human audiences (N=32) showed that machine-identity-based agents were seen as funnier than baseline-GPT agent. This work suggests the design of human-AI integrated systems that explicitly utilize AI as its own unique identity apart from humans.

[HC-17] SoK: Understanding the Pedagogical, Health, Ethical, and Privacy Challenges of Extended Reality in Early Childhood Education

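[Quick read]: This paper addresses the challenges of using Extended Reality (XR) in early childhood education, including technical reliability, pedagogical fit, health risks, privacy, and equity. It applies a Systematization of Knowledge (SoK) to 111 peer-reviewed studies, quantifies the patterns, and builds a joint risk and attention matrix together with an Augmented Human Development (AHD) model that link XR pipeline properties to cognitive load, sensory conflict, and access inequity. A seven-dimension coding scheme on a 0-2 scale measures scholarly attention and reveals an imbalance: pedagogy receives the most attention (mean 1.56), while data security practices (mean 0.14) are almost ignored. The paper closes with a roadmap for Child-Centered XR that steers HCI researchers and educators away from novelty and toward systems that are developmentally aligned, secure by default, and accessible to diverse learners.

Link: https://arxiv.org/abs/2602.12749
Authors: Supriya Khadka, Sanchari Das
Affiliations: Coventry University; George Mason University
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted to Augmented Humans 2026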

Abstract:Extended Reality (XR) combines dense sensing, real-time rendering, and close-range interaction, making its use in early childhood education both promising and high risk. To investigate this, we conduct a Systematization of Knowledge (SoK) of 111 peer-reviewed studies with children aged 3-8, quantifying how technical, pedagogical, health, privacy, and equity challenges arise in practice. We found that AR dominates the landscape (73%), focusing primarily on tablets or phones, while VR remains uncommon and typically relies on head mounted displays (HMDs). We integrate these quantitative patterns into a joint risk and attention matrix and an Augmented Human Development (AHD) model that link XR pipeline properties to cognitive load, sensory conflict, and access inequity. Finally, implementing a seven dimension coding scheme on a 0 - 2 scale, we obtain mean scholarly attention scores of 1.56 for pedagogy, 1.04 for privacy (primarily procedural consent), 0.96 for technical reliability, 0.92 for accessibility in low resource contexts, 0.81 for medical and health issues, 0.52 for accessibility for disabilities, and 0.14 for data security practices. This indicates that pedagogy receives the most systematic scrutiny, while data access practices is largely overlooked. We conclude by offering a roadmap for Child-Centered XR that helps HCI researchers and educators move beyond novelty to design systems that are developmentally aligned, secure by default, and accessible to diverse learners.

[HC-18] X-SYS: A Reference Architecture for Interactive Explanation Systems

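[Quick read]: This paper addresses the challenge of deploying explainable AI (XAI) as working systems: how to integrate explanation methods into interactive explanation systems that stay usable across repeated queries, evolving models and data, and governance constraints. The key to the solution is X-SYS, a reference architecture for interactive explanation systems. It is organized around four quality attributes (STAR: scalability, traceability, responsiveness, and adaptability) and a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, and Orchestration and Governance). Contract-based service boundaries decouple the explanation user interface (XUI) from backend computation so that each can evolve independently, offline/online separation ensures responsiveness, and persistent state management supports traceability, yielding a reusable end-to-end design blueprint.

Link: https://arxiv.org/abs/2602.12748
Authors: Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: 18 pages, 8 figures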

Abstract:The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: Interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, that guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.

[HC-19] From Guidelines to Practice: Evaluating the Reproducibility of Methods in Computational Social Science

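[Quick read]: This paper addresses poor reproducibility in computational social science, where complex workflows, evolving software ecosystems, and inconsistent documentation make published methods hard to re-execute. Its findings point to three levers. First, curated documentation substantially reduces repository-level errors and helps users interpret method outputs. Second, a preset execution environment stabilizes the runtime and further raises the success rate while shortening task completion time. Third, participants frequently relied on AI tools for troubleshooting, which shows that such assistance also helps lower reproducibility barriers. Overall, the study argues for coordinated improvements in documentation quality, environment stability, and conceptual clarity to build more reliable research infrastructure.

Link: https://arxiv.org/abs/2602.12747
Authors: Fakhri Momeni, Sarah Sajid, Johannes Kiesel
Affiliations: GESIS - Leibniz Institute for the Social Sciences
Categories: Human-Computer Interaction (cs.HC)
Comments: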

Abstract: Reproducibility remains a central challenge in computational social science, where complex workflows, evolving software ecosystems, and inconsistent documentation hinder researchers' ability to re-execute published methods. This study presents a systematic evaluation of reproducibility across three conditions: uncurated documentation, curated documentation, and curated documentation paired with a preset execution environment. Using 47 usability test sessions, we combine behavioral performance indicators (success rates, task time, and error profiles) with questionnaire data and thematic analysis to identify technical and conceptual barriers to reproducibility. Curated documentation substantially reduced repository-level errors and improved users' ability to interpret method outputs. Standardizing the execution environment further improved reproducibility, yielding the highest success rate and shortest task completion times. Across conditions, participants frequently relied on AI tools for troubleshooting, often enabling independent resolution of issues without facilitator intervention. Our findings demonstrate that reproducibility barriers are multi-layered and require coordinated improvements in documentation quality, environment stability, and conceptual clarity. We discuss implications for the design of reproducibility platforms and infrastructure in computational social science.

[HC-20] Bonik Somiti: A Social-market Tool for Safe, Accountable, and Harmonious Informal E-Market Ecosystem in Bangladesh

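[Quick read]: This paper addresses the frequent fraud and financial harm in informal e-markets in developing countries, and in particular the fact that warnings shared in social media groups are scattered, hard to verify, and rarely lead to resolution. The key to the solution is Bonik Somiti, a socio-technical system that supports structured reporting, admin-led mediation, and accountability, with the aim of improving trust and dispute resolution among participants in informal markets.

Link: https://arxiv.org/abs/2602.12650
Authors: ATM Mizanur Rahman, Sharifa Sultana
Affiliations: University of Illinois Urbana-Champaign, USA
Categories: Human-Computer Interaction (cs.HC)
Comments: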

Abstract:People in informal e-markets often try to deal with fraud and financial harm by sharing posts, screenshots, and warnings in social media groups. However, buyers and sellers frequently face further problems because these reports are scattered, hard to verify, and rarely lead to resolution. We studied these issues through a survey with 124 participants and interviews with 36 buyers, sellers, and related stakeholders from Bangladesh and designed Bonik Somiti, a socio-technical system that supports structured reporting, admin-led mediation, and accountability in informal e-markets. Our evaluation with 32 participants revealed several challenges in managing fraud, resolving disputes, and building trust within existing informal practices and the assumptions behind them. Based on these findings, we further discuss how community-centered technologies can be designed to support safer and more accountable informal e-markets in the Global South.

[HC-21] Artic: AI-oriented Real-time Communication for MLLM Video Assistant

[Quick Read]: This paper addresses a fundamental mismatch between current Real-time Communication (RTC) frameworks and AI Video Assistants, which stems from a drastic shift in Quality of Experience (QoE) requirements and more challenging network conditions, and which causes latency spikes and accuracy drops when existing RTC techniques are deployed. The key to its solution is the Artic framework, whose core contributions are: (1) Response Capability-aware Adaptive Bitrate, which exploits the accuracy saturation of Multimodal Large Language Models (MLLMs) to proactively cap bitrate and reserve bandwidth headroom for lower latency; (2) Zero-overhead Context-aware Streaming, which allocates limited bitrate to the regions most important for the response, preserving understanding accuracy at low bitrates; and (3) the Degraded Video Understanding Benchmark, the first benchmark to quantify how RTC-induced video degradation affects MLLM accuracy. Experiments show that, compared with existing methods, Artic improves accuracy by 15.12% and reduces latency by 135.31 ms.

Link: https://arxiv.org/abs/2602.12641
Authors: Jiangkai Wu, Zhiyuan Ren, Junquan Zhong, Liming Liu, Xinggong Zhang
Affiliations: Peking University
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments:

Abstract: AI Video Assistant emerges as a new paradigm for Real-time Communication (RTC), where one peer is a Multimodal Large Language Model (MLLM) deployed in the cloud. This makes interaction between humans and AI more intuitive, akin to chatting with a real person. However, a fundamental mismatch exists between current RTC frameworks and AI Video Assistants, stemming from the drastic shift in Quality of Experience (QoE) and more challenging networks. Measurements on our production prototype also confirm that current RTC fails, causing latency spikes and accuracy drops. To address these challenges, we propose Artic, an AI-oriented RTC framework for MLLM Video Assistants, exploring the shift from “humans watching video” to “AI understanding video.” Specifically, Artic proposes: (1) Response Capability-aware Adaptive Bitrate, which utilizes MLLM accuracy saturation to proactively cap bitrate, reserving bandwidth headroom to absorb future fluctuations for latency reduction; (2) Zero-overhead Context-aware Streaming, which allocates limited bitrate to regions most important for the response, maintaining accuracy even under ultra-low bitrates; and (3) Degraded Video Understanding Benchmark, the first benchmark evaluating how RTC-induced video degradation affects MLLM accuracy. Prototype experiments using real-world uplink traces show that compared with existing methods, Artic significantly improves accuracy by 15.12% and reduces latency by 135.31 ms. We will release the benchmark and codes at this https URL.
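The idea of capping bitrate at the point where MLLM accuracy saturates can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the `tolerance` parameter, and the accuracy-versus-bitrate profile are hypothetical and are not taken from the Artic paper, which does not describe its algorithm at this level of detail in the abstract.

```python
def saturation_bitrate(profile, tolerance=0.01):
    """Return the lowest bitrate whose accuracy is within `tolerance` of the best.

    `profile` is a list of (bitrate_kbps, accuracy) pairs. Beyond the returned
    bitrate, extra bits buy almost no extra accuracy.
    """
    best = max(acc for _, acc in profile)
    return min(rate for rate, acc in profile if acc >= best - tolerance)


def capped_bitrate(bandwidth_estimate_kbps, profile, tolerance=0.01):
    """Send at the bandwidth estimate, but never above the saturation point."""
    cap = saturation_bitrate(profile, tolerance)
    return min(bandwidth_estimate_kbps, cap)


# Hypothetical profile for illustration only; these are not measured values.
PROFILE = [(200, 0.55), (400, 0.70), (800, 0.785), (1600, 0.788), (3200, 0.79)]

print(capped_bitrate(2500, PROFILE))  # capped at the saturation point
print(capped_bitrate(500, PROFILE))   # bandwidth-limited, below the cap
```

When the estimate exceeds the cap, the difference is the headroom that can absorb bandwidth fluctuations without queueing delay, which is the intuition the abstract describes.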

[HC-22] AI Agents for Inventory Control: Human-LLM-OR Complementarity

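[Quick Read]: This paper addresses the performance degradation of traditional inventory control decisions caused by rigid modeling assumptions: operations research (OR) algorithms can perform poorly in realistic settings with shifting demand distributions, seasonality, or uncertain lead times. Meanwhile, although large language models (LLMs) can reason flexibly and use context, it remains unclear how best to integrate them into decision pipelines. The key contribution is InventoryBench, a benchmark of over 1,000 instances for evaluating methods in complex environments, which shows that OR-augmented LLM methods outperform OR or LLMs alone, indicating that the two are complementary rather than substitutes. A further human-in-the-loop experiment finds that human-AI teams achieve higher average profits than humans or AI agents alone, and that there is an individual-level complementarity effect: a substantial fraction of individuals benefit from AI collaboration, supported by a distribution-free lower bound on that fraction.

Link: https://arxiv.org/abs/2602.12631
Authors: Jackie Baek, Yaopeng Fu, Will Ma, Tianyi Peng
Affiliations: Stern School of Business, New York University; Columbia University
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: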

Abstract:Inventory control is a fundamental operations problem in which ordering decisions are traditionally guided by theoretically grounded operations research (OR) algorithms. However, such algorithms often rely on rigid modeling assumptions and can perform poorly when demand distributions shift or relevant contextual information is unavailable. Recent advances in large language models (LLMs) have generated interest in AI agents that can reason flexibly and incorporate rich contextual signals, but it remains unclear how best to incorporate LLM-based methods into traditional decision-making pipelines. We study how OR algorithms, LLMs, and humans can interact and complement each other in a multi-period inventory control setting. We construct InventoryBench, a benchmark of over 1,000 inventory instances spanning both synthetic and real-world demand data, designed to stress-test decision rules under demand shifts, seasonality, and uncertain lead times. Through this benchmark, we find that OR-augmented LLM methods outperform either method in isolation, suggesting that these methods are complementary rather than substitutes. We further investigate the role of humans through a controlled classroom experiment that embeds LLM recommendations into a human-in-the-loop decision pipeline. Contrary to prior findings that human-AI collaboration can degrade performance, we show that, on average, human-AI teams achieve higher profits than either humans or AI agents operating alone. Beyond this population-level finding, we formalize an individual-level complementarity effect and derive a distribution-free lower bound on the fraction of individuals who benefit from AI collaboration; empirically, we find this fraction to be substantial. 
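To make the "OR algorithm plus contextual adjustment" setting concrete, here is a minimal sketch of a classic base-stock (order-up-to) rule in a multi-period, lost-sales simulation. The abstract does not say which OR algorithms InventoryBench uses, so the base-stock rule is an assumed example, and the `adjust` hook is a hypothetical stand-in for an LLM-based correction, not the paper's method.

```python
from collections import deque


def base_stock_order(on_hand, pipeline, base_stock_level):
    """Order-up-to rule: raise the inventory position to the base-stock level."""
    inventory_position = on_hand + sum(pipeline)
    return max(0, base_stock_level - inventory_position)


def simulate(demands, base_stock_level, lead_time, price, unit_cost, adjust=None):
    """Run a lost-sales simulation and return total profit.

    `adjust` is a hypothetical hook standing in for a contextual (for example
    LLM-based) correction to the OR recommendation; it maps (period, order)
    to a revised order quantity.
    """
    on_hand = 0
    pipeline = deque([0] * lead_time)  # orders placed but not yet delivered
    profit = 0.0
    for t, demand in enumerate(demands):
        order = base_stock_order(on_hand, pipeline, base_stock_level)
        if adjust is not None:
            order = max(0, adjust(t, order))
        profit -= unit_cost * order
        pipeline.append(order)
        on_hand += pipeline.popleft()  # the order placed lead_time periods ago arrives
        sales = min(on_hand, demand)   # unmet demand is lost
        on_hand -= sales
        profit += price * sales
    return profit


print(simulate([4, 4, 4], base_stock_level=4, lead_time=0, price=2.0, unit_cost=1.0))
```

Comparing `simulate(...)` with and without an `adjust` function on the same demand sequence is one simple way to ask whether a contextual correction complements the OR rule or hurts it.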

[HC-23] PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People

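[Quick Read]: This paper addresses the dual challenge that visually impaired and low-vision people face in complex social environments: physical mobility and interactive assistance, that is, how to provide safe navigation together with natural, intelligent social interaction. Its solution is the PISHYAR system, built on two cooperating modules. The first is a social navigation framework on a Raspberry Pi 5 that integrates real-time RGB-D perception (OAK-D Lite camera), YOLOv8 object detection, COMPOSER collective activity recognition, D* Lite dynamic path planning, and vibrotactile feedback, enabling obstacle avoidance and socially compliant path decisions. The second is an agentic multimodal interaction framework based on large language models (LLMs) and vision language models (VLMs) that uses speech recognition, text-to-speech, and dynamic routing between voice-only and vision-only modes to handle tasks such as scene description and object localization, improving users' perceptions of usability, trust, and sociability.

Link: https://arxiv.org/abs/2602.12597
Authors: Mahdi Haghighat Joo, Maryam Karimi Jafari, Alireza Taheri
Affiliations: Unknown
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: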

Abstract:This paper presents PISHYAR, a socially intelligent smart cane designed by our group to combine socially aware navigation with multimodal human-AI interaction to support both physical mobility and interactive assistance. The system consists of two components: (1) a social navigation framework implemented on a Raspberry Pi 5 that integrates real-time RGB-D perception using an OAK-D Lite camera, YOLOv8-based object detection, COMPOSER-based collective activity recognition, D* Lite dynamic path planning, and haptic feedback via vibration motors for tasks such as locating a vacant seat; and (2) an agentic multimodal LLM-VLM interaction framework that integrates speech recognition, vision language models, large language models, and text-to-speech, with dynamic routing between voice-only and vision-only modes to enable natural voice-based communication, scene description, and object localization from visual input. The system is evaluated through a combination of simulation-based tests, real-world field experiments, and user-centered studies. Results from simulated and real indoor environments demonstrate reliable obstacle avoidance and socially compliant navigation, achieving an overall system accuracy of approximately 80% under different social conditions. Group activity recognition further shows robust performance across diverse crowd scenarios. In addition, a preliminary exploratory user study with eight visually impaired and low-vision participants evaluates the agentic interaction framework through structured tasks and a UTAUT-based questionnaire reveals high acceptance and positive perceptions of usability, trust, and perceived sociability during our experiments. The results highlight the potential of PISHYAR as a multimodal assistive mobility aid that extends beyond navigation to provide socially interactive support for such users.

[HC-24] Editable XAI: Toward Bidirectional Human-AI Alignment with Co-Editable Explanations of Interpretable Attributes

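[Quick Read]: This paper addresses disagreement between users and model decisions in Explainable AI (XAI) caused by misaligned domain knowledge. This inconsistency hinders understanding, and because conventional XAI explanations are read-only, users cannot adjust or improve them. The key to the solution is CoExplain, an editable XAI framework that uses a neural network for universal representation and symbolic rules for intuitive reasoning, and builds a faithful proxy decision tree to explain the neural network. It parses user-written rules into an equivalent neural network graph and collaboratively optimizes the decision tree, achieving bidirectional AI alignment. The approach draws on the generation effect of active learning to strengthen user understanding and control; in a user study it improved understanding while requiring fewer edits and less time.

Link: https://arxiv.org/abs/2602.12569
Authors: Haoyang Chen, Jingwen Bai, Fang Tian, Brian Y Lim
Affiliations: National University of Singapore
Categories: Human-Computer Interaction (cs.HC)
Comments: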

Abstract:While Explainable AI (XAI) helps users understand AI decisions, misalignment in domain knowledge can lead to disagreement. This inconsistency hinders understanding, and because explanations are often read-only, users lack the control to improve alignment. We propose making XAI editable, allowing users to write rules to improve control and gain deeper understanding through the generation effect of active learning. We developed CoExplain, leveraging a neural network for universal representation and symbolic rules for intuitive reasoning on interpretable attributes. CoExplain explains the neural network with a faithful proxy decision tree, parses user-written rules as an equivalent neural network graph, and collaboratively optimizes the decision tree. In a user study (N=43), CoExplain and manually editable XAI improved user understanding and model alignment compared to read-only XAI. CoExplain was easier to use with fewer edits and less time. This work contributes Editable XAI for bidirectional AI alignment, improving understanding and control.

[HC-25] GatheringSense: AI-Generated Imagery and Embodied Experiences for Understanding Literati Gatherings

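[Quick Read]: This paper addresses a common limitation of current generative AI applications in cultural heritage: existing approaches focus on aesthetic reproduction and struggle to convey the deeper meaning of cultural rituals and social structures. The key to its solution is a dual-path framework grounded in embodied cognition that combines AI-generated multimodal content with immersive embodied participation to deepen understanding of, and emotional resonance with, traditional cultural activities such as literati gatherings (Wenren Yaji). The framework is instantiated in the GatheringSense system. The empirical study shows that AI-generated content improves the readability of cultural symbols and initial emotional attraction, whereas embodied participation significantly deepens understanding of ritual rules and social roles and increases psychological closeness and presence, yielding transferable design implications for generative experiences in cultural heritage.

Link: https://arxiv.org/abs/2602.12565
Authors: You Zhou, Bingyuan Wang, Hongcheng Guo, Rui Cao, Zeyu Wang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Xiamen University Malaysia; The Hong Kong University of Science and Technology
Categories: Human-Computer Interaction (cs.HC)
Comments: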

Abstract:Chinese literati gatherings (Wenren Yaji), as a situated form of Chinese traditional culture, remain underexplored in depth. Although generative AI supports powerful multimodal generation, current cultural applications largely emphasize aesthetic reproduction and struggle to convey the deeper meanings of cultural rituals and social frameworks. Based on embodied cognition, we propose an AI-driven dual-path framework for cultural understanding, which we instantiate through GatheringSense, a literati-gathering experience. We conduct a mixed-methods study (N=48) to compare how AI-generated multimodal content and embodied participation complement each other in supporting the understanding of literati gatherings and fostering cultural resonance. Our results show that AI-generated content effectively improves the readability of cultural symbols and initial emotional attraction, yet limitations in physical coherence and micro-level credibility may affect users’ satisfaction. In contrast, embodied experience significantly deepens participants’ understanding of ritual rules and social roles, and increases their psychological closeness and presence. Based on these findings, we offer empirical evidence and five transferable design implications for generative experience in cultural heritage.

[HC-26] Not a Silver Bullet for Loneliness: How Attachment and Age Shape Intimacy with AI Companions

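[Quick Read]: This paper addresses the following problem: artificial intelligence (AI) companions are widely promoted as a general solution for loneliness, overlooking how individual psychological dispositions (such as attachment style) and life stage shape human-AI intimacy. Its key contribution is to develop and test an integrated model of how loneliness relates to artificial intimacy, moderated by attachment insecurity and conditioned by age. The pattern is differentiated: for securely attached users, loneliness is associated with reduced intimacy, whereas avoidant and ambivalent users show increased intimacy, and older adults report higher intimacy even at lower loneliness levels. The findings indicate that artificial intimacy is a sociotechnical process shaped by psychological dispositions and demographic conditions rather than a universal intervention.

Link: https://arxiv.org/abs/2602.12476
Authors: Raffaele Ciriello, Uri Gal, Ofir Turel
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: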

Abstract:Artificial intelligence (AI) companions are increasingly promoted as solutions for loneliness, often overlooking how personal dispositions and life-stage conditions shape artificial intimacy. Because intimacy is a primary coping mechanism for loneliness that varies by attachment style and age, we examine how different types of users form intimate relationships with AI companions in response to loneliness. Drawing on a hermeneutic literature review and a survey of 277 active AI companion users, we develop and test a model in which loneliness predicts intimacy, moderated by attachment insecurity and conditioned by age. Although the cross-sectional data limits causal inference, the results reveal a differentiated pattern. Loneliness is paradoxically associated with reduced intimacy for securely attached users but with increased intimacy for avoidant and ambivalent users, while anxious users show mixed effects. Older adults report higher intimacy even at lower loneliness levels. These findings challenge portrayals of AI companions as universal remedies for loneliness. Instead, artificial intimacy emerges as a sociotechnical process shaped by psychological dispositions and demographic conditions. The study clarifies who is most likely to form intimate relationships with AI companions and highlights ethical risks in commercial models that may capitalise on user vulnerability.

[HC-27] SHAPR: A Solo Human-Centred and AI-Assisted Practice Framework for Research Software Development

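[Quick Read]: This paper addresses the following problem: in Higher Degree Research (HDR) contexts, how can solo researchers carry out methodologically rigorous and accountable research software development with generative AI assistance, given that the existing Action Design Research (ADR) framework offers little concrete operational guidance for this setting? The key to its solution is the SHAPR framework (Solo, Human-centred, AI-assisted PRactice), a practice-level operational framework that translates ADR's high-level principles into actionable guidance and makes explicit the human roles, reflective practices, and lightweight governance mechanisms in AI-assisted development, thereby sustaining accountability and learning during research software development.

Link: https://arxiv.org/abs/2602.12443
Authors: Ka Ching Chan
Affiliations: University of Southern Queensland
Categories: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
Comments: 28 pages, 3 figures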

Abstract:Research software has become a central vehicle for inquiry and learning in many Higher Degree Research (HDR) contexts, where solo researchers increasingly develop software-based artefacts as part of their research methodology. At the same time, generative artificial intelligence is reshaping development practice, offering powerful forms of assistance while introducing new challenges for accountability, reflection, and methodological rigour. Although Action Design Research (ADR) provides a well-established foundation for studying and constructing socio-technical artefacts, it offers limited guidance on how its principles can be operationalised in the day-to-day practice of solo, AI-assisted research software development. This paper proposes the SHAPR framework (Solo, Human-centred, AI-assisted PRactice) as a practice-level operational framework that complements ADR by translating its high-level principles into actionable guidance for contemporary research contexts. SHAPR supports the enactment of ADR Building-Intervention-Evaluation cycles by making explicit the roles, artefacts, reflective practices, and lightweight governance mechanisms required to sustain human accountability and learning in AI-assisted development. The contribution of the paper is conceptual: SHAPR itself is treated as the primary design artefact and unit of analysis and is evaluated formatively through reflective analysis of its internal coherence, alignment with ADR principles, and applicability to solo research practice. By explicitly linking research software development, Human-AI collaboration, and reflective learning, this study contributes to broader discussions on how SHAPR can support both knowledge production and HDR researcher training.

[HC-28] KeySense: LLM-Powered Hands-Down Ten-Finger Typing on Commodity Touchscreens

[Quick Read]: This paper addresses the problem that existing touchscreen software keyboards force users into slow and fatiguing index-finger tapping ("chicken typing"), so they cannot use their familiar hands-down ten-finger typing skills. The key to its solution is KeySense, a purely software approach that separates intentional taps from resting-finger noise using cognitive-motor timing patterns, and then uses a fine-tuned large language model (LLM) decoder to convert the noisy letter sequence into the intended word, enabling accurate, efficient, and comfortable ten-finger text entry without extra hardware.

Link: https://arxiv.org/abs/2602.12432
Authors: Tony Li, Yan Ma, Zhuojun Li, Chun Yu, IV Ramakrishnan, Xiaojun Bi
Affiliations: Stony Brook University; Kean University; Tsinghua University
Categories: Human-Computer Interaction (cs.HC)
Comments: 16 pages, 11 figures. Accepted to appear in the Proceedings of the ACM CHI Conference on Human Factors in Computing Systems (CHI 2026). This version corresponds to the accepted manuscript

Abstract: Existing touchscreen software keyboards prevent users from resting their hands, forcing slow and fatiguing index-finger tapping (“chicken typing”) instead of familiar hands-down ten-finger typing. We present KeySense, a purely software solution that preserves physical keyboard motor skills. KeySense isolates intentional taps from resting-finger noise using cognitive-motor timing patterns, and then uses a fine-tuned LLM decoder to convert the resulting noisy letter sequence into the intended word. In controlled component tests, the decoder substantially outperforms two statistical baselines (top-1 accuracy 84.8% vs 75.7% and 79.3%). A 12-participant study shows clear ergonomic and performance benefits: compared with the conventional hover-style keyboard, users rated KeySense as markedly less physically demanding (NASA-TLX median 1.5 vs 4.0), and after brief practice typed significantly faster (WPM 28.3 vs 26.2, p < 0.01). These results indicate that KeySense enables accurate, efficient, and comfortable ten-finger text entry on commodity touchscreens without any extra hardware.

[HC-29] Eyes on Many: Evaluating Gaze, Hand, and Voice for Multi-Object Selection in Extended Reality

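[Quick Read]: This paper addresses the interaction efficiency of multi-object selection in extended reality (XR), where the core question is how combinations of mode-switching and subselection techniques affect speed and user experience. The key is a systematic evaluation of four mode-switching techniques (SemiPinch, FullPinch, DoublePinch, and Voice) combined with three subselection techniques (Gaze+Dwell, Gaze+Pinch, and Gaze+Voice). The experiment shows that DoublePinch paired with Gaze+Pinch performs best overall. Voice showed benefits for mode-switching, but Gaze+Voice subselection was less favored because the repeated voice commands were perceived as tedious. These results provide empirical evidence and design recommendations for multi-object selection in XR.

Link: https://arxiv.org/abs/2602.12406
Authors: Mohammad Raihanul Bashar, Aunnoy K Mutasim, Ken Pfeuffer, Anil Ufuk Batmaz
Affiliations: Concordia University; Simon Fraser University; Aarhus University
Categories: Human-Computer Interaction (cs.HC)
Comments: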

Abstract:Interacting with multiple objects simultaneously makes us fast. A pre-step to this interaction is to select the objects, i.e., multi-object selection, which is enabled through two steps: (1) toggling multi-selection mode – mode-switching – and then (2) selecting all the intended objects – subselection. In extended reality (XR), each step can be performed with the eyes, hands, and voice. To examine how design choices affect user performance, we evaluated four mode-switching (SemiPinch, FullPinch, DoublePinch, and Voice) and three subselection techniques (Gaze+Dwell, Gaze+Pinch, and Gaze+Voice) in a user study. Results revealed that while DoublePinch paired with Gaze+Pinch yielded the highest overall performance, SemiPinch achieved the lowest performance. Although Voice-based mode-switching showed benefits, Gaze+Voice subselection was less favored, as the required repetitive vocal commands were perceived as tedious. Overall, these findings provide empirical insights and inform design recommendations for multi-selection techniques in XR.

[HC-30] A Lightweight Cubature Kalman Filter for Attitude and Heading Reference Systems Using Simplified Prediction Equations

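[Quick Read]: This paper addresses the high computational cost and limited real-time performance of Kalman filtering in Attitude and Heading Reference Systems (AHRSs). The key to its solution is a lightweight cubature Kalman filter, the Kaisoku Cubature Kalman Filter (KCKF), which simplifies the prediction equations of the original Cubature Kalman Filter (CKF) while preserving equivalent mathematical relations, reducing the number of floating-point operations (FLOPs). Specifically, the KCKF expands and simplifies the summation terms in the CKF to obtain cheaper prediction equations, reducing computation time by approximately 19% on a high-performance computer and approximately 15% on a low-cost single-board computer while maintaining the CKF's attitude estimation accuracy.

Link: https://arxiv.org/abs/2602.12283
Authors: Shunsei Yamagishi, Lei Jing
Affiliations: University of Aizu
Categories: Systems and Control (eess.SY); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: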

Abstract:Attitude and Heading Reference Systems (AHRSs) are broadly applied wherever reliable orientation and motion sensing is required. In this paper, we present an improved Cubature Kalman Filter (CKF) with lower computational cost while maintaining estimation accuracy, which is named “Kaisoku Cubature Kalman Filter (KCKF)”. The computationally efficient equations of the KCKF are derived by simplifying those of the CKF, while preserving equivalent mathematical relations. The lightweight prediction equations in the KCKF are derived by expanding the summation terms in the CKF and simplifying the result. This paper shows that the KCKF requires fewer floating-point operations (FLOPs) than the CKF. The controlled experimental results show that the KCKF reduces the computation time by approximately 19% compared to the CKF on a high-performance computer, whereas the KCKF reduces the computation time by approximately 15% compared to the CKF on a low-cost single-board computer. In addition, the KCKF maintains the attitude estimation accuracy of the CKF.
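For readers unfamiliar with the summation terms being simplified, the following is a minimal sketch of the standard CKF prediction step in pure Python. It is the textbook third-degree cubature rule, not the KCKF itself; the paper's simplified equations are not given in the abstract. The helper names (`cholesky`, `cubature_points`, `ckf_predict`) are chosen here for illustration.

```python
import math


def cholesky(P):
    """Lower-triangular factor L of a symmetric positive-definite matrix, P = L L^T."""
    n = len(P)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(P[i][i] - s)
            else:
                L[i][j] = (P[i][j] - s) / L[j][j]
    return L


def cubature_points(x, P):
    """The 2n points x +/- sqrt(n) * (column j of L), each with weight 1/(2n)."""
    n = len(x)
    L = cholesky(P)
    scale = math.sqrt(n)
    points = []
    for sign in (1.0, -1.0):
        for j in range(n):
            points.append([x[i] + sign * scale * L[i][j] for i in range(n)])
    return points


def ckf_predict(x, P, f, Q):
    """Propagate mean and covariance through f by summing over cubature points."""
    n = len(x)
    propagated = [f(p) for p in cubature_points(x, P)]
    m = len(propagated)  # m == 2n
    mean = [sum(p[i] for p in propagated) / m for i in range(n)]
    cov = [
        [sum(p[i] * p[j] for p in propagated) / m - mean[i] * mean[j] + Q[i][j]
         for j in range(n)]
        for i in range(n)
    ]
    return mean, cov
```

As a sanity check on what "expanding the summation" can buy: for a linear model f(x) = F x, the sums above collapse algebraically to mean = F x and covariance = F P Fᵀ + Q, with no cubature points needed. Whether the KCKF's simplification takes this form for the AHRS model is not stated in the abstract; the linear case is only an illustration.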

Computer Vision

[CV-0] Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

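[Quick Read]: This paper addresses the difficulty of learning task-relevant grasping behaviors when robots learn prehensile manipulation from human videos: especially for robots without human-like hands, human videos alone rarely yield stable grasps that are compatible with the downstream task. The key to its solution is the Perceive-Simulate-Imitate (PSI) framework, whose core idea is paired grasp-trajectory filtering in simulation. This step adds grasp suitability labels to human video data and enables supervised learning of task-oriented grasping, so that the modular manipulation policy is significantly more robust in the real world than one that uses a generic grasp generator naively.

Link: https://arxiv.org/abs/2602.13197
Authors: Albert J. Zhai, Kuo-Hao Zeng, Jiasen Lu, Ali Farhadi, Shenlong Wang, Wei-Chiu Ma
Affiliations: University of Illinois Urbana-Champaign; Vercept; Apple; Allen Institute for AI; University of Washington; Cornell University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: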

Abstract:The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot’s ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.
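The paired grasp-trajectory filtering step can be pictured as a labeling loop: every candidate grasp is tried against every extracted trajectory in simulation, and the outcome becomes a supervised label. The sketch below is only a schematic of that idea. The function names, the data layout, the choice to drop trajectories that no grasp can execute, and the toy `reach_check` stand-in for a physics rollout are all assumptions, not details from the PSI paper.

```python
def label_grasps(trajectory, candidate_grasps, rollout_succeeds):
    """Pair each candidate grasp with a trajectory and record whether it works.

    `rollout_succeeds(grasp, trajectory)` is a hypothetical callable standing in
    for a simulated rollout of the post-grasp motion.
    """
    return [(grasp, bool(rollout_succeeds(grasp, trajectory)))
            for grasp in candidate_grasps]


def build_dataset(trajectories, candidate_grasps, rollout_succeeds):
    """Keep trajectories that at least one grasp can execute, with per-grasp labels."""
    dataset = []
    for trajectory in trajectories:
        labels = label_grasps(trajectory, candidate_grasps, rollout_succeeds)
        if any(ok for _, ok in labels):
            dataset.append({"trajectory": trajectory, "grasp_labels": labels})
    return dataset


def reach_check(grasp, trajectory):
    """Toy stand-in for simulation: a grasp works if its reach covers every waypoint."""
    return all(abs(waypoint) <= grasp["reach"] for waypoint in trajectory)


grasps = [{"name": "top", "reach": 0.5}, {"name": "side", "reach": 1.0}]
trajectories = [[0.1, 0.4], [0.2, 0.9], [0.3, 1.5]]
dataset = build_dataset(trajectories, grasps, reach_check)
print(len(dataset))
```

The resulting `grasp_labels` are what would supervise a task-oriented grasp scorer; a stable grasp that fails the rollout becomes a negative example rather than being trusted blindly.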

[CV-1] Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision

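[Quick Read]: This paper addresses the lack of functional and physical reasoning in prior work on referring image grounding, which focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks intent-driven understanding (e.g., "where can I safely store the knife?"). The key to its solution is ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning, together with ConverSeg-Net, a model that fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. Experiments show that current language-guided segmentation models are inadequate for Conversational Image Segmentation (CIS), while ConverSeg-Net achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks.

Link: https://arxiv.org/abs/2602.13195
Authors: Aadarsh Sahoo, Georgia Gkioxari
Affiliations: California Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL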

Abstract:Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., “left-most apple”) and overlooks functional and physical reasoning (e.g., “where can I safely store the knife?”). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: this https URL

[CV-2] FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control

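[Quick Read]: This paper addresses the poor effectiveness and weak generalization of control in video generation, in particular the limitations of existing methods that rely on ambiguous or task-specific signals. The key to its solution is FlexAM, a unified framework built on a novel 3D control signal that represents video dynamics as a point cloud, with three enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality. This representation disentangles appearance and motion and improves performance and versatility across image-to-video (I2V) and video-to-video (V2V) editing, camera control, and spatial object editing.

Link: https://arxiv.org/abs/2602.13185
Authors: Mingzhi Sheng, Zekai Gu, Peng Li, Cheng Lin, Hao-Xiang Guo, Ying-Cong Chen, Yuan Liu
Affiliations: HKUST(GZ); HKUST; MUST; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Codes: this https URL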

Abstract:Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue that a fundamental disentanglement of “appearance” and “motion” provides a more robust and scalable pathway. We propose FlexAM, a unified framework built upon a novel 3D control signal. This signal represents video dynamics as a point cloud, introducing three key enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality. This representation allows FlexAM to effectively disentangle appearance and motion, enabling a wide range of tasks including I2V/V2V editing, camera control, and spatial object editing. Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks.
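As a rough illustration of why a multi-frequency positional encoding helps separate fine-grained motion, here is a minimal sketch of the common sinusoidal encoding applied to a 3D point. The octave-spaced frequencies and the function name are assumptions borrowed from standard practice; the abstract does not specify FlexAM's exact encoding.

```python
import math


def multi_frequency_encoding(point, num_frequencies=4):
    """Encode each coordinate with sin/cos at frequencies pi * 2^k, k = 0..K-1.

    Returns a flat list of length len(point) * 2 * num_frequencies.
    """
    features = []
    for k in range(num_frequencies):
        frequency = (2.0 ** k) * math.pi
        for coordinate in point:
            features.append(math.sin(frequency * coordinate))
            features.append(math.cos(frequency * coordinate))
    return features


def encoding_distance(a, b, num_frequencies):
    """Euclidean distance between the encodings of two points."""
    ea = multi_frequency_encoding(a, num_frequencies)
    eb = multi_frequency_encoding(b, num_frequencies)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(ea, eb)))


# Two points that differ by a small displacement along x.
p, q = (0.10, 0.20, 0.30), (0.11, 0.20, 0.30)
print(encoding_distance(p, q, 1), encoding_distance(p, q, 6))
```

Adding higher-frequency bands makes the two nearby points easier to tell apart in feature space, which is the usual motivation for this kind of encoding when small displacements matter.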

[CV-3] Monocular Markerless Motion Capture Enables Quantitative Assessment of Upper Extremity Reachable Workspace

[Quick Read]: This paper addresses the technical complexity and high barrier to adoption of quantitative assessment of the Upper Extremity Reachable Workspace (UERW) in clinical settings. Marker-based motion capture systems are accurate but expensive and cumbersome, which limits their use in routine clinical practice. The key to the solution is AI-driven Markerless Motion Capture (MMC) from a single (monocular) camera, which simplifies data collection while retaining accuracy for the UERW task. The results show that a frontal monocular camera view agrees closely with marker-based motion capture (mean bias of 0.61 ± 0.12% reachspace reached per octant), especially for anterior workspace evaluation, supporting the feasibility and clinical potential of the approach.

Link: https://arxiv.org/abs/2602.13176
Authors: Seth Donahue, J.D. Peiffer, R. Tyler Richardson, Yishan Zhong, Shaun Q. Y. Tan, Benoit Marteau, Stephanie R. Russo, May D. Wang, R. James Cotton, Ross Chafetz
Affiliations: Shriners Children’s Lexington; University of Kentucky Department of Physical Therapy; Shirley Ryan AbilityLab; Northwestern University; School of Electrical and Computer Engineering, Georgia Institute of Technology; The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University; Shriners Hospitals for Children; Pennsylvania State University at Harrisburg; Nationwide Children’s Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: To validate a clinically accessible approach for quantifying the Upper Extremity Reachable Workspace (UERW) using a single (monocular) camera and Artificial Intelligence (AI)-driven Markerless Motion Capture (MMC) for biomechanical analysis. Objective assessment and validation of these techniques for specific clinically oriented tasks are crucial for their adoption in clinical motion analysis. AI-driven monocular MMC reduces the barriers to adoption in the clinic and has the potential to reduce the overhead for analysis of this common clinical assessment. Nine adult participants with no impairments performed the standardized UERW task, which entails reaching targets distributed across a virtual sphere centered on the torso, with targets displayed in a VR headset. Movements were simultaneously captured using a marker-based motion capture system and a set of eight FLIR cameras. We performed monocular video analysis on two of these video camera views to compare frontal and offset camera configurations. The frontal camera orientation demonstrated strong agreement with the marker-based reference, exhibiting a minimal mean bias of 0.61 ± 0.12% reachspace reached per octant (mean ± standard deviation). In contrast, the offset camera view underestimated the percent workspace reached (-5.66 ± 0.45% reachspace reached). Conclusion: The findings support the feasibility of a frontal monocular camera configuration for UERW assessment, particularly for anterior workspace evaluation where agreement with marker-based motion capture was highest. The overall performance demonstrates clinical potential for practical, single-camera assessments. This study provides the first validation of a monocular MMC system for the assessment of the UERW task. By reducing technical complexity, this approach enables broader implementation of quantitative upper extremity mobility assessment.
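The agreement figures quoted above are of the form "mean bias ± standard deviation" between two measurement systems. A minimal sketch of that computation is shown below, assuming bias is defined as the monocular estimate minus the marker-based reference for each paired measurement. The sample numbers are invented for illustration and are not data from the study, and the paper's exact aggregation across participants and octants is not described in the abstract.

```python
import statistics


def bias_summary(monocular, reference):
    """Mean and sample standard deviation of paired differences (monocular - reference)."""
    if len(monocular) != len(reference):
        raise ValueError("paired measurements must have the same length")
    differences = [m - r for m, r in zip(monocular, reference)]
    return statistics.mean(differences), statistics.stdev(differences)


# Invented percent-reachspace values for three paired measurements.
monocular_pct = [52.0, 48.0, 61.0]
marker_pct = [50.0, 49.0, 60.0]
mean_bias, sd_bias = bias_summary(monocular_pct, marker_pct)
print(f"{mean_bias:.2f} ± {sd_bias:.2f} % reachspace")
```

A mean bias near zero indicates no systematic over- or underestimation, while a clearly negative value, such as the one reported for the offset camera view, indicates that the method underestimates the reference.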

[CV-4] LongStream: Long-Sequence Streaming Autoregressive Visual Geometry

[Quick Read]: This paper addresses key challenges in long-sequence streaming 3D reconstruction: when processing sequences of thousands of frames, existing autoregressive models anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. The solution has three parts. First, it discards the first-frame anchor and predicts keyframe-relative poses, turning long-range extrapolation into a constant-difficulty local task. Second, it introduces orthogonal scale learning, which fully disentangles geometry from scale estimation to suppress drift. Third, it uses cache-consistent training with periodic cache refresh to mitigate attention-sink reliance and long-term KV-cache contamination in the Transformer cache, improving stability and accuracy over ultra-long sequences.

Link: https://arxiv.org/abs/2602.13172
Authors: Chong Cheng, Xianda Chen, Tao Xie, Wei Yin, Weiqiang Ren, Qian Zhang, Xiaoyuang Guo, Hao Wang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Horizon Robotics; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with periodic cache refresh. This approach suppresses attention degradation over ultra-long sequences and reduces the gap between training and inference. Experiments show LongStream achieves state-of-the-art performance. It delivers stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS. Project Page: this https URL
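
To make the idea of keyframe-relative poses concrete, here is a minimal sketch in 2-D (SE(2)) rather than the full 3-D setting of the paper. The function names and the `(x, y, theta)` pose layout are assumptions made for illustration; this is not LongStream's code. It shows that a pose expressed relative to a nearby keyframe stays small and local no matter how far the camera has travelled from the first frame.

```python
import math

def compose(a, b):
    """Compose two SE(2) poses (x, y, theta): apply b in the frame of a."""
    ax, ay, at = a
    bx, by, bt = b
    return (ax + math.cos(at) * bx - math.sin(at) * by,
            ay + math.sin(at) * bx + math.cos(at) * by,
            at + bt)

def invert(p):
    """Inverse of an SE(2) pose: rotation R^T, translation -R^T t."""
    x, y, t = p
    c, s = math.cos(t), math.sin(t)
    return (-(c * x + s * y), -(-s * x + c * y), -t)

def relative_to_keyframe(keyframe, frame):
    """Pose of `frame` expressed in the keyframe's coordinate system."""
    return compose(invert(keyframe), frame)
```

In this sketch the relative pose depends only on the motion since the keyframe, so the prediction target has the same magnitude at frame 10 and at frame 10,000.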

[CV-5] Realistic Face Reconstruction from Facial Embeddings via Diffusion Models AAAI2026

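Quick read: This paper examines the risk of privacy leakage in privacy-preserving face recognition (PPFR) systems, specifically by testing whether high-resolution face images can be reconstructed from their embeddings. The key to the solution is a general framework, Face Embedding Mapping (FEM), which pairs a pre-trained identity-preserving diffusion model with a Kolmogorov-Arnold Network (KAN) to map embeddings produced by PPFR systems to realistic face images. Experiments show that the method reconstructs high-quality faces that can be used to access real face recognition systems and that it remains robust on partial or protected embeddings, which makes it a useful tool for evaluating the privacy of FR and PPFR systems.

Link: https://arxiv.org/abs/2602.13168
Authors: Dong Han,Yong Li,Joachim Denzler
Affiliations: 1: University of Marburg; 2: TU Darmstadt
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to AAAI 2026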

Abstract:With the advancement of face recognition (FR) systems, privacy-preserving face recognition (PPFR) systems have gained popularity for their accurate recognition, enhanced facial privacy protection, and robustness to various attacks. However, there are limited studies to further verify privacy risks by reconstructing realistic high-resolution face images from embeddings of these systems, especially for PPFR. In this work, we propose the face embedding mapping (FEM), a general framework that explores Kolmogorov-Arnold Network (KAN) for conducting the embedding-to-face attack by leveraging a pre-trained Identity-Preserving diffusion model against state-of-the-art (SOTA) FR and PPFR systems. Based on extensive experiments, we verify that reconstructed faces can be used for accessing other real-world FR systems. Besides, the proposed method is robust in reconstructing faces from partial and protected face embeddings. Moreover, FEM can be utilized as a tool for evaluating the safety of FR and PPFR systems in terms of privacy leakage. All images used in this work are from public datasets.

[CV-6] Universal Transformation of One-Class Classifiers for Unsupervised Anomaly Detection

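Quick read: This paper addresses the sensitivity of conventional one-class-classification anomaly detectors to training label noise, and asks how to improve robustness and performance in the unsupervised setting when the training set is assumed to contain only nominal samples. The key to the solution is a dataset folding method built on two weak assumptions: anomalies are rare in the training set, and they are heterogeneous. Multiple independently trained one-class classifiers cross-filter the training set to identify and remove likely anomalies, which turns a detector that depends on one-class classification into a fully unsupervised one. The method requires no change to the underlying detector, only algorithmically selected training subsets. It substantially improves unsupervised anomaly detection for images and video and reaches state-of-the-art results on the MVTec AD, ViSA, and MVTec Loco AD benchmarks.

Link: https://arxiv.org/abs/2602.13091
Authors: Declan McIntosh,Alexandra Branzan Albu
Affiliations: University of Victoria
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 figures, 9 pages main paper, 15 pages total with supplemental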

Abstract:Detecting anomalies in images and video is an essential task for multiple real-world problems, including industrial inspection, computer-assisted diagnosis, and environmental monitoring. Anomaly detection is typically formulated as a one-class classification problem, where the training data consists solely of nominal values, leaving methods built on this assumption susceptible to training label noise. We present a dataset folding method that transforms an arbitrary one-class classifier-based anomaly detector into a fully unsupervised method. This is achieved by making a set of key weak assumptions: that anomalies are uncommon in the training dataset and generally heterogeneous. These assumptions enable us to utilize multiple independently trained instances of a one-class classifier to filter the training dataset for anomalies. This transformation requires no modifications to the underlying anomaly detector; the only changes are algorithmically selected data subsets used for training. We demonstrate that our method can transform a wide variety of one-class classifier anomaly detectors for both images and videos into unsupervised ones. Our method creates the first unsupervised logical anomaly detectors by transforming existing methods. We also demonstrate that our method achieves state-of-the-art performance for unsupervised anomaly detection on the MVTec AD, ViSA, and MVTec Loco AD datasets. As improvements to one-class classifiers are made, our method directly transfers those improvements to the unsupervised domain, linking the domains.
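
The abstract does not spell out the folding procedure, so the following is only a toy sketch of the general idea of cross-filtering a training set with independently trained one-class models. The z-score "classifier", the fold layout, and the `threshold` value are all assumptions chosen for illustration and should not be read as the authors' algorithm.

```python
import statistics

def fit_scorer(train):
    """Toy one-class model: distance from the training mean in std units."""
    mu = statistics.fmean(train)
    sigma = statistics.pstdev(train) or 1.0  # avoid division by zero
    return lambda x: abs(x - mu) / sigma

def fold_filter(data, k=3, threshold=3.0):
    """Score each fold with a model trained on the other folds; keep low scores."""
    folds = [data[i::k] for i in range(k)]
    kept = []
    for i, held_out in enumerate(folds):
        # Train only on the folds that do not contain the samples being scored.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        score = fit_scorer(train)
        kept.extend(x for x in held_out if score(x) <= threshold)
    return kept
```

A rare outlier is scored by a model that never saw it, so it stands out and is dropped; the filtered set can then be used to retrain the unchanged detector.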

[CV-7] SIEFormer: Spectral-Interpretable and -Enhanced Transformer for Generalized Category Discovery

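Quick read: This paper addresses the limited feature adaptability of Vision Transformers (ViT) on Generalized Category Discovery (GCD) tasks, and in particular how to model local structural correlations and global dependencies more effectively. The key to the solution is a spectral-interpretable and spectral-enhanced Transformer architecture (SIEFormer) whose two parallel branches reinterpret and enhance ViT attention in the spectral domain. The implicit branch models local structural correlations among tokens with different types of graph Laplacians and adds a Band-adaptive Filter (BaF) layer that supports both band-pass and band-reject filtering. The explicit branch introduces a Maneuverable Filtering Layer (MFL) that applies a Fourier transform to the "value" features, modulates the signal with learnable parameters in the frequency domain, and applies an inverse Fourier transform to obtain enhanced features, so that local and global semantic information are optimized jointly.

Link: https://arxiv.org/abs/2602.13067
Authors: Chunming Li,Shidong Wang,Tong Xin,Haofeng Zhang
Affiliations: Nanjing University of Science and Technology; Newcastle University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: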

Abstract:This paper presents a novel approach, Spectral-Interpretable and -Enhanced Transformer (SIEFormer), which leverages spectral analysis to reinterpret the attention mechanism within Vision Transformer (ViT) and enhance feature adaptability, with particular emphasis on challenging Generalized Category Discovery (GCD) tasks. The proposed SIEFormer is composed of two main branches, each corresponding to an implicit and explicit spectral perspective of the ViT, enabling joint optimization. The implicit branch realizes the use of different types of graph Laplacians to model the local structure correlations of tokens, along with a novel Band-adaptive Filter (BaF) layer that can flexibly perform both band-pass and band-reject filtering. The explicit branch, on the other hand, introduces a Maneuverable Filtering Layer (MFL) that learns global dependencies among tokens by applying the Fourier transform to the input "value" features, modulating the transformed signal with a set of learnable parameters in the frequency domain, and then performing an inverse Fourier transform to obtain the enhanced features. Extensive experiments reveal state-of-the-art performance on multiple image recognition datasets, reaffirming the superiority of our approach through ablation studies and visualizations.
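
The transform, modulate, and inverse-transform pipeline described for the explicit branch can be sketched for a single feature channel along the token axis. This is a hedged, pure-Python illustration with a naive DFT; the function names and the fixed `weights` list (standing in for learnable parameters) are assumptions, not the paper's MFL implementation.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def frequency_filter(values, weights):
    """Modulate one channel of token features in the frequency domain."""
    spectrum = dft(values)
    modulated = [s * w for s, w in zip(spectrum, weights)]
    # Only the real part is kept; it is exact when the weights are
    # symmetric across positive and negative frequencies.
    return [v.real for v in dft(modulated, inverse=True)]
```

With all weights set to 1 the layer is an identity map, and keeping only the zero-frequency weight replaces every token with the channel mean, which illustrates how a single frequency-domain multiplication mixes information across all tokens.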

[CV-8] A Calibrated Memorization Index (MI) for Detecting Training Data Leakage in Generative MRI Models

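Quick read: This paper addresses the risk that generative models for medical images memorize and duplicate their training data, which raises privacy concerns. The key to the solution is a calibrated per-sample metric built on image features extracted by an MRI foundation model. It aggregates multi-layer whitened nearest-neighbor similarities and maps them to a bounded Overfit/Novelty Index (ONI) and Memorization Index (MI), achieving near-perfect sample-level detection of duplicated images while remaining robust and consistent across MRI datasets.

Link: https://arxiv.org/abs/2602.13066
Authors: Yash Deo,Yan Jia,Toni Lassila,Victoria J Hodge,Alejandro F Frang,Chenghao Qian,Siyuan Kang,Ibrahim Habli
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ISBI 2026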

Abstract:Image generative models are known to duplicate images from the training data as part of their outputs, which can lead to privacy concerns when used for medical image generation. We propose a calibrated per-sample metric for detecting memorization and duplication of training data. Our metric uses image features extracted using an MRI foundation model, aggregates multi-layer whitened nearest-neighbor similarities, and maps them to bounded Overfit/Novelty Index (ONI) and Memorization Index (MI) scores. Across three MRI datasets with controlled duplication percentages and typical image augmentations, our metric robustly detects duplication and provides more consistent metric values across datasets. At the sample level, our metric achieves near-perfect detection of duplicates.

[CV-9] Curriculum-DPO: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

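Quick read: This paper addresses the fact that existing preference optimization methods such as Direct Preference Optimization (DPO) ignore differences in how hard individual preferences are to learn, which makes optimization for text-to-image generation suboptimal. The key solution is Curriculum-DPO++, which combines a data-level curriculum with a new model-level curriculum that grows the learning capacity of the denoising network during training: trainable layers are unfrozen step by step, and the rank of the Low-Rank Adaptation (LoRA) matrices is increased progressively. The paper also proposes a new ranking strategy to replace the original one. Across nine benchmarks the method outperforms competing state-of-the-art approaches in text alignment, aesthetics, and human preference.

Link: https://arxiv.org/abs/2602.13055
Authors: Florinel-Alin Croitoru,Vlad Hondru,Radu Tudor Ionescu,Nicu Sebe,Mubarak Shah
Affiliations: University of Bucharest; University of Trento; Center for Research in Computer Vision (CRCV); University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.13637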

Abstract:Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO takes into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at this https URL.
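
As an illustration of what a progressive rank schedule could look like, here is a small sketch that steps the LoRA rank up in equal stages. The abstract gives no concrete values, so the start rank, final rank, and number of stages below are hypothetical, and the function name is an assumption.

```python
def lora_rank_schedule(step, total_steps, start_rank=2, final_rank=16, stages=4):
    """Return the LoRA rank to use at `step`, growing in equal-length stages.

    All default values are hypothetical; the paper's actual schedule may differ.
    """
    stage = min(stages - 1, step * stages // total_steps)
    return start_rank + (final_rank - start_rank) * stage // (stages - 1)
```

With the defaults and 1000 training steps, the rank moves through 2, 6, 11, and 16. A real implementation would also need to extend the existing low-rank matrices (for example by padding with new rows and columns) each time the rank grows, which this sketch does not cover.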

[CV-10] Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images WWW

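Quick read: This paper addresses the lack of geometric reasoning in food portion estimation for dietary assessment. Existing methods rely mostly on single-image analysis or appearance-based inference (including vision-language models), which lack explicit geometric modeling and are sensitive to scale ambiguity. The key to the solution is to recast food portion estimation as an implicit-scale 3D reconstruction problem on monocular multi-food images. Explicit physical references and metric annotations are removed, and only contextual objects such as plates and utensils are provided, so algorithms must infer scale from implicit cues and prior knowledge, which encourages more accurate and robust geometry-aware modeling.

Link: https://arxiv.org/abs/2602.13041
Authors: Yuhao Chen,Gautham Vinod,Siddeshwar Raghavan,Talha Ibn Mahmud,Bruce Coburn,Jinge Ma,Fengqing Zhu,Jiangpeng He
Affiliations: University of Waterloo; Purdue University; Indiana University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted to 2026 IEEE Southwest Symposium on Image Analysis and Interpretation. The dataset can be downloaded at: this https URL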

Abstract:We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision–language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.
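
For readers unfamiliar with the volume metric, mean absolute percentage error (MAPE) can be computed as below. This is the standard definition expressed as a fraction (so 0.21 corresponds to 21%); whether the benchmark applies any additional normalization is not stated in the abstract, and the function name is an assumption.

```python
def mape(predicted, actual):
    """Mean absolute percentage error as a fraction; `actual` values must be non-zero."""
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)
```

For example, predicting 110 ml and 180 ml for true volumes of 100 ml and 200 ml gives an error of 10% on each item, and therefore a MAPE of 0.1.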

[CV-11] FedHENet: A Frugal Federated Learning Framework for Heterogeneous Environments

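Quick read: This paper addresses the privacy risks and low computational efficiency of Federated Learning (FL) on sensitive visual data. Existing FL methods usually depend on expensive iterative optimization of deep networks, and sharing gradients during communication can still leak private information. The key to the solution is the FedHENet framework, which uses a fixed pre-trained feature extractor and learns only a single output layer, avoiding costly local fine-tuning. Client knowledge is aggregated analytically in a single communication round under Homomorphic Encryption (HE), giving strong privacy guarantees, stable performance, and up to 70% better energy efficiency than conventional FL. The method needs no hyperparameter tuning, which removes the associated carbon footprint.

Link: https://arxiv.org/abs/2602.13024
Authors: Alejandro Dopico-Castro,Oscar Fontenla-Romero,Bertha Guijarro-Berdiñas,Amparo Alonso-Betanzos,Iván Pérez Digón
Affiliations: Universidade da Coruña; CITIC; Facultade de Informática; Campus de Elviña s/n; A Coruña; Spain
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for publication at the 34th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2026)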

Abstract:Federated Learning (FL) enables collaborative training without centralizing data, essential for privacy compliance in real-world scenarios involving sensitive visual information. Most FL approaches rely on expensive, iterative deep network optimization, which still risks privacy via shared gradients. In this work, we propose FedHENet, extending the FedHEONN framework to image classification. By using a fixed, pre-trained feature extractor and learning only a single output layer, we avoid costly local fine-tuning. This layer is learned by analytically aggregating client knowledge in a single round of communication using homomorphic encryption (HE). Experiments show that FedHENet achieves competitive accuracy compared to iterative FL baselines while demonstrating superior stability performance and up to 70% better energy efficiency. Crucially, our method is hyperparameter-free, removing the carbon footprint associated with hyperparameter tuning in standard FL. Code available in this https URL
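
One way a single output layer can be learned analytically in one round is through summed sufficient statistics of a least-squares problem. The sketch below uses plain ridge regression on two features and omits encryption entirely; it is an assumption about the general mechanism, not the FedHEONN or FedHENet formulation. The point is that the server only needs sums of client matrices, and sums are the kind of operation that additively homomorphic encryption schemes support on ciphertexts.

```python
def client_statistics(features, targets):
    """Local sufficient statistics: Gram matrix X^T X and moment vector X^T y."""
    d = len(features[0])
    gram = [[sum(x[i] * x[j] for x in features) for j in range(d)] for i in range(d)]
    moment = [sum(x[i] * y for x, y in zip(features, targets)) for i in range(d)]
    return gram, moment

def aggregate_and_solve(stats, ridge=1e-3):
    """Server side: sum client statistics, then solve the 2x2 ridge system."""
    g = [[sum(s[0][i][j] for s in stats) + (ridge if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    m = [sum(s[1][i] for s in stats) for i in range(2)]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    # Cramer's rule for the two output-layer weights.
    return [(m[0] * g[1][1] - g[0][1] * m[1]) / det,
            (g[0][0] * m[1] - g[1][0] * m[0]) / det]
```

Because the statistics add up exactly, the federated solution equals the one obtained by pooling all client data, with no iterations and no learning rate to tune.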

[CV-12] Learning Image-based Tree Crown Segmentation from Enhanced Lidar-based Pseudo-labels

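Quick read: This paper addresses the difficulty of automatically segmenting and separating individual tree crowns in aerial imagery, especially with complex textures and partially overlapping crowns. The central obstacle is obtaining high-quality training annotations, since manual labeling is costly and slow. The key to the solution is to generate pseudo-labels from airborne laser scanning (ALS) data and to enhance them with the zero-shot instance segmentation model Segment Anything Model 2 (SAM 2). This yields a domain-specific training set with no manual annotation and produces optical-image segmentation models that outperform existing models targeted at general-domain deployment.

Link: https://arxiv.org/abs/2602.13022
Authors: Julius Pesonen,Stefan Rua,Josef Taher,Niko Koivumäki,Xiaowei Yu,Eija Honkavaara
Affiliations: NLS (National Land Survey of Finland); University of Helsinki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: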

Abstract:Mapping individual tree crowns is essential for tasks such as maintaining urban tree inventories and monitoring forest health, which help us understand and care for our environment. However, automatically separating the crowns from each other in aerial imagery is challenging due to factors such as the texture and partial tree crown overlaps. In this study, we present a method to train deep learning models that segment and separate individual trees from RGB and multispectral images, using pseudo-labels derived from aerial laser scanning (ALS) data. Our study shows that the ALS-derived pseudo-labels can be enhanced using a zero-shot instance segmentation model, Segment Anything Model 2 (SAM 2). Our method offers a way to obtain domain-specific training annotations for optical image-based models without any manual annotation cost, leading to segmentation models which outperform any available models which have been targeted for general domain deployment on the same task.

[CV-13] DynaGuide: A Generalizable Dynamic Guidance Framework for Unsupervised Semantic Segmentation

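Quick read: This paper addresses the difficulty of reconciling global semantic structure with fine-grained boundary accuracy in unsupervised image segmentation. Existing methods struggle to balance the two, producing segmentations that are either semantically inaccurate or blurry at the boundaries. The key to the solution is DynaGuide, an adaptive segmentation framework built on a dual-guidance strategy and dynamic loss optimization. Zero-shot models such as DiffSeg or SegFormer supply global pseudo-labels for semantic guidance, while a lightweight convolutional neural network (CNN) trained from scratch refines local boundaries. A multi-component dynamic loss jointly optimizes feature similarity, Huber-smoothed spatial continuity (including diagonal relationships), and semantic consistency with the global pseudo-labels, enabling accurate segmentation without ground-truth labels in the target domain.

Link: https://arxiv.org/abs/2602.13020
Authors: Boujemaa Guermazi,Riadh Ksantini,Naimul Khan
Affiliations: Toronto Metropolitan University; University of Bahrain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Image and Vision Computing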

Abstract:Unsupervised image segmentation is a critical task in computer vision. It enables dense scene understanding without human annotations, which is especially valuable in domains where labelled data is scarce. However, existing methods often struggle to reconcile global semantic structure with fine-grained boundary accuracy. This paper introduces DynaGuide, an adaptive segmentation framework that addresses these challenges through a novel dual-guidance strategy and dynamic loss optimization. Building on our previous work, DynaSeg, DynaGuide combines global pseudo-labels from zero-shot models such as DiffSeg or SegFormer with local boundary refinement using a lightweight CNN trained from scratch. This synergy allows the model to correct coarse or noisy global predictions and produce high-precision segmentations. At the heart of DynaGuide is a multi-component loss that dynamically balances feature similarity, Huber-smoothed spatial continuity, including diagonal relationships, and semantic alignment with the global pseudo-labels. Unlike prior approaches, DynaGuide trains entirely without ground-truth labels in the target domain and supports plug-and-play integration of diverse guidance sources. Extensive experiments on BSD500, PASCAL VOC2012, and COCO demonstrate that DynaGuide achieves state-of-the-art performance, improving mIoU by 17.5% on BSD500, 3.1% on PASCAL VOC2012, and 11.66% on COCO. With its modular design, strong generalization, and minimal computational footprint, DynaGuide offers a scalable and practical solution for unsupervised segmentation in real-world settings. Code available at: this https URL
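
A Huber-smoothed spatial continuity term that includes diagonal neighbours can be sketched on a single-channel 2-D map as follows. The neighbour set, the `delta` value, and the averaging are assumptions made for illustration; the paper's exact loss and its dynamic weighting are not given in the abstract.

```python
def huber(r, delta=1.0):
    """Huber penalty: quadratic near zero, linear for large residuals."""
    a = abs(r)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def spatial_continuity_loss(fmap, delta=1.0):
    """Mean Huber penalty between each pixel and its right, down, and diagonal neighbours."""
    h, w = len(fmap), len(fmap[0])
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]  # each unordered pair is visited once
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    total += huber(fmap[y][x] - fmap[ny][nx], delta)
                    count += 1
    return total / count if count else 0.0
```

The linear tail of the Huber penalty keeps true object boundaries from dominating the loss the way a squared difference would, while the diagonal offsets penalize staircase artifacts that axis-aligned terms alone would miss.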

[CV-14] Multimodal Classification via Total Correlation Maximization ICLR2026

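Quick read: This paper addresses the performance loss caused by modality competition in multimodal learning, where joint learning overfits some modalities and neglects others, so that overall performance falls below unimodal learning. The key to the solution is an information-theoretic view: maximizing the total correlation between multimodal features and labels alleviates modality competition, while feature alignment captures inter-modal interactions. Building on Mutual Information Neural Estimation (MINE), the authors propose Total Correlation Neural Estimation (TCNE) and then design TCMax, a hyperparameter-free loss function that maximizes total correlation by optimizing a variational lower bound.

Link: https://arxiv.org/abs/2602.13015
Authors: Feng Yu,Xiangyu Wu,Yang Yang,Jianfeng Lu
Affiliations: Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at ICLR 2026; 19 pages; 2 figures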

Abstract:Multimodal learning integrates data from diverse sensors to effectively harness information from different modalities. However, recent studies reveal that joint learning often overfits certain modalities while neglecting others, leading to performance inferior to that of unimodal learning. Although previous efforts have sought to balance modal contributions or combine joint and unimodal learning, thereby mitigating the degradation of weaker modalities with promising outcomes, few have examined the relationship between joint and unimodal learning from an information-theoretic perspective. In this paper, we theoretically analyze modality competition and propose a method for multimodal classification by maximizing the total correlation between multimodal features and labels. By maximizing this objective, our approach alleviates modality competition while capturing inter-modal interactions via feature alignment. Building on Mutual Information Neural Estimation (MINE), we introduce Total Correlation Neural Estimation (TCNE) to derive a lower bound for total correlation. Subsequently, we present TCMax, a hyperparameter-free loss function that maximizes total correlation through variational bound optimization. Extensive experiments demonstrate that TCMax outperforms state-of-the-art joint and unimodal learning approaches. Our code is available at this https URL.
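
For context, the quantities involved can be written out using the standard definition of total correlation and the Donsker-Varadhan bound that MINE builds on. The notation ($Z_m$ for the features of modality $m$, $Y$ for the label, $\mathcal{F}$ for the family of critic networks $T$) is an assumption, and the paper's TCNE estimator may take a different concrete form.

```latex
\begin{align}
\mathrm{TC}(Z_1,\dots,Z_M,Y)
  &= D_{\mathrm{KL}}\!\left( p(z_1,\dots,z_M,y) \,\middle\|\, p(y)\prod_{m=1}^{M} p(z_m) \right) \\
  &\ge \sup_{T \in \mathcal{F}} \;
     \mathbb{E}_{p(z_1,\dots,z_M,y)}\!\left[ T(z_1,\dots,z_M,y) \right]
     - \log \mathbb{E}_{p(y)\prod_{m} p(z_m)}\!\left[ e^{T(z_1,\dots,z_M,y)} \right]
\end{align}
```

The first line is the definition; the second follows because the Donsker-Varadhan representation of the KL divergence is a supremum over all functions, so restricting it to a neural family can only lower the value.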

[CV-15] Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

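Quick read: This paper addresses the limited performance of universal video understanding models in complex real-world scenarios. The root cause is that existing video instruction data describes multimodal content with single, incomplete captions that lack fine-grained structure and reliable annotation. The key to the solution is a high-quality, fine-grained audiovisual instruction dataset, ASID-1M (with single- and multi-attribute supervision), together with a scalable data curation pipeline, ASID-Verify, whose automatic verification and refinement enforce semantic and temporal consistency between descriptions and the audiovisual content. ASID-Captioner, trained on this data with Supervised Fine-Tuning (SFT), improves fine-grained caption quality across several benchmarks, reduces hallucinations, and follows instructions better, reaching state-of-the-art performance among open-source models and results comparable to Gemini-3-Pro.

Link: https://arxiv.org/abs/2602.13013
Authors: Yunheng Li,Hengrui Zhang,Meng-Hao Guo,Wenzhao Gao,Shaoyong Jia,Shaohui Jiao,Qibin Hou,Ming-Ming Cheng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL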

Abstract:Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on the ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.

[CV-16] MASAR: Motion-Appearance Synergy Refinement for Joint Detection and Trajectory Forecasting ICRA2026

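Quick read: This paper addresses two problems: in classical autonomous driving systems, perception and prediction modules are connected through hand-crafted bounding-box interfaces that restrict information flow and propagate errors, and existing end-to-end models rely mainly on short-term visual features without fully exploiting the synergy between appearance and motion cues. The key to the solution is MASAR, a new fully differentiable framework that uses an object-centric spatio-temporal mechanism to jointly encode appearance and motion features. By predicting past trajectories and refining them with appearance cues, it captures long-term temporal dependencies and substantially improves future trajectory forecasting.

Link: https://arxiv.org/abs/2602.13003
Authors: Mohammed Amine Bencheikh Lehocine,Julian Schmidt,Frank Moosmann,Dikshant Gupta,Fabian Flohr
Affiliations: Mercedes-Benz AG; Munich University of Applied Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)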

Abstract:Classical autonomous driving systems connect perception and prediction modules via hand-crafted bounding-box interfaces, limiting information flow and propagating errors to downstream tasks. Recent research aims to develop end-to-end models that jointly address perception and prediction; however, they often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short-term visual features. We follow the idea of “looking backward to look forward”, and propose MASAR, a novel fully differentiable framework for joint 3D detection and trajectory forecasting compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long-term temporal dependencies that enhance future trajectory forecasting. Experiments conducted on the nuScenes dataset demonstrate MASAR’s effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at this https URL.
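
The two forecasting metrics mentioned in the abstract can be computed as in the sketch below, which follows the common convention of taking each minimum independently over K candidate trajectories. The function name and the use of 2-D `(x, y)` points are assumptions; the exact evaluation protocol used in the paper is not described in the abstract.

```python
import math

def min_ade_fde(candidates, ground_truth):
    """Return (minADE, minFDE) over K candidate trajectories of (x, y) points."""
    ades, fdes = [], []
    for traj in candidates:
        errors = [math.dist(p, g) for p, g in zip(traj, ground_truth)]
        ades.append(sum(errors) / len(errors))  # average displacement error
        fdes.append(errors[-1])                 # final displacement error
    return min(ades), min(fdes)
```

Average displacement error looks at the whole predicted path, while final displacement error only scores the endpoint, so the two minima may come from different candidates.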

[CV-17] Detecting Object Tracking Failure via Sequential Hypothesis Testing WACV

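Quick read: This paper addresses the lack of formal safety assurances in real-time video object tracking: deployed systems cannot state when tracking is reliable or when it may fail, and existing approaches raise alerts from heuristic confidence measures with no theoretical guarantee. The key to the solution is to model object tracking as a sequential hypothesis test that accumulates evidence of tracking failure over time. Formalized as an e-process, the test detects failures quickly while provably controlling the false-alert rate. The method is computationally lightweight, needs no extra training or fine-tuning, and is in principle model-agnostic. Supervised and unsupervised variants are proposed, using either ground-truth annotations or only internal tracking information, and experiments on several benchmarks show that both are effective for established tracking models.

Link: https://arxiv.org/abs/2602.12983
Authors: Alejandro Monroy Muñoz,Rajeev Verma,Alexander Timans
Affiliations: UvA-Bosch Delta Lab, University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in WACV workshop “Real World Surveillance: Applications and Challenges, 6th”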

Abstract:Real-time online object tracking in videos constitutes a core task in computer vision, with wide-ranging applications including video surveillance, motion capture, and robotics. Deployed tracking systems usually lack formal safety assurances to convey when tracking is reliable and when it may fail, at best relying on heuristic measures of model confidence to raise alerts. To obtain such assurances we propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. Leveraging recent advancements in the field, our sequential test (formalized as an e-process) quickly identifies when tracking failures set in whilst provably containing false alerts at a desired rate, and thus limiting potentially costly re-calibration or intervention steps. The approach is computationally light-weight, requires no extra training or fine-tuning, and is in principle model-agnostic. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information, and demonstrate its effectiveness for two established tracking models across four video benchmarks. As such, sequential testing can offer a statistically grounded and efficient mechanism to incorporate safety assurances into real-time tracking systems.
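
To show how an e-process can raise alarms with a controlled false-alert rate, here is a generic betting-style sketch. It assumes a per-frame error signal in [0, 1] and a null hypothesis that its conditional mean is at most `mu0`; the signal, the fixed `bet`, and the function name are hypothetical, and the paper's own e-value construction may differ.

```python
def first_alarm(errors, mu0=0.1, bet=2.0, alpha=0.05):
    """Return the first index where the running e-process reaches 1/alpha, else None.

    errors: per-frame values in [0, 1]; H0 says their conditional mean is <= mu0.
    """
    assert 0.0 <= bet <= 1.0 / mu0  # keeps every factor non-negative
    wealth = 1.0
    for t, x in enumerate(errors):
        wealth *= 1.0 + bet * (x - mu0)  # each factor has expectation <= 1 under H0
        if wealth >= 1.0 / alpha:
            return t
    return None
```

Under the null hypothesis the running product is a non-negative supermartingale starting at 1, so by Ville's inequality the probability that it ever reaches `1 / alpha` is at most `alpha`, regardless of how long the video runs.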

[CV-18] Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding

Quick read: This paper addresses the high inference latency of document parsing with Vision-Language Models (VLMs), which must autoregressively generate long token sequences. The core solution is a training-free acceleration method inspired by speculative decoding: a lightweight document parsing pipeline acts as the draft model and predicts batches of future tokens, which the main VLM verifies in parallel. The method also exploits document layout by splitting each page into independent regions, running the same draft-and-verify strategy on each region in parallel, and assembling the results in natural reading order. It achieves a 2.42x lossless speedup on the OmniDocBench benchmark and up to 4.89x on long-document parsing tasks.

Link: https://arxiv.org/abs/2602.12957
Authors: Wenhui Liao,Hongliang Li,Pengyu Xie,Xinyu Cai,Yufan Shen,Yi Xin,Qi Qin,Shenglong Ye,Tianbin Li,Ming Hu,Junjun He,Yihao Liu,Wenhai Wang,Min Dou,Bin Fu,Botian Shi,Yu Qiao,Lianwen Jin
Affiliations: South China University of Technology; Shanghai Artificial Intelligence Laboratory; Shenzhen Institute of Advanced Technology, CAS; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preliminary version of an ongoing project; the paper will be refined and extended in subsequent revisions

Abstract:Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must auto-regressively generate long token sequences when processing long-form documents. In this work, motivated by the extremely long outputs and complex layout structures commonly found in document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. Moreover, we further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region using the same draft-verify strategy. The final predictions are then assembled according to the natural reading order. Experimental results demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method provides a 2.42x lossless acceleration for the this http URL model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.
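
The draft-and-verify loop at the heart of greedy speculative decoding can be sketched with two toy next-token functions. Here `target_next` and `draft_next` are hypothetical callables that map a token list to the next token; in the paper the draft comes from a document parsing pipeline rather than a small language model, and region-level parallelism is not shown.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy draft-and-verify decoding; output matches target-only greedy decoding."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # The draft model proposes up to k tokens autoregressively.
        draft = []
        for _ in range(min(k, goal - len(out))):
            draft.append(draft_next(out + draft))
        # The target checks each position; a real system scores all positions in one pass.
        for i, tok in enumerate(draft):
            expected = target_next(out + draft[:i])
            if tok != expected:
                draft = draft[:i] + [expected]  # keep the agreed prefix plus the correction
                break
        out.extend(draft)
    return out[len(prompt):]
```

Every emitted token is one the target model would have chosen itself, which is why the speedup is lossless; the gain comes from the target validating several positions per call instead of generating one token at a time.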

[CV-19] Transporting Task Vectors across Different Architectures without Training

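Quick read: This paper addresses the cost and poor reusability of the task-specific parameter updates produced when large pre-trained models are adapted to downstream tasks, and in particular the difficulty of transferring such updates between models of different widths. The key to the solution is Theseus, which characterizes a task update by its functional effect on intermediate representations instead of matching parameters directly. After aligning representation spaces with orthogonal Procrustes analysis, task-vector transport is formalized as a functional matching problem with a stable closed-form solution that preserves the geometry of the update, so task updates can be transferred across heterogeneous models without additional training.

Link: https://arxiv.org/abs/2602.12952
Authors: Filippo Rinaldi,Aniello Panariello,Giacomo Salici,Angelo Porrello,Simone Calderara
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: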

Abstract:Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains largely unexplored. In this work, we introduce Theseus, a training-free method for transporting task-specific updates across heterogeneous models. Rather than matching parameters directly, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over strong baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically.
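
Orthogonal Procrustes analysis finds the orthogonal map that best aligns one set of vectors with another. In general dimensions it is solved with an SVD of the cross-covariance matrix; the sketch below shows only the 2-D rotation-only special case, which has a closed form and needs no linear algebra library. It is an illustration of the alignment step under that simplification, not Theseus itself, and it does not handle models of different widths.

```python
import math

def procrustes_rotation_2d(source, target):
    """Angle of the rotation R minimising sum ||R s_i - t_i||^2 over 2-D point pairs."""
    c = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    s = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(s, c)

def rotate(points, angle):
    """Rotate 2-D points counter-clockwise by `angle` radians."""
    ca, sa = math.cos(angle), math.sin(angle)
    return [(ca * x - sa * y, sa * x + ca * y) for x, y in points]
```

Because the fitted map is a rotation, it preserves lengths and angles, which is the sense in which an aligned update keeps its geometry.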

[CV-20] Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation

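[Quick Read]: This paper addresses the challenges of cloud-edge deployment for Cross-Modal Re-identification (CM-ReID): today a fragmented ecosystem of specialized cloud models must be maintained for the different modalities (RGB, infrared, sketch, and text), while unified end-to-end modeling and efficient knowledge transfer to resource-constrained edge devices are still lacking. The key to its solution is MLLMEmbed-ReID, a unified framework built on a cloud-edge architecture. First, a foundational Multi-Modal Large Language Model (MLLM) is adapted into a state-of-the-art cloud model: instruction-based prompting guides it to produce a unified embedding space across modalities, and a hierarchical Low-Rank Adaptation fine-tuning (LoRA-SFT) strategy optimizes a holistic cross-modal alignment objective. Second, a novel knowledge distillation strategy exploits the low-rank property of the teacher's feature space, combining a Principal Component Mapping loss that prioritizes essential information with a Feature Relation loss that preserves relational structure, to transfer the cloud model's knowledge to a lightweight edge student. The edge model achieves state-of-the-art results on multiple visual CM-ReID benchmarks, and the cloud model leads across all CM-ReID benchmarks.

Link: https://arxiv.org/abs/2602.12936
Authors: Hongbo Jiang,Jie Li,Xinqi Cai,Tianyu Xie,Yunhang Shen,Pingyang Dai,Liujuan Cao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Equal contribution by Jie Li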

Abstract:Practical cloud-edge deployment of Cross-Modal Re-identification (CM-ReID) faces challenges due to maintaining a fragmented ecosystem of specialized cloud models for diverse modalities. While Multi-Modal Large Language Models (MLLMs) offer strong unification potential, existing approaches fail to adapt them into a single end-to-end backbone and lack effective knowledge distillation strategies for edge deployment. To address these limitations, we propose MLLMEmbed-ReID, a unified framework based on a powerful cloud-edge architecture. First, we adapt a foundational MLLM into a state-of-the-art cloud model. We leverage instruction-based prompting to guide the MLLM in generating a unified embedding space across RGB, infrared, sketch, and text modalities. This model is then trained efficiently with a hierarchical Low-Rank Adaptation finetuning (LoRA-SFT) strategy, optimized under a holistic cross-modal alignment objective. Second, to deploy its knowledge onto an edge-native student, we introduce a novel distillation strategy motivated by the low-rank property in the teacher’s feature space. To prioritize essential information, this method employs a Principal Component Mapping loss, while relational structures are preserved via a Feature Relation loss. Our lightweight edge-based model achieves state-of-the-art performance on multiple visual CM-ReID benchmarks, while its cloud-based counterpart excels across all CM-ReID benchmarks. The MLLMEmbed-ReID framework thus presents a complete and effective solution for deploying unified MLLM-level intelligence on resource-constrained devices. The code and models will be open-sourced soon.

[CV-21] Deep-Learning Atlas Registration for Melanoma Brain Metastases: Preserving Pathology While Enabling Cohort-Level Analyses

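[Quick Read]: This paper addresses the difficulty of cohort-level analysis of melanoma brain metastases (MBM) in multi-centre imaging data, caused by anatomical variability and differing MRI protocols. Conventional methods rely on lesion masks or preprocessing and struggle to register spatially heterogeneous lesions to a standard space. The key to its solution is a fully differentiable deep-learning deformable registration framework: a similarity metric based on distance-transformed anatomical labels handles the correspondences that metastases remove, and a volume-preserving regularization term keeps deformations plausible, so individual pathological brains are aligned to a common atlas without lesion masks. The method achieves high registration accuracy (Dice coefficient 0.89-0.92) and reveals significant over-representation of MBM in the cerebral cortex and putamen and consistent localization near the gray-white matter junction, providing a reproducible, standardized basis for multi-centre brain metastasis studies.

Link: https://arxiv.org/abs/2602.12933
Authors: Nanna E. Wielenberg,Ilinca Popp,Oliver Blanck,Lucas Zander,Jan C. Peeken,Stephanie E. Combs,Anca-Ligia Grosu,Dimos Baltas,Tobias Fechter
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Comments: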

Abstract:Melanoma brain metastases (MBM) are common and spatially heterogeneous lesions, complicating cohort-level analyses due to anatomical variability and differing MRI protocols. We propose a fully differentiable, deep-learning-based deformable registration framework that aligns individual pathological brains to a common atlas while preserving metastatic tissue without requiring lesion masks or preprocessing. Missing anatomical correspondences caused by metastases are handled through a forward-model similarity metric based on distance-transformed anatomical labels, combined with a volume-preserving regularization term to ensure deformation plausibility. Registration performance was evaluated using Dice coefficient (DSC), Hausdorff distance (HD), average symmetric surface distance (ASSD), and Jacobian-based measures. The method was applied to 209 MBM patients from three centres, enabling standardized mapping of metastases to anatomical, arterial, and perfusion atlases. The framework achieved high registration accuracy across datasets (DSC 0.89-0.92, HD 6.79-7.60 mm, ASSD 0.63-0.77 mm) while preserving metastatic volumes. Spatial analysis demonstrated significant over-representation of MBM in the cerebral cortex and putamen, under-representation in white matter, and consistent localization near the gray-white matter junction. No arterial territory showed increased metastasis frequency after volume correction. This approach enables robust atlas registration of pathological brain MRI without lesion masks and supports reproducible multi-centre analyses. Applied to MBM, it confirms and refines known spatial predilections, particularly preferential seeding near the gray-white matter junction and cortical regions. The publicly available implementation facilitates reproducible research and extension to other brain tumours and neurological pathologies. 
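The Dice coefficient (DSC) reported above has a standard definition, 2|A∩B| / (|A| + |B|). A minimal sketch follows, assuming each segmentation is given as a set of voxel indices; the function name and the convention of returning 1.0 for two empty masks are illustrative choices, not taken from the paper.

```python
from typing import Hashable, Iterable


def dice_coefficient(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Dice similarity between two label masks given as collections of voxel indices."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    overlap = len(set_a & set_b)
    return 2.0 * overlap / (len(set_a) + len(set_b))
```

For example, masks `{1, 2, 3}` and `{2, 3, 4}` share two voxels out of six in total, giving a DSC of 2·2/6 ≈ 0.67.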

[CV-22] Beyond Benchmarks of IUGC: Rethinking Requirements of Deep Learning Methods for Intrapartum Ultrasound Biometry from Fetal Ultrasound Videos

[Quick Read]: This paper addresses the difficulty of automating intrapartum biometry in resource-limited settings, where a shortage of trained sonographers hinders routine ultrasound use. The core challenge is improving the accuracy and practicality of intrapartum ultrasound analysis through an automated multi-task learning framework. The key to the solution is a clinically oriented multi-task automatic measurement framework that integrates three subtasks (standard plane classification, fetal head-pubic symphysis segmentation, and biometry) and exploits their complementary information for more accurate estimation. The challenge also releases the largest multi-center intrapartum ultrasound video dataset to date (774 videos, 68,106 frames), giving a solid foundation for training and evaluating algorithms.

Link: https://arxiv.org/abs/2602.12922
Authors: Jieyun Bai,Zihao Zhou,Yitong Tang,Jie Gan,Zhuonan Liang,Jianan Fan,Lisa B. Mcguire,Jillian L. Clarke,Weidong Cai,Jacaueline Spurway,Yubo Tang,Shiye Wang,Wenda Shen,Wangwang Yu,Yihao Li,Philippe Zhang,Weili Jiang,Yongjie Li,Salem Muhsin Ali Binqahal Al Nasim,Arsen Abzhanov,Numan Saeed,Mohammad Yaqub,Zunhui Xian,Hongxing Lin,Libin Lan,Jayroop Ramesh,Valentin Bacher,Mark Eid,Hoda Kalabizadeh,Christian Rupprecht,Ana I. L. Namburete,Pak-Hei Yeung,Madeleine K. Wyburd,Nicola K. Dinsdale,Assanali Serikbey,Jiankai Li,Sung-Liang Chen,Zicheng Hu,Nana Liu,Yian Deng,Wei Hu,Cong Tan,Wenfeng Zhang,Mai Tuyet Nhi,Gregor Koehler,Rapheal Stock,Klaus Maier-Hein,Marawan Elbatel,Xiaomeng Li,Saad Slimani,Victor M. Campello,Benard Ohene-Botwe,Isaac Khobo,Yuxin Huang,Zhenyan Han,Hongying Hou,Di Qiu,Zheng Zheng,Gongning Luo,Dong Ni,Yaosheng Lu,Karim Lekadir,Shuo Li
Affiliations: The First Affiliated Hospital of Jinan University, Jinan University, Guangzhou, China; The University of Auckland, Auckland, New Zealand; University of Sydney, Sydney, Australia; University of Oxford, Oxford, United Kingdom; University of Electronic Science and Technology of China, Chengdu, China; Changchun University of Science and Technology, Changchun, China; United Imaging Healthcare, Shanghai, China; University of Western Brittany, Brest, France; Sichuan University, Chengdu, China; Mohamed bin Zayed University of Artificial Intelligence, Masdar, Abu Dhabi; Chongqing University of Technology, Chongqing, China; Oxford Machine Learning in NeuroImaging Lab, Department of Computer Science, University of Oxford, Oxford, United Kingdom; Visual Geometry Group, University of Oxford, Oxford, United Kingdom; Nanyang Technological University, Singapore; Shanghai Jiao Tong University, Shanghai, China; Chongqing Normal University, Chongqing, China; The University of Manchester, Manchester, United Kingdom; Southwest University, Chongqing, China; German Cancer Research Center (DKFZ), Heidelberg, Germany; The Hong Kong University of Science and Technology, Hongkong, China; Hassan II University, Casablanca, Morocco; Universitat de Barcelona, Barcelona, Spain; University of Ghana, Accra; University of Cape Town, Cape Town, South Africa; Southern Medical University, Guangzhou, China; Sun Yat-sen University, Guangzhou, China; Guangdong Provincial Clinical Research Center for Child Health, Guangzhou, China; King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Shenzhen University, Shenzhen, China; Artificial Intelligence in Medicine Lab (BCN-AIM), Barcelona, Spain; Case Western Reserve University, Cleveland, OH, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A substantial proportion (45%) of maternal deaths, neonatal deaths, and stillbirths occur during the intrapartum phase, with a particularly high burden in low- and middle-income countries. Intrapartum biometry plays a critical role in monitoring labor progression; however, the routine use of ultrasound in resource-limited settings is hindered by a shortage of trained sonographers. To address this challenge, the Intrapartum Ultrasound Grand Challenge (IUGC), co-hosted with MICCAI 2024, was launched. The IUGC introduces a clinically oriented multi-task automatic measurement framework that integrates standard plane classification, fetal head-pubic symphysis segmentation, and biometry, enabling algorithms to exploit complementary task information for more accurate estimation. Furthermore, the challenge releases the largest multi-center intrapartum ultrasound video dataset to date, comprising 774 videos (68,106 frames) collected from three hospitals, providing a robust foundation for model training and evaluation. In this study, we present a comprehensive overview of the challenge design, review the submissions from eight participating teams, and analyze their methods from five perspectives: preprocessing, data augmentation, learning strategy, model architecture, and post-processing. In addition, we perform a systematic analysis of the benchmark results to identify key bottlenecks, explore potential solutions, and highlight open challenges for future research. Although encouraging performance has been achieved, our findings indicate that the field remains at an early stage, and further in-depth investigation is required before large-scale clinical deployment. All benchmark solutions and the complete dataset have been publicly released to facilitate reproducible research and promote continued advances in automatic intrapartum ultrasound biometry.

[CV-23] EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition

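[Quick Read]: This paper addresses the lack of high-quality benchmark datasets and of interpretable multi-modal fusion methods for event stream-based Visual Place Recognition (VPR). To tackle inconsistent evaluation standards, limited scene diversity, and weak semantic understanding in existing work, the authors introduce EPRBench, a high-quality benchmark with 10K event sequences and 65K event frames collected with handheld and vehicle-mounted setups across diverse viewpoints, weather, and lighting conditions, together with scene descriptions generated by large language models (LLMs) and refined by human annotators as a basis for semantic-aware, cross-modal research. The key to the solution is a novel multi-modal fusion paradigm: LLMs generate textual scene descriptions from raw event streams, which then guide spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning, improving recognition accuracy while producing interpretable reasoning that enhances model transparency and trustworthiness.

Link: https://arxiv.org/abs/2602.12919
Authors: Xiao Wang,Xingxing Xiong,Jinfeng Gao,Xufeng Lou,Bo Jiang,Si-bao Chen,Yaowei Wang,Yonghong Tian
Affiliations: Anhui University (安徽大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Peng Cheng Laboratory (鹏城实验室); School of Computer Science, Peking University (北京大学计算机学院); Shenzhen Graduate School, Peking University (北京大学深圳研究生院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: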

Abstract:Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios. To support semantic-aware and language-integrated VPR research, we provide LLM-generated scene descriptions, subsequently refined through human annotation, establishing a solid foundation for integrating LLMs into event-based perception pipelines. To facilitate systematic evaluation, we implement and benchmark 15 state-of-the-art VPR algorithms on EPRBench, offering a strong baseline for future algorithmic comparisons. Furthermore, we propose a novel multi-modal fusion paradigm for VPR: leveraging LLMs to generate textual scene descriptions from raw event streams, which then guide spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning. This framework not only achieves highly accurate place recognition but also produces interpretable reasoning processes alongside its predictions, significantly enhancing model transparency and explainability. The dataset and source code will be released on this https URL

[CV-24] Reliable Thinking with Images

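[Quick Read]: This paper addresses the Noisy Thinking (NT) problem that Multi-modal Large Language Models (MLLMs) face when reasoning with Thinking with Images (TWI): inaccurate visual-cue extraction or errors in the textual reasoning chain cause errors to accumulate. The key to its solution is Reliable Thinking with Images (RTWI), which estimates the reliability of both visual cues and textual reasoning chains in a unified text-centric manner and uses robust filtering and voting modules to keep noisy reasoning from contaminating the final answer.

Link: https://arxiv.org/abs/2602.12916
Authors: Haobin Li,Yutong Yang,Yijie Lin,Dai Xiang,Mouxing Yang,Xi Peng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 26 pages, 19 figures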

Abstract:As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising avenue to enhance the reasoning capability of Multi-modal Large Language Models (MLLMs), which generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods heavily relies on the assumption that interleaved image-text CoTs are faultless, which is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to the imperfect visual cues mining and answer reasoning process. As the saying goes, “One mistake leads to another”, erroneous interleaved CoT would cause error accumulation, thus significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.

[CV-25] Adaptive Scaling with Geometric and Visual Continuity of completed 3D objects

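[Quick Read]: This paper addresses the fact that the static Signed Distance Fields (SDFs) produced by object completion networks cannot be edited or deformed: they reconstruct geometry accurately but cannot be rescaled or deformed without structural distortion, which limits their use in applications that need flexible object manipulation such as indoor redesign, simulation, and digital content creation. The key to its solution is a part-aware scaling framework that combines automatic part segmentation, user-controlled scaling zones, and smooth interpolation of SDFs, color, and part indices to achieve proportional, artifact-free deformation. A repetition-based strategy additionally handles large-scale deformations while preserving repeating geometric patterns, improving editing flexibility and visual quality for complex shapes and repetitive structures.

Link: https://arxiv.org/abs/2602.12905
Authors: Jelle Vermandere,Maarten Bassier,Maarten Vergauwen
Affiliations: KU Leuven (鲁汶大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ISPRS Congress 2026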

Abstract:Object completion networks typically produce static Signed Distance Fields (SDFs) that faithfully reconstruct geometry but cannot be rescaled or deformed without introducing structural distortions. This limitation restricts their use in applications requiring flexible object manipulation, such as indoor redesign, simulation, and digital content creation. We introduce a part-aware scaling framework that transforms these static completed SDFs into editable, structurally coherent objects. Starting from SDFs and Texture Fields generated by state-of-the-art completion models, our method performs automatic part segmentation, defines user-controlled scaling zones, and applies smooth interpolation of SDFs, color, and part indices to enable proportional and artifact-free deformation. We further incorporate a repetition-based strategy to handle large-scale deformations while preserving repeating geometric patterns. Experiments on Matterport3D and ShapeNet objects show that our method overcomes the inherent rigidity of completed SDFs and is visually more appealing than global and naive selective scaling, particularly for complex shapes and repetitive structures.

[CV-26] Robustness of Object Detection of Autonomous Vehicles in Adverse Weather Conditions

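[Quick Read]: This paper addresses the insufficient robustness of object detection models in autonomous vehicles under adverse environmental conditions (bad weather and lighting changes), which matters for safe real-world operation. The key to its solution is a data-augmentation-based evaluation method: seven augmentation operators simulate weather (fog, rain, snow) and lighting (dark, bright, flaring, shadow) conditions at progressively increasing intensity, generating synthetic data to find the lowest intensity at which a model first fails. The average first failure coefficient (AFFC) is introduced as a quantitative metric to evaluate and compare the robustness of several object detection models (YOLOv5s, YOLOv11s, Faster R-CNN, Detectron2) under these conditions.

Link: https://arxiv.org/abs/2602.12902
Authors: Fox Pettersen,Hong Zhu
Affiliations: Oxford Brookes University (牛津布鲁克斯大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: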

Abstract:As self-driving technology advances toward widespread adoption, determining safe operational thresholds across varying environmental conditions becomes critical for public safety. This paper proposes a method for evaluating the robustness of object detection ML models in autonomous vehicles under adverse weather conditions. It employs data augmentation operators to generate synthetic data that simulates different severity degrees of the adverse operation conditions at progressive intensity levels to find the lowest intensity of the adverse conditions at which the object detection model fails. The robustness of the object detection model is measured by the average first failure coefficients (AFFC) over the input images in the benchmark. The paper reports an experiment with four object detection models: YOLOv5s, YOLOv11s, Faster R-CNN, and Detectron2, utilising seven data augmentation operators that simulate weather conditions fog, rain, and snow, and lighting conditions of dark, bright, flaring, and shadow. The experiment data show that the method is feasible, effective, and efficient to evaluate and compare the robustness of object detection models in various adverse operation conditions. In particular, the Faster R-CNN model achieved the highest robustness with an overall average AFFC of 71.9% over all seven adverse conditions, while YOLO variants showed AFFC values of 43%. The method is also applied to assess the impact of model training that targets adverse operation conditions using synthetic data on model robustness. It is observed that such training can improve robustness in adverse conditions but may suffer from diminishing returns and forgetting phenomena (i.e., decline in robustness) if overtrained.
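The abstract does not give the exact formula for AFFC, so the following is only one plausible reading, sketched under stated assumptions: intensity levels are normalised to (0, 1], the per-image coefficient is the lowest level at which detection fails, and an image that never fails scores 1.0. The names `first_failure_coefficient`, `affc`, and `detect_ok` are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

Image = TypeVar("Image")


def first_failure_coefficient(
    levels: Sequence[float], passes: Callable[[float], bool]
) -> float:
    """Lowest intensity at which the check fails; 1.0 if it never fails (assumed convention)."""
    for level in sorted(levels):
        if not passes(level):
            return level
    return 1.0


def affc(
    images: Sequence[Image],
    levels: Sequence[float],
    detect_ok: Callable[[Image, float], bool],
) -> float:
    """Average of the per-image first-failure coefficients over a benchmark."""
    coeffs = [
        first_failure_coefficient(levels, lambda lv, img=img: detect_ok(img, lv))
        for img in images
    ]
    return sum(coeffs) / len(coeffs)
```

Under this reading a higher AFFC means the model tolerates stronger perturbations before its first failure, which matches the abstract's use of higher values to indicate higher robustness.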

[CV-27] RoadscapesQA: A Multitask Multimodal Dataset for Visual Question Answering on Indian Roads

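[Quick Read]: This paper addresses visual scene understanding for autonomous driving in unstructured environments, in particular the lack of a high-quality multitask, multimodal dataset for India's complex and diverse road scenes. The key to its solution is the Roadscapes dataset, which contains up to 9,000 images from urban and rural Indian road environments (such as highways, village paths, and congested city streets) with manually verified bounding boxes. Rule-based heuristics infer scene attributes and generate question-answer (QA) pairs, supporting multitask learning for object grounding, reasoning, and scene understanding. The design is intended to support scalable scene understanding with vision-language models in complex real-world scenes.

Link: https://arxiv.org/abs/2602.12877
Authors: Vijayasri Iyer,Maahin Rathinagiriswaran,Jyothikamalesh S
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: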

Abstract:Understanding road scenes is essential for autonomous driving, as it enables systems to interpret visual surroundings to aid in effective decision-making. We present Roadscapes, a multitask multimodal dataset consisting of up to 9,000 images captured in diverse Indian driving environments, accompanied by manually verified bounding boxes. To facilitate scalable scene understanding, we employ rule-based heuristics to infer various scene attributes, which are subsequently used to generate question-answer (QA) pairs for tasks such as object grounding, reasoning, and scene understanding. The dataset includes a variety of scenes from urban and rural India, encompassing highways, service roads, village paths, and congested city streets, captured in both daytime and nighttime settings. Roadscapes has been curated to advance research on visual scene understanding in unstructured environments. In this paper, we describe the data collection and annotation process, present key dataset statistics, and provide initial baselines for image QA tasks using vision-language models.

[CV-28] X-VORTEX: Spatio-Temporal Contrastive Learning for Wake Vortex Trajectory Forecasting

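[Quick Read]: This paper addresses the difficulty of accurately tracking aircraft wake vortices in operational settings. The core challenges are sparse LiDAR point clouds, vortex signatures that fade under atmospheric turbulence, and the high cost of point-wise annotation. Existing methods mostly treat each scan as an independent supervised segmentation task, ignoring temporal dynamics and large unlabeled archives. The key to its solution is X-VORTEX, a spatio-temporal contrastive learning framework grounded in Augmentation Overlap Theory. It pairs a weakly perturbed sequence with a strongly augmented counterpart of the same flight event (produced by temporal subsampling and spatial masking), forcing the model to align representations across missing frames and partial observations and thereby learn physics-aware features. A time-distributed geometric encoder extracts per-scan features and a sequential aggregator models the evolving vortex state over variable-length sequences. The framework achieves better vortex center localization while using only 1% of the labeled data required by supervised baselines, and supports accurate trajectory forecasting.

Link: https://arxiv.org/abs/2602.12869
Authors: Zhan Qu,Michael Färber
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: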

Abstract:Wake vortices are strong, coherent air turbulences created by aircraft, and they pose a major safety and capacity challenge for air traffic management. Tracking how vortices move, weaken, and dissipate over time from LiDAR measurements is still difficult because scans are sparse, vortex signatures fade as the flow breaks down under atmospheric turbulence and instabilities, and point-wise annotation is prohibitively expensive. Existing approaches largely treat each scan as an independent, fully supervised segmentation problem, which overlooks temporal structure and does not scale to the vast unlabeled archives collected in practice. We present X-VORTEX, a spatio-temporal contrastive learning framework grounded in Augmentation Overlap Theory that learns physics-aware representations from unlabeled LiDAR point cloud sequences. X-VORTEX addresses two core challenges: sensor sparsity and time-varying vortex dynamics. It constructs paired inputs from the same underlying flight event by combining a weakly perturbed sequence with a strongly augmented counterpart produced via temporal subsampling and spatial masking, encouraging the model to align representations across missing frames and partial observations. Architecturally, a time-distributed geometric encoder extracts per-scan features and a sequential aggregator models the evolving vortex state across variable-length sequences. We evaluate on a real-world dataset of over one million LiDAR scans. X-VORTEX achieves superior vortex center localization while using only 1% of the labeled data required by supervised baselines, and the learned representations support accurate trajectory forecasting.

[CV-29] Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation

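[Quick Read]: This paper addresses hallucination in medical Large Vision Language Models (LVLMs) for chest X-ray interpretation, which arises because they inspect the image only once and then rely on text-only chain-of-thought (CoT) reasoning. Existing methods try to mitigate this with visually related coordinates such as bounding boxes, but these remain pseudo-visual and fail to preserve key visual details such as texture and density. The key to its solution is MMRad-IVL-22K, the first large-scale dataset designed for natively interleaved vision-language reasoning in chest X-ray interpretation. It mirrors the radiologist's workflow of repeatedly interleaving visual inspection and language reasoning, so that each reasoning step is grounded by a visual rationale. Experiments show that report generation guided by multimodal CoT significantly outperforms text-only CoT in clinical accuracy and report quality (for example, a 6% increase in the RadGraph metric), supporting the claim that interleaved vision-language evidence is a non-substitutable component of reliable medical AI.

Link: https://arxiv.org/abs/2602.12843
Authors: Yichen Zhao,Zelin Peng,Piao Yang,Xiaokang Yang,Wei Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: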

Abstract:Radiological diagnosis is a perceptual process in which careful visual inspection and language reasoning are repeatedly interleaved. Most medical large vision language models (LVLMs) perform visual inspection only once and then rely on text-only chain-of-thought (CoT) reasoning, which operates purely in the linguistic space and is prone to hallucination. Recent methods attempt to mitigate this issue by introducing visually related coordinates, such as bounding boxes. However, these remain a pseudo-visual solution: coordinates are still text and fail to preserve rich visual details like texture and density. Motivated by the interleaved nature of radiological diagnosis, we introduce MMRad-IVL-22K, the first large-scale dataset designed for natively interleaved visual language reasoning in chest X-ray interpretation. MMRad-IVL-22K reflects a repeated cycle of reasoning and visual inspection workflow of radiologists, in which visual rationales complement textual descriptions and ground each step of the reasoning process. MMRad-IVL-22K comprises 21,994 diagnostic traces, enabling systematic scanning across 35 anatomical regions. Experimental results on advanced closed-source LVLMs demonstrate that report generation guided by multimodal CoT significantly outperforms that guided by text-only CoT in clinical accuracy and report quality (e.g., 6% increase in the RadGraph metric), confirming that high-fidelity interleaved vision language evidence is a non-substitutable component of reliable medical AI. Furthermore, benchmarking across seven state-of-the-art open-source LVLMs demonstrates that models fine-tuned on MMRad-IVL-22K achieve superior reasoning consistency and report quality compared with both general-purpose and medical-specific LVLMs. The project page is available at this https URL.

[CV-30] GSM-GS: Geometry-Constrained Single and Multi-view Gaussian Splatting for Surface Reconstruction

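[Quick Read]: This paper addresses the loss of high-frequency detail that 3D Gaussian Splatting (3DGS) suffers when reconstructing complex surface microstructures, caused by the unstructured and irregular nature of Gaussian point clouds. The key to its solution is GSM-GS, a framework that jointly optimizes single-view adaptive sub-region weighting constraints and multi-view spatial structure refinement. At the single-view level, image gradient features partition the scene into texture-rich and texture-less regions, adaptive filtering guided by depth discrepancy features is applied, and a dual-branch constraint strategy strengthens geometric detail. At the multi-view level, a geometry-guided cross-view point cloud association method and a dynamic weight sampling strategy build 3D structural normal constraints across adjacent point cloud frames, improving multi-view consistency and reconstruction fidelity.

Link: https://arxiv.org/abs/2602.12796
Authors: Xiao Ren,Yu Liu,Ning An,Jian Cheng,Xin Qiao,He Kong
Affiliations: Southern University of Science and Technology (南方科技大学); China Coal Research Institute (中国煤炭科学研究总院); State Key Laboratory of Intelligent Coal Mining and Strata Control (智能采矿与围岩控制国家重点实验室); Institute of Artificial Intelligence and Robotics (人工智能与机器人研究所); Xi’an Jiaotong University (西安交通大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: this https URL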

Abstract:Recently, 3D Gaussian Splatting has emerged as a prominent research direction owing to its ultrarapid training speed and high-fidelity rendering capabilities. However, the unstructured and irregular nature of Gaussian point clouds poses challenges to reconstruction accuracy. This limitation frequently causes high-frequency detail loss in complex surface microstructures when relying solely on routine strategies. To address this limitation, we propose GSM-GS: a synergistic optimization framework integrating single-view adaptive sub-region weighting constraints and multi-view spatial structure refinement. For single-view optimization, we leverage image gradient features to partition scenes into texture-rich and texture-less sub-regions. The reconstruction quality is enhanced through adaptive filtering mechanisms guided by depth discrepancy features. This preserves high-weight regions while implementing a dual-branch constraint strategy tailored to regional texture variations, thereby improving geometric detail characterization. For multi-view optimization, we introduce a geometry-guided cross-view point cloud association method combined with a dynamic weight sampling strategy. This constructs 3D structural normal constraints across adjacent point cloud frames, effectively reinforcing multi-view consistency and reconstruction fidelity. Extensive experiments on public datasets demonstrate that our method achieves both competitive rendering quality and geometric reconstruction. See our interactive project page

[CV-31] Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting ICLR2026

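[Quick Read]: This paper addresses the fact that existing weakly-supervised object counting methods, which use only image-level counts as supervision, are usually limited to a single category (such as people); it aims for class-agnostic counting at low annotation cost. The key to its solution is WS-COC, a weakly-supervised framework based on a Multimodal Large Language Model (MLLM) that builds the counting paradigm with three strategies. First, a divide-and-discern dialogue tuning strategy guides the MLLM to narrow the count range step by step over multi-round dialogue. Second, a compare-and-rank optimization strategy trains the MLLM to rank images by their relative object counts. Third, a global-and-local counting enhancement strategy fuses local and global predictions to improve accuracy in dense scenes. Without point-level annotations, the method matches or surpasses many state-of-the-art fully-supervised methods.

Link: https://arxiv.org/abs/2602.12774
Authors: Xiaowen Zhang,Zijie Yue,Yong Luo,Cairong Zhao,Qijun Chen,Miaojing Shi
Affiliations: Tongji University (同济大学); Wuhan University (武汉大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICLR 2026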

Abstract:Object counting is a fundamental task in computer vision, with broad applicability in many real-world scenarios. Fully-supervised counting methods require costly point-level annotations per object. A few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results. They are, however, often limited to counting a single category, e.g. person. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which can be challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing: First, a divide-and-discern dialogue tuning strategy is proposed to guide the MLLM to determine whether the object count falls within a specific range and progressively break down the range through multi-round dialogue. Second, a compare-and-rank count optimization strategy is introduced to train the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve counting performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs. Code is available at this https URL.
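The divide-and-discern idea, asking whether the count falls within a range and then progressively breaking the range down, can be sketched as a bisection over yes/no answers. This is an illustrative assumption about the narrowing scheme, not the paper's procedure: `narrow_count` and `in_range` are hypothetical names, and `in_range` stands in for one dialogue round with the MLLM.

```python
from typing import Callable, Tuple


def narrow_count(
    in_range: Callable[[int, int], bool], lo: int, hi: int
) -> Tuple[int, int]:
    """Narrow [lo, hi] to a single count using range questions.

    `in_range(a, b)` answers "is the object count within [a, b]?".
    Returns the final count and the number of dialogue rounds used.
    """
    rounds = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if in_range(lo, mid):
            hi = mid  # count lies in the lower half
        else:
            lo = mid + 1  # count lies in the upper half
        rounds += 1
    return lo, rounds
```

With a perfectly reliable oracle, a range of 256 candidate counts is resolved in 8 rounds; a real model's answers are noisy, which is presumably why the paper pairs this with its other two strategies.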

[CV-32] PixelRush: Ultra-Fast Training-Free High-Resolution Image Generation via One-step Diffusion

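[Quick Read]: This paper addresses an inherent limitation of pre-trained diffusion models in high-resolution image generation: they are constrained by their native training resolution. Existing training-free methods try to overcome this by intervening in the denoising process but add substantial computational overhead, often taking more than five minutes for a single 4K image. The paper proposes PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Its core idea is to build on the patch-based inference paradigm while eliminating multiple inversion and regeneration cycles, enabling efficient patch-based denoising in a low-step regime. A seamless blending strategy mitigates patch-blending artifacts in few-step generation, and a noise injection mechanism counters over-smoothing. PixelRush generates 4K images in about 20 seconds, a 10-35x speedup over state-of-the-art methods, while maintaining superior visual fidelity.

Link: https://arxiv.org/abs/2602.12769
Authors: Hong-Phuc Lai,Phong Nguyen,Anh Tran
Affiliations: Qualcomm AI Research (高通人工智能研究); Qualcomm Technologies, Inc. (高通技术公司); Qualcomm Vietnam Company Limited (高通越南有限公司)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: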

Abstract:Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image. In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds, representing a 10× to 35× speedup over state-of-the-art methods while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.

[CV-33] Towards complete digital twins in cultural heritage with ART3mis 3D artifacts annotator

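Quick read: This paper addresses the lack of annotation and metadata-attachment functionality for 3D digital artifacts in cultural heritage. Most current tools perform well in their specific application, but they lack generality and interoperability and fall short of the fine-grained information management that archaeologists and heritage practitioners need. The key to the solution is ART3mis, a general-purpose, easy-to-use, feature-rich, interactive web-based textual annotation tool for 3D objects. It follows the W3C Web Annotation Data Model, which supports the communication, distribution, and reuse of information, and helps cultural heritage professionals without a technical background handle, segment, and annotate 3D digital replicas efficiently.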

Link: https://arxiv.org/abs/2602.12761
Authors: Dimitrios Karamatskos, Vasileios Arampatzakis, Vasileios Sevetlidis, Stavros Nousias, Athanasios Kalogeras, Christos Koulamas, Aris Lalos, George Pavlidis
Affiliations: Athena Research Center, Institute for Language and Speech Processing (ILSP); Athena Research Center, Industrial Systems Institute (ISI)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at EUROMED 2022: International Conference on Digital Heritage

Abstract:Archaeologists, as well as specialists and practitioners in cultural heritage, require applications with additional functions, such as the annotation and attachment of metadata to specific regions of the 3D digital artifacts, to go beyond the simplistic three-dimensional (3D) visualization. Different strategies addressed this issue, most of which are excellent in their particular area of application, but their capacity is limited to their design’s purpose; they lack generalization and interoperability. This paper introduces ART3mis, a general-purpose, user-friendly, feature-rich, interactive web-based textual annotation tool for 3D objects. Moreover, it enables the communication, distribution, and reuse of information as it complies with the W3C Web Annotation Data Model. It is primarily designed to help cultural heritage conservators, restorers, and curators who lack technical expertise in 3D imaging and graphics, handle, segment, and annotate 3D digital replicas of artifacts with ease.
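
To make the reference to the W3C Web Annotation Data Model concrete, here is a minimal sketch of what a textual annotation record for a region of a 3D model could look like. The `@context`, `type`, `body`, and `target` keys follow the W3C model; the helper name `make_annotation`, the example URLs, and the `faces=...` fragment syntax for selecting mesh faces are assumptions made for illustration and are not taken from ART3mis.

```python
import json

def make_annotation(anno_id, text, model_url, face_ids):
    """Build a Web Annotation-style record for a region of a 3D model."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "body": {"type": "TextualBody", "value": text, "format": "text/plain"},
        "target": {
            "source": model_url,
            # Hypothetical fragment syntax for mesh faces; not defined by the W3C model.
            "selector": {
                "type": "FragmentSelector",
                "value": "faces=" + ",".join(str(f) for f in face_ids),
            },
        },
    }

anno = make_annotation(
    "http://example.org/anno/1",           # placeholder identifier
    "Crack along the rim",
    "http://example.org/models/vase.ply",  # placeholder model URL
    [12, 13, 14],
)
serialized = json.dumps(anno, indent=2)
```

Because the record is plain JSON-LD, any tool that understands the same data model could in principle read it back, which is the interoperability argument the abstract makes.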

[CV-34] Towards reconstructing experimental sparse-view X-ray CT data with diffusion models

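Quick read: This paper addresses the performance drop of diffusion-based image generators when applied to real sparse-view X-ray Computed Tomography (CT), in particular the challenges caused by training data domain shift and forward model mismatch. The key points of the solution are twofold. First, the authors systematically assess robustness by controlling the degree of domain shift between the synthetic training data and experimental data from a physical phantom; they find that severe domain shift causes model collapse and hallucinations, whereas diverse priors outperform well-matched but narrow priors. Second, because forward model mismatch pulls image samples away from the prior manifold, they use annealed likelihood schedules, which suppress artifacts and also improve computational efficiency. The study shows that performance gains do not transfer directly from synthetic to real experimental data, and stresses that future methods must be validated on real-world benchmarks.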

Link: https://arxiv.org/abs/2602.12755
Authors: Nelas J. Thomsen, Xinyuan Wang, Felix Lucka, Ezgi Demircan-Tureyen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages + references, 4 figures, 2 tables, conference paper

Abstract:Diffusion-based image generators are promising priors for ill-posed inverse problems like sparse-view X-ray Computed Tomography (CT). As most studies consider synthetic data, it is not clear whether training data mismatch ("domain shift") or forward model mismatch complicates their successful application to experimental data. We measured CT data from a physical phantom resembling the synthetic Shepp-Logan phantom and trained diffusion priors on synthetic image data sets with different degrees of domain shift towards it. Then, we employed the priors in a Decomposed Diffusion Sampling scheme on sparse-view CT data sets with increasing difficulty leading to the experimental data. Our results reveal that domain shift plays a nuanced role: while severe mismatch causes model collapse and hallucinations, diverse priors outperform well-matched but narrow priors. Forward model mismatch pulls the image samples away from the prior manifold, which causes artifacts but can be mitigated with annealed likelihood schedules that also increase computational efficiency. Overall, we demonstrate that performance gains do not immediately translate from synthetic to experimental data, and future development must validate against real-world benchmarks.

[CV-35] ReBA-Pred-Net: Weakly-Supervised Regional Brain Age Prediction on MRI

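Quick read: This paper addresses the limits of existing whole brain age (WBA) estimation for disease characterization and for studying development and aging patterns, since the relevant changes are usually region-specific rather than brain-wide. For fine-grained regional brain age (ReBA) estimation, it proposes ReBA-Pred-Net, a Teacher-Student framework. The key is that the Teacher produces soft labels (soft ReBA) to guide the Student, combined with a clinical-prior consistency constraint, namely that functionally related brain regions should show similar trends, which improves the statistical and factual consistency of the ReBA estimates.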

Link: https://arxiv.org/abs/2602.12751
Authors: Shuai Shao, Yan Wang, Shu Jiang, Shiyuan Zhao, Xinzhe Luo, Di Yang, Jiangtao Wang, Yutong Bai, Jianguo Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Brain age has become a prominent biomarker of brain health. Yet most prior work targets whole brain age (WBA), a coarse paradigm that struggles to support tasks such as disease characterization and research on development and aging patterns, because relevant changes are typically region-selective rather than brain-wide. Therefore, robust regional brain age (ReBA) estimation is critical, yet a widely generalizable model has yet to be established. In this paper, we propose the Regional Brain Age Prediction Network (ReBA-Pred-Net), a Teacher-Student framework designed for fine-grained brain age estimation. The Teacher produces soft ReBA to guide the Student to yield reliable ReBA estimates with a clinical-prior consistency constraint (regions within the same function should change similarly). For rigorous evaluation, we introduce two indirect metrics: Healthy Control Similarity (HCS), which assesses statistical consistency by testing whether regional brain-age-gap (ReBA minus chronological age) distributions align between training and unseen HC; and Neuro Disease Correlation (NDC), which assesses factual consistency by checking whether clinically confirmed patients show elevated brain-age-gap in disease-associated regions. Experiments across multiple backbones demonstrate the statistical and factual validity of our method.

[CV-36] Synthetic Craquelure Generation for Unsupervised Painting Restoration

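Quick read: This paper addresses painting restoration for cultural heritage preservation, specifically the identification and restoration of fine crack (craquelure) patterns against complex brushstrokes, a task made especially hard by the lack of pixel-level annotations. The key to the solution is a fully annotation-free framework whose core contributions are: (1) a domain-specific synthetic craquelure generator that uses Bézier trajectories to simulate realistic branching and tapered crack geometry; (2) a classical morphological detector coupled with a learning-based refinement module, a SegFormer backbone fine-tuned with LoRA (Low-Rank Adaptation); and (3) a detector-guided strategy that injects the morphological output into the model as a spatial prior, with a masked hybrid loss and logit adjustment focusing training on refining candidate crack regions. The refined masks then drive an anisotropic diffusion inpainting stage that reconstructs the missing content. In zero-shot settings the method significantly outperforms existing photographic restoration models while faithfully preserving the original brushwork.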

Link: https://arxiv.org/abs/2602.12742
Authors: Jana Cuch-Guillén, Antonio Agudo, Raül Pérez-Gonzalo
Affiliations: Universitat de Barcelona; Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CAI 2026

Abstract:Cultural heritage preservation increasingly demands non-invasive digital methods for painting restoration, yet identifying and restoring fine craquelure patterns from complex brushstrokes remains challenging due to scarce pixel-level annotations. We propose a fully annotation-free framework driven by a domain-specific synthetic craquelure generator, which simulates realistic branching and tapered fissure geometry using Bézier trajectories. Our approach couples a classical morphological detector with a learning-based refinement module: a SegFormer backbone adapted via Low-Rank Adaptation (LoRA). Uniquely, we employ a detector-guided strategy, injecting the morphological map as an input spatial prior, while a masked hybrid loss and logit adjustment constrain the training to focus specifically on refining candidate crack regions. The refined masks subsequently guide an Anisotropic Diffusion inpainting stage to reconstruct missing content. Experimental results demonstrate that our pipeline significantly outperforms state-of-the-art photographic restoration models in zero-shot settings, while faithfully preserving the original paint brushwork.
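
As a rough illustration of how a Bézier trajectory can describe a tapered fissure, the sketch below samples a single cubic Bézier curve and assigns each sample a width that shrinks toward the tip. The function names, the linear taper, and the control points are assumptions for illustration; the paper's generator, including its branching logic, is not reproduced here.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a 2D cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    b0, b1, b2, b3 = u**3, 3*u*u*t, 3*u*t*t, t**3
    return (b0*p0[0] + b1*p1[0] + b2*p2[0] + b3*p3[0],
            b0*p0[1] + b1*p1[1] + b2*p2[1] + b3*p3[1])

def tapered_crack(p0, p1, p2, p3, max_width=3.0, samples=20):
    """Sample (x, y, width) along the curve; width tapers linearly to zero at the tip."""
    points = []
    for i in range(samples + 1):
        t = i / samples
        x, y = cubic_bezier(p0, p1, p2, p3, t)
        points.append((x, y, max_width * (1.0 - t)))
    return points

# Hypothetical control points for one crack segment.
crack = tapered_crack((0, 0), (10, 5), (20, -5), (30, 0))
```

A branching pattern could be approximated by starting further curves from sampled points of a parent curve, each with a smaller `max_width`; that extension is likewise an assumption, not a description of the paper.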

[CV-37] SPRig: Self-Supervised Pose-Invariant Rigging from Mesh Sequences

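Quick read: This paper addresses the topological inconsistencies that existing rigging methods produce on sequential data lacking a canonical rest pose such as the T-pose. In settings like animal motion capture or mesh sequences derived from generative AI (AIGC) or video, conventional methods fail because they assume a canonical rest pose, and they are not pose-invariant when applied frame by frame. The key to the solution is the SPRig framework, which fine-tunes existing models with cross-frame consistency losses to learn pose-invariant rigs, significantly improving temporal stability and reducing the artifacts common in baseline methods.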

Link: https://arxiv.org/abs/2602.12740
Authors: Ruipeng Wang, Langkun Zhong, Miaowei Wang
Affiliations: University of Pennsylvania; The University of Hong Kong; The University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Code: this https URL

Abstract:State-of-the-art rigging methods assume a canonical rest pose, an assumption that fails for sequential data (e.g., animal motion capture or AIGC/video-derived mesh sequences) that lack the T-pose. Applied frame-by-frame, these methods are not pose-invariant and produce topological inconsistencies across frames. We therefore propose SPRig, a general fine-tuning framework that enforces cross-frame consistency losses to learn pose-invariant rigs on top of existing models. We validate our approach on rigging using a new permutation-invariant stability protocol. Experiments demonstrate SOTA temporal stability: our method produces coherent rigs from challenging sequences and dramatically reduces the artifacts that plague baseline methods. The code will be released publicly upon acceptance.

[CV-38] ART3mis: Ray-Based Textual Annotation on 3D Cultural Objects

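Quick read: This paper addresses the lack of efficient, general-purpose, easy-to-use textual annotation tools for 3D digital objects in cultural heritage, particularly for conservators, restorers, and curators without a background in 3D imaging and graphics. Existing methods work well in specific scenarios but are usually confined to a single application domain and are hard to transfer. The key to the solution is ART3mis, a user-friendly, interactive annotation tool that takes a user-driven, direct-on-surface approach, handles complex 3D cultural objects in real time, and stores textual annotations for multiple regions in JSON format, making it easy to segment and semantically annotate 3D digital artifacts.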

Link: https://arxiv.org/abs/2602.12725
Authors: Vasileios Arampatzakis, Vasileios Sevetlidis, Fotis Arnaoutoglou, Athanasios Kalogeras, Christos Koulamas, Aris Lalos, Chairi Kiourt, George Ioannakis, Anestis Koutsoudis, George Pavlidis
Affiliations: Athena Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at CAA 2021 - “Digital Crossroads”

Abstract:Beyond simplistic 3D visualisations, archaeologists, as well as cultural heritage experts and practitioners, need applications with advanced functionalities, such as the annotation and attachment of metadata onto particular regions of 3D digital objects. Various approaches have been presented to tackle this challenge, most of which achieve excellent results in the domain of their application. However, they are often confined to that specific domain and particular problem. In this paper, we present ART3mis, a general-purpose, user-friendly, interactive textual annotation tool for 3D objects. Primarily attuned to aid cultural heritage conservators, restorers and curators with no technical skills in 3D imaging and graphics, the tool allows for the easy handling, segmenting and annotating of 3D digital replicas of artefacts. ART3mis applies a user-driven, direct-on-surface approach. It can handle detailed 3D cultural objects in real-time and store textual annotations for multiple complex regions in JSON data format.

[CV-39] Channel-Aware Probing for Multi-Channel Imaging

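Quick read: This paper addresses the difficulty of training and evaluating vision encoders on Multi-Channel Imaging (MCI) data: channel configurations differ across datasets, which rules out fixed-channel training and limits the reuse of pre-trained encoders under new channel settings. Existing work mostly evaluates via full fine-tuning, probing with frozen pre-trained encoders is underexplored, and strategies transferred directly from other domains perform poorly on MCI. The key to the solution is Channel-Aware Probing (CAP), built on two mechanisms: Independent Feature Encoding (IFE), which encodes each channel separately to preserve channel-specific information, and Decoupled Pooling (DCP), which pools within each channel before aggregating across channels, making better use of the intrinsic inter-channel diversity of MCI data. Experiments show that on three MCI benchmarks CAP clearly outperforms the default probing protocol and substantially narrows the gap to full fine-tuning.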

Link: https://arxiv.org/abs/2602.12696
Authors: Umar Marikkar, Syed Sameed Husain, Muhammad Awais, Sara Atito
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Training and evaluating vision encoders on Multi-Channel Imaging (MCI) data remains challenging as channel configurations vary across datasets, preventing fixed-channel training and limiting reuse of pre-trained encoders on new channel settings. Prior work trains MCI encoders but typically evaluates them via full fine-tuning, leaving probing with frozen pre-trained encoders comparatively underexplored. Existing studies that perform probing largely focus on improving representations, rather than how to best leverage fixed representations for downstream tasks. Although the latter problem has been studied in other domains, directly transferring those strategies to MCI yields weak results, even worse than training from scratch. We therefore propose Channel-Aware Probing (CAP), which exploits the intrinsic inter-channel diversity in MCI datasets by controlling feature flow at both the encoder and probe levels. CAP uses Independent Feature Encoding (IFE) to encode each channel separately, and Decoupled Pooling (DCP) to pool within channels before aggregating across channels. Across three MCI benchmarks, CAP consistently improves probing performance over the default probing protocol, matches fine-tuning from scratch, and largely reduces the gap to full fine-tuning from the same MCI pre-trained checkpoints. Code can be found in this https URL.
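
The ordering that Decoupled Pooling describes, pooling within each channel before aggregating across channels, can be contrasted with pooling all tokens at once. The sketch below uses plain mean pooling and toy embeddings; the function names and the choice of a mean for both steps are assumptions for illustration rather than the paper's probe design.

```python
def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def decoupled_pool(per_channel_tokens):
    """Pool tokens within each channel first, then aggregate across channels."""
    channel_summaries = [mean_vectors(tokens) for tokens in per_channel_tokens]
    return mean_vectors(channel_summaries)

def joint_pool(per_channel_tokens):
    """Baseline: pool all tokens from all channels in a single step."""
    return mean_vectors([tok for tokens in per_channel_tokens for tok in tokens])

# Two channels encoded independently; channel 0 contributes more tokens.
features = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],  # channel 0: three token embeddings
    [[0.0, 4.0]],                          # channel 1: one token embedding
]
decoupled = decoupled_pool(features)
joint = joint_pool(features)
```

In this toy case the joint mean is dominated by the channel with more tokens, while the decoupled version weights both channels equally. With equal token counts and plain means the two coincide, so any practical difference would come from unequal counts or from learned, non-linear pooling.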

[CV-40] Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening ICLR2026

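Quick read: This paper addresses the temporal discontinuities and visual artifacts that image-to-video (I2V) diffusion models exhibit when generating in-between frames, which stem mainly from misalignment between the forward and backward generation paths caused by inconsistent motion priors. The key to the solution is Motion Prior Distillation (MPD), an inference-time distillation technique that distills the motion residual of the forward path into the backward path to suppress the bidirectional mismatch, and deliberately avoids denoising the end-conditioned path, so that the forward motion prior yields more temporally coherent in-between frames.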

Link: https://arxiv.org/abs/2602.12679
Authors: Wooseok Jeon, Seunghyun Shin, Dongmin Shin, Hae-Gon Jeon
Affiliations: Yonsei University; GIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICLR 2026. Project page: this https URL

Abstract:Recent progress in image-to-video (I2V) diffusion models has significantly advanced the field of generative inbetweening, which aims to generate semantically plausible frames between two keyframes. In particular, inference-time sampling strategies, which leverage the generative priors of large-scale pre-trained I2V models without additional training, have become increasingly popular. However, existing inference-time sampling, either fusing forward and backward paths in parallel or alternating them sequentially, often suffers from temporal discontinuities and undesirable visual artifacts due to the misalignment between the two generated paths. This is because each path follows the motion prior induced by its own conditioning frame. In this work, we propose Motion Prior Distillation (MPD), a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path. Our method can deliberately avoid denoising the end-conditioned path which causes the ambiguity of the path, and yield more temporally coherent inbetweening results with the forward motion prior. We not only perform quantitative evaluations on standard benchmarks, but also conduct extensive user studies to demonstrate the effectiveness of our approach in practical scenarios.

[CV-41] SLA2: Sparse-Linear Attention with Learnable Routing and QAT

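Quick read: This paper addresses two core problems of Sparse-Linear Attention (SLA) in diffusion models: first, SLA relies on a heuristic split based on attention-weight magnitude, which can allocate computation suboptimally; second, there is an error mismatch between SLA and a direct decomposition into sparse and linear attention. The key to the solution is SLA2, which brings three innovations: (I) a learnable router that dynamically decides whether each attention computation uses the sparse or the linear branch; (II) a more faithful and direct sparse-linear attention formulation that combines the two branch outputs through a learnable ratio; and (III) a sparse + low-bit attention design that introduces low-bit attention via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models SLA2 reaches 97% attention sparsity and an 18.6x attention speedup while preserving generation quality.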

Link: https://arxiv.org/abs/2602.12675
Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.

[CV-42] IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language Models

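Quick read: This paper addresses the missing geographical and social diversity in how Vision-Language Models (VLMs) represent Indian people: existing fairness datasets treat Indians as a single category and ignore the large internal variation across India's 28 states and 8 Union Territories, which leads to representational and geographical bias. The key to the solution is IndicFairFace, the first balanced face dataset for the Indian context, with 14,400 images uniformly balanced across states and gender and sourced ethically from Wikimedia Commons and other open-license repositories. Using this dataset, the authors quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it with post-hoc Iterative Nullspace Projection, while keeping the average drop in retrieval accuracy below 1.5%, providing the first benchmark for assessing and mitigating geographical bias in VLMs.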

Link: https://arxiv.org/abs/2602.12659
Authors: Aarish Shah Mohsin, Mohammed Tayyab Ilyas Khan, Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Jiechao Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data, with Indians being particularly misrepresented. Existing fairness-aware datasets have significantly improved demographic balance across global race and gender groups, yet they continue to treat Indian as a single monolithic category. This oversimplification ignores the vast intra-national diversity across the 28 states and 8 Union Territories of India and leads to representational and geographical bias. To address this limitation, we present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing the geographical diversity of India. Images were sourced ethically from Wikimedia Commons and open-license web repositories and uniformly balanced across states and gender. Using IndicFairFace, we quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it using a post-hoc Iterative Nullspace Projection debiasing approach. We also show that the adopted debiasing approach does not adversely impact the existing embedding space, as the average drop in retrieval accuracy on benchmark datasets is less than 1.5 percent. Our work establishes IndicFairFace as the first benchmark to study geographical bias in VLMs for the Indian context.
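
The nullspace-projection step behind this kind of debiasing can be sketched in a few lines: each step removes from an embedding its component along a direction that predicts the protected attribute. In the sketch below the bias directions are given as hypothetical inputs; in practice they would come from linear classifiers retrained after each projection, which is not shown, and this is not the paper's code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(x, w):
    """Remove from x its component along direction w (projection onto w's nullspace)."""
    scale = dot(x, w) / dot(w, w)
    return [xi - scale * wi for xi, wi in zip(x, w)]

def iterative_projection(x, directions):
    """Apply nullspace projections in sequence, one per bias direction."""
    for w in directions:
        x = project_out(x, w)
    return x

embedding = [3.0, 4.0, 5.0]
bias_directions = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # hypothetical, mutually orthogonal
debiased = iterative_projection(embedding, bias_directions)
```

After the loop, the embedding has zero dot product with every listed direction, so a linear probe along those directions can no longer separate the groups. That clean guarantee relies on the directions being mutually orthogonal, as they are in this toy example.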

[CV-43] CBEN – A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding

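Quick read: This paper addresses the drop in machine learning model performance caused by clouds in optical remote sensing imagery, especially in time-sensitive scenarios such as natural disaster monitoring, where the usual practice of discarding cloudy images is not an option. The key to the solution is CloudyBigEarthNet (CBEN), a dataset of paired optical and radar images that includes cloud-occluded samples. Including cloudy optical data during training makes models more robust to clouds and markedly improves results on cloudy test cases, with a relative improvement of 17.2 to 28.7 percentage points over the original approaches.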

Link: https://arxiv.org/abs/2602.12652
Authors: Marco Stricker, Masakazu Iwamura, Koichi Kise
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE Transactions on Geoscience and Remote Sensing for possible publication

Abstract:Clouds are a common phenomenon that distorts optical satellite imagery, which poses a challenge for remote sensing. However, in the literature cloudless analysis is often performed where cloudy images are excluded from machine learning datasets and methods. Such an approach cannot be applied to time sensitive applications, e.g., during natural disasters. A possible solution is to apply cloud removal as a preprocessing step to ensure that cloudfree solutions are not failing under such conditions. But cloud removal methods are still actively researched and suffer from drawbacks, such as generated visual artifacts. Therefore, it is desirable to develop cloud robust methods that are less affected by cloudy weather. Cloud robust methods can be achieved by combining optical data with radar, a modality unaffected by clouds. While many datasets for machine learning combine optical and radar data, most researchers exclude cloudy images. We identify this exclusion from machine learning training and evaluation as a limitation that reduces applicability to cloudy scenarios. To investigate this, we assembled a dataset, named CloudyBigEarthNet (CBEN), of paired optical and radar images with cloud occlusion for training and evaluation. Using average precision (AP) as the evaluation metric, we show that state-of-the-art methods trained on combined clear-sky optical and radar imagery suffer performance drops of 23-33 percentage points when evaluated on cloudy images. We then adapt these methods to cloudy optical data during training, achieving relative improvement of 17.2-28.7 percentage points on cloudy test cases compared with the original approaches. Code and dataset are publicly available at: this https URL

[CV-44] Multi-Task Learning with Additive U-Net for Image Denoising and Classification

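Quick read: This paper addresses the uncontrolled information flow and training instability that skip connections in U-Net architectures cause in image denoising and denoising-centric Multi-Task Learning (MTL). The key to the solution is to replace conventional concatenative skips with gated additive fusion, giving the Additive U-Net (AddUNet). By constraining the capacity of the shortcut paths and keeping feature dimensionality fixed, the architecture achieves controlled encoder-decoder information flow and more stable joint optimization. In multi-task settings the learned skip weights show task-aware redistribution: shallow skips favor the reconstruction task, while deeper skips support the discriminative task. Reconstruction performance also holds up when classification capacity is limited, which points to implicit task decoupling through additive fusion.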

Link: https://arxiv.org/abs/2602.12649
Authors: Vikram Lakkavalli, Neelam Sinha
Affiliations: Indian Institute of Information Technology, Bhpur; Centre for Bioinformatics, Indian Institute of Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We investigate additive skip fusion in U-Net architectures for image denoising and denoising-centric multi-task learning (MTL). By replacing concatenative skips with gated additive fusion, the proposed Additive U-Net (AddUNet) constrains shortcut capacity while preserving fixed feature dimensionality across depth. This structural regularization induces controlled encoder-decoder information flow and stabilizes joint optimization. Across single-task denoising and joint denoising-classification settings, AddUNet achieves competitive reconstruction performance with improved training stability. In MTL, learned skip weights exhibit systematic task-aware redistribution: shallow skips favor reconstruction, while deeper features support discrimination. Notably, reconstruction remains robust even under limited classification capacity, indicating implicit task decoupling through additive fusion. These findings show that simple constraints on skip connections act as an effective architectural regularizer for stable and scalable multi-task learning without increasing model complexity.
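
The structural difference between a concatenative skip and a gated additive skip can be shown on a single feature vector. In this minimal sketch the gate is a single learnable scalar passed through a sigmoid; that gate form and the function names are assumptions for illustration, since the abstract does not specify how the gate is parameterized.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def concat_skip(decoder_feat, encoder_feat):
    """Conventional U-Net skip: channel count grows to len(d) + len(e)."""
    return decoder_feat + encoder_feat  # list concatenation

def gated_additive_skip(decoder_feat, encoder_feat, gate_logit):
    """Additive skip: a scalar gate in (0, 1) scales the encoder features; width is unchanged."""
    g = sigmoid(gate_logit)
    return [d + g * e for d, e in zip(decoder_feat, encoder_feat)]

dec = [0.5, -1.0, 2.0]
enc = [1.0, 1.0, 1.0]
fused = gated_additive_skip(dec, enc, gate_logit=0.0)  # gate = 0.5
```

Concatenation doubles the feature width at every skip, whereas the additive form keeps it fixed and exposes one interpretable weight per skip, which is the kind of quantity the abstract reports as shifting between shallow and deep levels.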

[CV-45] ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models

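Quick read: This paper addresses the loss of image quality and prompt alignment that few-step diffusion models suffer in text-to-image generation when sampling steps are reduced, as well as the high training cost of existing methods. The key to the solution is the ImageRAGTurbo framework, which uses retrieval augmentation: without additional fine-tuning, relevant text-image pairs retrieved from a database enrich the latent space ($\mathcal{H}$-space) of the UNet denoiser, and a trainable adapter module then blends the retrieved content with the target prompt through cross-attention, clearly improving image quality and semantic consistency while keeping latency low.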

Link: https://arxiv.org/abs/2602.12640
Authors: Peijie Qiu, Hariharan Ramshankar, Arnau Ramisa, René Vidal, Amit Kumar K C, Vamsi Salaka, Rahul Bhagat
Affiliations: Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures

Abstract:Diffusion models have emerged as the leading approach for text-to-image generation. However, their iterative sampling process, which gradually morphs random noise into coherent images, introduces significant latency that limits their applicability. While recent few-step diffusion models reduce the number of sampling steps to as few as one to four steps, they often compromise image quality and prompt alignment, especially in one-step generation. Additionally, these models require computationally expensive training procedures. To address these limitations, we propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation. Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process. We argue that such retrieved examples provide rich contextual information to the UNet denoiser that helps reduce the number of denoising steps without compromising image quality. Indeed, our initial investigations show that using the retrieved content to edit the denoiser’s latent space ($\mathcal{H}$-space) without additional finetuning already improves prompt fidelity. To further improve the quality of the generated images, we augment the UNet denoiser with a trainable adapter in the $\mathcal{H}$-space, which efficiently blends the retrieved content with the target prompt using a cross-attention mechanism. Experimental results on fast text-to-image generation demonstrate that our approach produces high-fidelity images without compromising latency compared to existing methods.

[CV-46] Formalizing the Sampling Design Space of Diffusion-Based Generative Models via Adaptive Solvers and Wasserstein-Bounded Timesteps

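Quick read: This paper addresses the efficiency bottleneck that high sampling cost creates for deploying diffusion-based generative models, in particular the fact that existing solver selection and scheduling rely on static heuristics that do not adapt to the intrinsic dynamics of the diffusion trajectory. The key to the solution is SDM, a principled framework that re-examines sampling from a geometric perspective. By analyzing the ODE dynamics, it finds that low-order solvers suffice in the early high-noise stages, while higher-order solvers can be introduced progressively in the later, increasingly non-linear stages to improve accuracy. The framework further builds a Wasserstein-bounded optimization scheme that explicitly bounds the local discretization error and adaptively derives a timestep schedule faithful to the continuous dynamics. Without extra training or architectural changes, it achieves state-of-the-art results on several standard datasets with fewer function evaluations.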

Link: https://arxiv.org/abs/2602.12624
Authors: Sangwoo Jo, Sungjoon Choi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion-based generative models have achieved remarkable performance across various domains, yet their practical deployment is often limited by high sampling costs. While prior work focuses on training objectives or individual solvers, the holistic design of sampling, specifically solver selection and scheduling, remains dominated by static heuristics. In this work, we revisit this challenge through a geometric lens, proposing SDM, a principled framework that aligns the numerical solver with the intrinsic properties of the diffusion trajectory. By analyzing the ODE dynamics, we show that efficient low-order solvers suffice in early high-noise stages while higher-order solvers can be progressively deployed to handle the increasing non-linearity of later stages. Furthermore, we formalize the scheduling by introducing a Wasserstein-bounded optimization framework. This method systematically derives adaptive timesteps that explicitly bound the local discretization error, ensuring the sampling process remains faithful to the underlying continuous dynamics. Without requiring additional training or architectural modifications, SDM achieves state-of-the-art performance across standard benchmarks, including an FID of 1.93 on CIFAR-10, 2.41 on FFHQ, and 1.98 on AFHQv2, with a reduced number of function evaluations compared to existing samplers. Our code is available at this https URL.

[CV-47] QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching AAAI2026

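Quick read: This paper addresses the high storage and optimization costs of Post-Training Quantization (PTQ) for large language models (LLMs), in particular the flexibility and efficiency bottlenecks of deploying Transformer architectures at multiple bit-widths. The core of the solution is QuEPT, an elastic-precision quantization method based on one-shot calibration. By cascading different low-rank adapters (Multi-Bit Cascaded Low-Rank adapters, MB-CLoRA), it adapts dynamically to several predefined bit-widths and supports real-time switching between uniform and mixed-precision quantization without repeated optimization. It also introduces Multi-Bit Token Merging (MB-ToMe) to fuse token features across bit-widths, improving robustness during bit-width switching and overall performance.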

Link: https://arxiv.org/abs/2602.12609
Authors: Ke Xu, Yixin Wang, Zhongcheng Li, Hao Cui, Jinshui Hu, Xingyi Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026

Abstract:Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. However, given the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe) to dynamically fuse token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improving the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves comparable or better performance to existing state-of-the-art post-training quantization methods. The code is available at this https URL

[CV-48] Unbiased Gradient Estimation for Event Binning via Functional Backpropagation

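Quick read: This paper addresses biased gradient estimation when training on event camera data, which arises from the discontinuity of the binning function. Conventional methods bin events into frames by time window to fit image-processing pipelines, but this discrete operation truncates gradients and forces algorithms to rely on frame-level features only; processing raw events directly avoids that restriction, yet the discontinuous binning function biases the gradient estimates and hurts learning efficiency. The key to the solution is a weak-derivative framework based on integration by parts: weak derivatives are synthesized during backpropagation while the forward output stays unchanged, and the cotangent function is reconstructed to obtain gradient estimates that match long-range finite differences of both smooth and non-smooth target functions, yielding unbiased gradient updates.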

Link: https://arxiv.org/abs/2602.12590
Authors: Jinze Chen, Wei Zhai, Han Han, Tiankai Ma, Yang Cao, Bin Li, Zheng-Jun Zha
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Event-based vision encodes dynamic scenes as asynchronous spatio-temporal spikes called events. To leverage conventional image processing pipelines, events are typically binned into frames. However, binning functions are discontinuous, which truncates gradients at the frame level and forces most event-based algorithms to rely solely on frame-based features. Attempts to directly learn from raw events avoid this restriction but instead suffer from biased gradient estimation due to the discontinuities of the binning operation, ultimately limiting their learning efficiency. To address this challenge, we propose a novel framework for unbiased gradient estimation of arbitrary binning functions by synthesizing weak derivatives during backpropagation while keeping the forward output unchanged. The key idea is to exploit integration by parts: lifting the target functions to functionals yields an integral form of the derivative of the binning function during backpropagation, where the cotangent function naturally arises. By reconstructing this cotangent function from the sampled cotangent vector, we compute weak derivatives that provably match long-range finite differences of both smooth and non-smooth targets. Experimentally, our method improves simple optimization-based egomotion estimation with 3.2% lower RMS error and $1.57\times$ faster convergence. On complex downstream tasks, we achieve 9.4% lower EPE in self-supervised optical flow, and 5.1% lower RMS error in SLAM, demonstrating broad benefits for event-based visual perception. Source code can be found at this https URL.
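
The discontinuity that motivates this work is easy to reproduce: with hard binning, an arbitrarily small change in one event's timestamp can move a whole count from one frame to the next, while the output is flat everywhere else. The sketch below only illustrates that behavior with a hypothetical `bin_events` helper; it does not implement the paper's weak-derivative estimator.

```python
def bin_events(timestamps, bin_width, num_bins):
    """Count events per time bin (hard assignment via floor)."""
    frame = [0] * num_bins
    for t in timestamps:
        idx = int(t // bin_width)
        if 0 <= idx < num_bins:
            frame[idx] += 1
    return frame

# The second event crosses the bin boundary at t = 1.0 between the two calls.
before = bin_events([0.2, 0.999, 1.5], bin_width=1.0, num_bins=2)
after = bin_events([0.2, 1.001, 1.5], bin_width=1.0, num_bins=2)
```

Between the two calls the timestamp moves by only 0.002, yet the frames differ by a full count in each bin. Away from the boundary the ordinary derivative with respect to the timestamp is zero, which is why a naive gradient carries no useful signal about when the event occurred.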

[CV-49] The Constant Eye: Benchmarking and Bridging Appearance Robustness in Autonomous Driving

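Quick read: This paper addresses the fragility of current autonomous driving planning algorithms under Out-of-Distribution (OOD) conditions, and in particular the failure to separate appearance changes (such as weather and lighting) from changes in scene structure, which makes it impossible to pinpoint the root cause of failures. The key to the solution is navdream, a high-fidelity robustness benchmark that uses generative pixel-aligned style transfer to isolate the effect of appearance while leaving geometry almost unchanged. The authors also design a universal perception interface based on a frozen visual foundation model (DINOv3) that extracts appearance-invariant features as a stable input for the planner, giving zero-shot generalization across planning paradigms and consistent performance under extreme appearance shifts without further fine-tuning.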

Link: https://arxiv.org/abs/2602.12563
Authors: Jiabao Wang, Hongyu Zhou, Yuanbo Yang, Jiahao Shao, Yiyi Liao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite rapid progress, autonomous driving algorithms remain notoriously fragile under Out-of-Distribution (OOD) conditions. We identify a critical decoupling failure in current research: the lack of distinction between appearance-based shifts, such as weather and lighting, and structural scene changes. This leaves a fundamental question unanswered: Is the planner failing because of complex road geometry, or simply because it is raining? To resolve this, we establish navdream, a high-fidelity robustness benchmark leveraging generative pixel-aligned style transfer. By creating a visual stress test with negligible geometric deviation, we isolate the impact of appearance on driving performance. Our evaluation reveals that existing planning algorithms often show significant degradation under OOD appearance conditions, even when the underlying scene structure remains consistent. To bridge this gap, we propose a universal perception interface leveraging a frozen visual foundation model (DINOv3). By extracting appearance-invariant features as a stable interface for the planner, we achieve exceptional zero-shot generalization across diverse planning paradigms, including regression-based, diffusion-based, and scoring-based models. Our plug-and-play solution maintains consistent performance across extreme appearance shifts without requiring further fine-tuning. The benchmark and code will be made available.

[CV-50] PLLM: Pseudo-Labeling Large Language Models for CAD Program Synthesis

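[Quick Read]: This paper addresses recovering Computer-Aided Design (CAD) programs from unlabeled 3D shape data, a task for which existing methods typically depend on paired shape-program annotations that are often hard to obtain. The key to the solution is PLLM, a self-training framework: starting from a pre-trained CAD-capable large language model (LLM), it iteratively samples candidate programs, selects those whose executions have high fidelity, and applies program augmentation to build synthetic program-shape pairs for subsequent fine-tuning, improving CAD program synthesis without manual annotation.

Link: https://arxiv.org/abs/2602.12561
Authors: Yuanbo Li, Dule Shu, Yanying Chen, Matt Klenk, Daniel Ritchie
Affiliations: Brown University; Toyota Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract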

Abstract:Recovering Computer-Aided Design (CAD) programs from 3D geometries is a widely studied problem. Recent advances in large language models (LLMs) have enabled progress in CAD program synthesis, but existing methods rely on supervised training with paired shape-program data, which is often unavailable. We introduce PLLM, a self-training framework for CAD program synthesis from unlabeled 3D shapes. Given a pre-trained CAD-capable LLM and a shape dataset, PLLM iteratively samples candidate programs, selects high-fidelity executions, and augments programs to construct synthetic program-shape pairs for fine-tuning. Experiments on adapting CAD-Recode from DeepCAD to the unlabeled ABC dataset show consistent improvements in geometric fidelity and program diversity.
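
The sample, select, and augment loop described in the PLLM abstract can be sketched roughly as follows. This is a toy illustration under loud assumptions, not the authors' code: `sample_programs`, `fidelity`, and `augment` are hypothetical stand-ins, a "program" is just a number, and "executing" it returns that number as the shape.

```python
import random

def sample_programs(model_bias, shape, k, rng):
    """Hypothetical sampler: each 'program' is a float guess near the shape value."""
    return [shape + model_bias + rng.uniform(-1.0, 1.0) for _ in range(k)]

def fidelity(program, shape):
    """Hypothetical fidelity score in (0, 1]; 1.0 means a perfect reconstruction."""
    return 1.0 / (1.0 + abs(program - shape))

def augment(program, rng):
    """Hypothetical augmentation: perturb the program and pair it with its own execution."""
    new_program = program + rng.uniform(-0.1, 0.1)
    return new_program, new_program  # (program, executed shape)

def pseudo_label_round(model_bias, shapes, k=8, threshold=0.8, seed=0):
    """One self-training round: returns synthetic (program, shape) pairs for fine-tuning."""
    rng = random.Random(seed)
    pairs = []
    for shape in shapes:
        candidates = sample_programs(model_bias, shape, k, rng)
        best = max(candidates, key=lambda p: fidelity(p, shape))
        if fidelity(best, shape) >= threshold:   # keep only high-fidelity executions
            pairs.append((best, shape))
            pairs.append(augment(best, rng))     # augmented pair is consistent by construction
    return pairs
```

In a real pipeline the returned pairs would feed a fine-tuning step and the loop would repeat with the updated model; that step is omitted here.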

[CV-51] Self-Supervised JEPA-based World Models for LiDAR Occupancy Completion and Forecasting

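[Quick Read]: This paper addresses the modeling of environment dynamics that long-term planning in autonomous driving depends on, namely how to learn a generalizable world model in a self-supervised way from large amounts of unlabeled LiDAR data so that it captures how the environment evolves over space and time. The key to the solution is the AD-LiST-JEPA framework, which builds on the joint-embedding predictive architecture (JEPA) and is trained in a self-supervised manner on large-scale unlabeled LiDAR data to predict future spatiotemporal states. The learned representations are validated on a downstream LiDAR-based occupancy completion and forecasting (OCF) task, which jointly assesses perception and prediction.

Link: https://arxiv.org/abs/2602.12540
Authors: Haoran Zhu, Anna Choromanska
Affiliations: New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Click to view abstract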

Abstract:Autonomous driving, as an agent operating in the physical world, requires the fundamental capability to build world models that capture how the environment evolves spatiotemporally in order to support long-term planning. At the same time, scalability demands learning such models in a self-supervised manner; joint-embedding predictive architecture (JEPA) enables learning world models via leveraging large volumes of unlabeled data without relying on expensive human annotations. In this paper, we propose AD-LiST-JEPA, a self-supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data using a JEPA framework. We evaluate the quality of the learned representations through a downstream LiDAR-based occupancy completion and forecasting (OCF) task, which jointly assesses perception and prediction. Proof-of-concept experiments show better OCF performance with a pretrained encoder after JEPA-based world model learning.

[CV-52] Flow-Factory: A Unified Framework for Reinforcement Learning in Flow-Matching Models

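[Quick Read]: This paper addresses the engineering complexity of aligning diffusion models and flow-matching models with human preferences, including fragmented codebases, model-specific implementations, and the difficulty of integrating algorithms. The key to the solution is Flow-Factory, a unified framework whose modular, registry-based architecture decouples algorithms, models, and rewards, so that new algorithms and architectures can be integrated seamlessly and several training paradigms (GRPO, DiffusionNFT, AWM) can be deployed flexibly across different models (Flux, Qwen-Image, and WAN video models). The design lowers the implementation barrier and also provides production-ready memory optimization, flexible multi-reward training, and distributed training support, helping researchers prototype efficiently and scale up.

Link: https://arxiv.org/abs/2602.12529
Authors: Bowen Ping, Chengyou Jia, Minnan Luo, Hangwei Qian, Ivor Tsang
Affiliations: Xi’an Jiaotong University; CFAR, A*STAR
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract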

Abstract:Reinforcement learning has emerged as a promising paradigm for aligning diffusion and flow-matching models with human preferences, yet practitioners face fragmented codebases, model-specific implementations, and engineering complexity. We introduce Flow-Factory, a unified framework that decouples algorithms, models, and rewards through a modular, registry-based architecture. This design enables seamless integration of new algorithms and architectures, as demonstrated by our support for GRPO, DiffusionNFT, and AWM across Flux, Qwen-Image, and WAN video models. By minimizing implementation overhead, Flow-Factory empowers researchers to rapidly prototype and scale future innovations with ease. Flow-Factory provides production-ready memory optimization, flexible multi-reward training, and seamless distributed training support. The codebase is available at this https URL.
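
For readers unfamiliar with the registry pattern the abstract mentions, a minimal sketch is given below. It only illustrates the general idea of decoupling components behind string names. `Registry`, `ALGORITHMS`, `REWARDS`, `LengthPenalty`, `ToyBestOfN`, and `build_trainer` are hypothetical names invented for this example and are not Flow-Factory's actual API.

```python
from typing import Callable, Dict

class Registry:
    """Maps string names to factories so components can be selected from a config."""

    def __init__(self, kind: str):
        self.kind = kind
        self._items: Dict[str, Callable] = {}

    def register(self, name: str):
        def decorator(obj):
            if name in self._items:
                raise KeyError(f"{self.kind} '{name}' already registered")
            self._items[name] = obj
            return obj
        return decorator

    def build(self, name: str, **kwargs):
        if name not in self._items:
            raise KeyError(f"unknown {self.kind} '{name}'")
        return self._items[name](**kwargs)

ALGORITHMS = Registry("algorithm")
REWARDS = Registry("reward")

@REWARDS.register("length_penalty")
class LengthPenalty:
    """Toy reward: shorter samples score higher."""
    def __init__(self, weight: float = 1.0):
        self.weight = weight
    def __call__(self, sample: str) -> float:
        return -self.weight * len(sample)

@ALGORITHMS.register("toy_best_of_n")
class ToyBestOfN:
    """Toy 'algorithm' that only depends on the reward interface, not on a concrete reward."""
    def __init__(self, reward):
        self.reward = reward
    def select(self, samples):
        return max(samples, key=self.reward)

def build_trainer(config: dict):
    """Assemble components purely from names in a config dictionary."""
    reward = REWARDS.build(config["reward"], **config.get("reward_args", {}))
    return ALGORITHMS.build(config["algorithm"], reward=reward)
```

Because the algorithm only sees the reward through a callable interface, a new reward or algorithm can be added by registering one more class, without touching existing code.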

[CV-53] Geometric Stratification for Singular Configurations of the P3P Problem via Local Dual Space

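[Quick Read]: This paper addresses the geometric classification and algebraic characterization of singular configurations of the camera center O in the P3P (Perspective-3-Point) problem. The core challenge is to understand how the multiplicity μ of a P3P solution varies with the geometric conditions when the camera center is at particular positions, and thereby to identify the critical cases that lead to non-unique or infinitely many solutions. The key to the solution is a systematic algebraic-computational framework built on the local dual space, which yields a complete geometric stratification for the camera center O and its complementary configuration O′: for μ ≥ 2, O lies on the "danger cylinder"; for μ ≥ 3, O lies on one of three generatrices of the danger cylinder associated with the first Morley triangle or the circumcircle; and for μ ≥ 4, O lies on the circumcircle (corresponding to infinitely many solutions). The complementary point O′ is characterized as well: for μ ≥ 2 it lies on a deltoidal surface associated with the danger cylinder, and for μ ≥ 3 it lies on one of three cuspidal curves of that surface. The method provides a theoretical basis for stability analysis and robust solving of the P3P problem.

Link: https://arxiv.org/abs/2602.12525
Authors: Xueying Sun, Zijia Li, Nan Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract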

Abstract:This paper investigates singular configurations of the P3P problem. Using local dual space, a systematic algebraic-computational framework is proposed to give a complete geometric stratification for the P3P singular configurations with respect to the multiplicity μ of the camera center O: for μ ≥ 2, O lies on the "danger cylinder"; for μ ≥ 3, O lies on one of three generatrices of the danger cylinder associated with the first Morley triangle or the circumcircle; and for μ ≥ 4, O lies on the circumcircle, which indeed corresponds to infinite P3P solutions. Furthermore, a geometric stratification for the complementary configuration O′ associated with a singular configuration O is studied as well: for μ ≥ 2, O′ lies on a deltoidal surface associated with the danger cylinder, and for μ ≥ 3, O′ lies on one of three cuspidal curves of the deltoidal surface.
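
As a small illustration of the μ ≥ 2 condition, the sketch below tests whether a camera center lies on the danger cylinder, using the standard definition: the circular cylinder through the three control points whose axis is perpendicular to their plane. The function names and the tolerance are assumptions made for this example. It does not implement the paper's local dual space machinery or the higher-multiplicity strata.

```python
import math

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def norm(u):
    return math.sqrt(dot(u, u))

def circumcircle(A, B, C):
    """Return (center, radius, unit normal) of the circle through 3D points A, B, C."""
    a, b = sub(A, C), sub(B, C)
    n = cross(a, b)
    n2 = dot(n, n)
    if n2 == 0:
        raise ValueError("control points are collinear")
    # Standard 3D circumcenter formula relative to C.
    t = cross([dot(a, a) * bi - dot(b, b) * ai for ai, bi in zip(a, b)], n)
    center = [c + ti / (2 * n2) for c, ti in zip(C, t)]
    return center, norm(sub(A, center)), [x / math.sqrt(n2) for x in n]

def on_danger_cylinder(O, A, B, C, tol=1e-9):
    """True if O is at circumradius distance from the axis through the circumcenter."""
    center, radius, axis = circumcircle(A, B, C)
    d = sub(O, center)
    h = dot(d, axis)                                  # height along the cylinder axis
    radial = [di - h * ai for di, ai in zip(d, axis)]  # component perpendicular to the axis
    return abs(norm(radial) - radius) <= tol * max(1.0, radius)
```

For control points on the unit circle in the plane z = 0, any center of the form (cos θ, sin θ, z) passes the test, while a center on the axis itself does not.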

[CV-54] LiDAR-Anchored Collaborative Distillation for Robust 2D Representations

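[Quick Read]: This paper addresses the insufficient robustness of pre-trained 2D image encoders under noisy and adverse weather conditions (scenes other than clear daytime), which limits their real-world use. The key to the solution is a novel self-supervised method, Collaborative Distillation, which uses 3D LiDAR as a self-supervision signal to make 2D image encoders more robust to difficult conditions while retaining their original capabilities. It also strengthens the model's 3D awareness and improves downstream performance and generalization across diverse conditions.

Link: https://arxiv.org/abs/2602.12524
Authors: Wonjun Jo, Hyunwoo Ha, Kim Ji-Yeon, Hawook Jeong, Tae-Hyun Oh
Affiliations: POSTECH; RideFlux Inc.; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract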

Abstract:As deep learning continues to advance, self-supervised learning has made considerable strides. It allows 2D image encoders to extract useful features for various downstream tasks, including those related to vision-based systems. Nevertheless, pre-trained 2D image encoders fall short under noisy and adverse weather conditions beyond clear daytime scenes, which require robust visual perception. To address these issues, we propose a novel self-supervised approach, Collaborative Distillation, which leverages 3D LiDAR as self-supervision to improve robustness to noisy and adverse weather conditions in 2D image encoders while retaining their original capabilities. Our method outperforms competing methods in various downstream tasks across diverse conditions and exhibits strong generalization ability. In addition, our method also improves 3D awareness stemming from LiDAR’s characteristics. This advancement highlights our method’s practicality and adaptability in real-world scenarios.

[CV-55] Matching of SAR and optical images based on transformation to shared modality

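[Quick Read]: This paper addresses the difficulty of precise co-registration between optical images and Synthetic Aperture Radar (SAR) images, which stems from the pronounced differences in image characteristics caused by their fundamentally different imaging physics. The key to the solution is a new image transformation that maps both optical and SAR images to a shared new modality satisfying three conditions: (1) the transformed images have the same pre-defined number of channels; (2) the transformed images are as similar as possible; and (3) the transformed images preserve the significant features of the originals (non-degeneracy). Matching is then performed with the RoMa image matching model pre-trained on regular digital photographs, which reportedly achieves high-quality cross-modal registration without retraining for the new modality and clearly outperforms approaches based on image translation between the original modalities or on conventional feature matching algorithms.

Link: https://arxiv.org/abs/2602.12515
Authors: Alexey Borisov, Evgeny Myasnikov, Vladislav Myasnikov
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract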

Abstract:Significant differences in optical images and Synthetic Aperture Radar (SAR) images are caused by fundamental differences in the physical principles underlying their acquisition by Earth remote sensing platforms. These differences make precise image matching (co-registration) of these two types of images difficult. In this paper, we propose a new approach to image matching of optical and SAR images, which is based on transforming the images to a new modality. The new image modality is common to both optical and SAR images and satisfies the following conditions. First, the transformed images must have an equal pre-defined number of channels. Second, the transformed and co-registered images must be as similar as possible. Third, the transformed images must be non-degenerate, meaning they must preserve the significant features of the original images. To further match images transformed to this shared modality, we train the RoMa image matching model, which is one of the leading solutions for matching of regular digital photographs. We evaluated the proposed approach on the publicly available MultiSenGE dataset containing both optical and SAR images. We demonstrated its superiority over alternative approaches based on image translation between original modalities and various feature matching algorithms. The proposed solution not only provides better quality of matching, but is also more versatile. It enables the use of ready-made RoMa and DeDoDe models, pre-trained for regular images, without retraining for a new modality, while maintaining high-quality matching of optical and SAR images.

[CV-56] Monocular Reconstruction of Neural Tactile Fields

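[Quick Read]: This paper addresses a challenge robots face when operating in real environments: conventional path planning relies on static geometric occupancy representations and cannot handle environments that deform, yield, or reconfigure under contact. To address this, the authors propose a novel 3D representation, neural tactile fields, which maps spatial locations to the tactile response expected upon contact, giving an interaction-aware model of the 3D environment. The key innovation is that the method is the first to predict neural tactile fields from a single monocular RGB image, allowing a robot to tell high-resistance from low-resistance regions and plan more sensible paths, for example deliberately routing through low-resistance regions such as foliage rather than treating all occupied space as impassable. Experiments show improvements of 85.8% in volumetric 3D reconstruction and 26.7% in surface reconstruction over state-of-the-art monocular reconstruction methods (LRM and Direct3D).

Link: https://arxiv.org/abs/2602.12508
Authors: Pavan Mantripragada, Siddhanth Deshmukh, Eadom Dessalene, Manas Desai, Yiannis Aloimonos
Affiliations: University of Maryland, College Park, USA
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures

Click to view abstract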

Abstract:Robots operating in the real world must plan through environments that deform, yield, and reconfigure under contact, requiring interaction-aware 3D representations that extend beyond static geometric occupancy. To address this, we introduce neural tactile fields, a novel 3D representation that maps spatial locations to the expected tactile response upon contact. Our model predicts these neural tactile fields from a single monocular RGB image – the first method to do so. When integrated with off-the-shelf path planners, neural tactile fields enable robots to generate paths that avoid high-resistance objects while deliberately routing through low-resistance regions (e.g. foliage), rather than treating all occupied space as equally impassable. Empirically, our learning framework improves volumetric 3D reconstruction by 85.8% and surface reconstruction by 26.7% compared to state-of-the-art monocular 3D reconstruction methods (LRM and Direct3D).

[CV-57] Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models ICML2026

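[Quick Read]: This paper addresses the weak ability of Vision-Language Models (VLMs) to distinguish affirmative from negated statements in clinical reports; in radiology in particular, common VLMs confuse negated and non-negated medical findings. The key to the solution is Negation-Aware Selective Training (NAST), an interpretability-guided selective training method that uses Causal Tracing Effects (CTEs) to modulate the magnitude of each layer's gradient update during fine-tuning. Each layer's update is scaled according to its causal contribution to negation processing, which turns mechanistic interpretability signals into a structured optimization rule and improves negation handling without degrading overall vision-language alignment.

Link: https://arxiv.org/abs/2602.12498
Authors: Ali Abbasi, Mehdi Taghipour, Rahmatollah Beheshti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 5 figures. Submitted to ICML 2026

Click to view abstract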

Abstract:Negation is a fundamental linguistic operation in clinical reporting, yet vision-language models (VLMs) frequently fail to distinguish affirmative from negated medical statements. To systematically characterize this limitation, we introduce a radiology-specific diagnostic benchmark that evaluates polarity sensitivity under controlled clinical conditions, revealing that common medical VLMs consistently confuse negated and non-negated findings. To enable learning beyond simple condition absence, we further construct a contextual clinical negation dataset that encodes structured claims and supports attribute-level negations involving location and severity. Building on these resources, we propose Negation-Aware Selective Training (NAST), an interpretability-guided adaptation method that uses causal tracing effects (CTEs) to modulate layer-wise gradient updates during fine-tuning. Rather than applying uniform learning rates, NAST scales each layer’s update according to its causal contribution to negation processing, transforming mechanistic interpretability signals into a principled optimization rule. Experiments demonstrate improved discrimination of affirmative and negated clinical statements without degrading general vision-language alignment, highlighting the value of causal interpretability for targeted model adaptation in safety-critical medical settings. Code and resources are available at this https URL.
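
The idea of scaling each layer's update by its causal contribution can be sketched as below. The abstract does not state how CTEs are mapped to multipliers, so the max-normalization, the `floor` parameter, and the plain SGD step are assumptions made for illustration only, and `layer_scales` and `scaled_sgd_step` are hypothetical names rather than NAST's implementation.

```python
def layer_scales(cte_scores, floor=0.1):
    """Map per-layer causal tracing effects to multipliers in [floor, 1] (assumed rule)."""
    top = max(cte_scores.values())
    if top <= 0:
        return {name: floor for name in cte_scores}
    return {name: floor + (1 - floor) * max(score, 0.0) / top
            for name, score in cte_scores.items()}

def scaled_sgd_step(params, grads, cte_scores, base_lr=0.01):
    """One SGD step where each layer's effective learning rate is base_lr * scale."""
    scales = layer_scales(cte_scores)
    return {name: [w - base_lr * scales[name] * g
                   for w, g in zip(params[name], grads[name])]
            for name in params}
```

With this rule, the layer with the largest CTE receives the full base learning rate, and layers with no measured effect still move slightly because of the floor.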

[CV-58] Insertion Network for Image Sequence Correspondence

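[Quick Read]: This paper addresses slice-level content navigation in 2D image sequences, that is, localizing specific 2D slices within a 3D medical volume or determining the anatomical coverage of a 3D scan, an important preprocessing step for diagnostic tasks and for automatic registration and segmentation pipelines. The key to the solution is a sequence correspondence method based on an insertion network: a neural network is trained to insert a slice from one sequence into the appropriate position in another, with a slice-to-slice attention mechanism modeling the context between slices. The method thus uses contextual information from the whole sequence instead of treating each slice independently as body part regression does. In supervised settings it reduces the localization error from 8.4 mm to 5.4 mm.

Link: https://arxiv.org/abs/2602.12489
Authors: Dingjie Su, Weixiang Hong, Benoit M. Dawant, Bennett A. Landman
Affiliations: Vanderbilt University; University of Portsmouth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract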

Abstract:We propose a novel method for establishing correspondence between two sequences of 2D images. One particular application of this technique is slice-level content navigation, where the goal is to localize specific 2D slices within a 3D volume or determine the anatomical coverage of a 3D scan based on its 2D slices. This serves as an important preprocessing step for various diagnostic tasks, as well as for automatic registration and segmentation pipelines. Our approach builds sequence correspondence by training a network to learn how to insert a slice from one sequence into the appropriate position in another. This is achieved by encoding contextual representations of each slice and modeling the insertion process using a slice-to-slice attention mechanism. We apply this method to localize manually labeled key slices in body CT scans and compare its performance to the current state-of-the-art alternative known as body part regression, which predicts anatomical position scores for individual slices. Unlike body part regression, which treats each slice independently, our method leverages contextual information from the entire sequence. Experimental results show that the insertion network reduces slice localization errors in supervised settings from 8.4 mm to 5.4 mm, demonstrating a substantial improvement in accuracy.

[CV-59] Human-Like Coarse Object Representations in Vision Models

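[Quick Read]: This paper asks why current vision models (such as segmentation models) struggle to acquire the coarse, volumetric object representations that human intuitive physics relies on. The core difficulty is that existing models optimize pixel-accurate masks, which often disagree with the abstract, smoothed object boundaries humans use for physical prediction. The key to the approach is a time-to-collision (TTC) behavioral paradigm and an alignment metric, combined with systematic manipulation of model training time, size, and effective capacity (via pruning). Alignment with human behavior follows an inverse U-shaped curve: models that are too small or trained too briefly under-segment into blobs, models that are too large or fully trained over-segment with boundary wiggles, and only intermediate models show the most human-like "ideal body granularity". The results suggest that human-like coarse physical representations emerge from computational resource constraints rather than from specific prior biases, and that simple strategies such as early checkpoints, modest architectures, or light pruning can elicit physics-efficient representations.

Link: https://arxiv.org/abs/2602.12486
Authors: Andrey Gizdov, Andrea Procopio, Yichen Li, Daniel Harari, Tomer Ullman
Affiliations: Harvard University; Weizmann Institute of Science; Bocconi University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract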

Abstract:Humans appear to represent objects for intuitive physics with coarse, volumetric "bodies" that smooth concavities - trading fine visual details for efficient physical predictions - yet their internal structure is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We ask whether and when these models nonetheless acquire human-like bodies. Using a time-to-collision (TTC) behavioral paradigm, we introduce a comparison pipeline and alignment metric, then vary model training time, size, and effective capacity via pruning. Across all manipulations, alignment with human behavior follows an inverse U-shaped curve: small/briefly trained/pruned models under-segment into blobs; large/fully trained models over-segment with boundary wiggles; and an intermediate "ideal body granularity" best matches humans. This suggests human-like coarse bodies emerge from resource constraints rather than bespoke biases, and points to simple knobs - early checkpoints, modest architectures, light pruning - for eliciting physics-efficient representations. We situate these results within resource-rational accounts balancing recognition detail against physical affordances.

[CV-60] A Lightweight and Explainable DenseNet-121 Framework for Grape Leaf Disease Classification

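[Quick Read]: This paper addresses the early and precise identification of grape leaf diseases (such as Bacterial Rot, Downy Mildew, and Powdery Mildew) in support of sustainable vineyard management. Current automated methods based on frameworks such as YOLO are computationally costly and lack interpretability, which makes them hard to apply in practice. The key to the solution is an optimized DenseNet121 model that combines domain-specific preprocessing with dense connectivity to extract disease-relevant features (such as veins, edges, and lesions), and uses Grad-CAM to make the outputs more interpretable and to check that the model attends to physiologically relevant lesion regions. Transfer learning improves robustness on small and imbalanced datasets, and model optimization targets low-latency inference (a reported 9 seconds per inference), so that high accuracy (99.27% accuracy, 99.28% F1 score) is combined with real-time deployability.

Link: https://arxiv.org/abs/2602.12484
Authors: Md. Ehsanul Haque, Md.Saymon Hosen Polash, Rakib Hasan Ovi, Aminul Kader Bulbul, Md Kamrul Siam, Tamim Hasan Saykat
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted and Presented at 28th International Conference on Computer and Information Technology (ICCIT)

Click to view abstract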

Abstract:Grapes are among the most economically and culturally significant fruits on a global scale, and table grapes and wine are produced in significant quantities in Europe and Asia. The production and quality of grapes are significantly impacted by grape diseases such as Bacterial Rot, Downy Mildew, and Powdery Mildew. Consequently, the sustainable management of a vineyard necessitates the early and precise identification of these diseases. Current automated methods, particularly those that are based on the YOLO framework, are often computationally costly and lack interpretability, which makes them unsuitable for real-world scenarios. This study proposes grape leaf disease classification using Optimized DenseNet 121. Domain-specific preprocessing and extensive connectivity reveal disease-relevant characteristics, including veins, edges, and lesions. An extensive comparison with baseline CNN models, including ResNet18, VGG16, AlexNet, and SqueezeNet, demonstrates that the proposed model exhibits superior performance. It achieves an accuracy of 99.27%, an F1 score of 99.28%, a specificity of 99.71%, and a Kappa of 98.86%, with an inference time of 9 seconds. The cross-validation findings show a mean accuracy of 99.12%, indicating strength and generalizability across all classes. We also employ Grad-CAM to highlight disease-related regions, to verify that the model focuses on physiologically relevant aspects and to increase transparency and confidence. Model optimization reduces processing requirements for real-time deployment, while transfer learning ensures consistency on smaller and unbalanced samples. An effective architecture, domain-specific preprocessing, and interpretable outputs make the proposed framework scalable, precise, and computationally inexpensive for detecting grape leaf diseases.
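
For reference, the metrics quoted in the abstract (accuracy, specificity, Cohen's kappa) can all be derived from a confusion matrix. The sketch below uses the standard definitions with rows as true classes and columns as predicted classes. The example matrix in the usage note is made up and is not the paper's data.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total

def cohen_kappa(cm):
    """Agreement corrected for chance: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_observed = sum(cm[k][k] for k in range(n)) / total
    p_expected = sum(sum(cm[k]) * sum(cm[r][k] for r in range(n))
                     for k in range(n)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)

def specificity(cm, k):
    """True negative rate for class k, treating all other classes as negative."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    predicted_k = sum(cm[r][k] for r in range(n))
    false_pos = predicted_k - cm[k][k]
    true_neg = total - sum(cm[k]) - predicted_k + cm[k][k]
    return true_neg / (true_neg + false_pos)
```

For a hypothetical two-class matrix `[[45, 5], [5, 45]]`, accuracy is 0.9, chance agreement is 0.5, and kappa is therefore (0.9 - 0.5) / (1 - 0.5) = 0.8.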

[CV-61] Semantic-aware Adversarial Fine-tuning for CLIP

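[Quick Read]: This paper addresses the insufficient adversarial robustness of CLIP in zero-shot classification when adversarial examples (AEs) are generated from cosine similarity. The authors show that AEs generated from the cosine similarity between an image and a hand-crafted template (such as "A photo of a label") fail once the similarity metric is replaced with a semantically richer one, so an image encoder fine-tuned on them is less robust. The key to the solution is a semantic-ensemble attack, which generates more semantic-aware AEs by minimizing the average similarity between the original image and a set of semantically enriched textual descriptions generated by a foundation model and then refined. Building on this, Semantic-aware Adversarial Fine-Tuning (SAFT) fine-tunes the CLIP image encoder with these semantic-aware AEs, substantially improving zero-shot adversarial robustness across 16 datasets.

Link: https://arxiv.org/abs/2602.12461
Authors: Jiacheng Zhang, Jinhao Li, Hanxun Huang, Sarah M. Erfani, Benjamin I.P. Rubinstein, Feng Liu
Affiliations: The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract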

Abstract:Recent studies have shown that CLIP model’s adversarial robustness in zero-shot classification tasks can be enhanced by adversarially fine-tuning its image encoder with adversarial examples (AEs), which are generated by minimizing the cosine similarity between images and a hand-crafted template (e.g., "A photo of a label"). However, it has been shown that the cosine similarity between a single image and a single hand-crafted template is insufficient to measure the similarity for image-text pairs. Building on this, in this paper, we find that the AEs generated using cosine similarity may fail to fool CLIP when the similarity metric is replaced with semantically enriched alternatives, making the image encoder fine-tuned with these AEs less robust. To overcome this issue, we first propose a semantic-ensemble attack to generate semantic-aware AEs by minimizing the average similarity between the original image and an ensemble of refined textual descriptions. These descriptions are initially generated by a foundation model to capture core semantic features beyond hand-crafted templates and are then refined to reduce hallucinations. To this end, we propose Semantic-aware Adversarial Fine-Tuning (SAFT), which fine-tunes CLIP’s image encoder with semantic-aware AEs. Extensive experiments show that SAFT outperforms current methods, achieving substantial improvements in zero-shot adversarial robustness across 16 datasets. Our code is available at: this https URL.
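
The quantity the semantic-ensemble attack minimizes, an average similarity over several text descriptions instead of one template, can be written down in a few lines. The sketch below assumes cosine similarity on toy embedding vectors and covers only the objective. The embeddings, the function names, and the omission of the actual perturbation search are all simplifications and not the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def template_similarity(image_emb, template_emb):
    """Single hand-crafted template: the objective used by the baseline attack."""
    return cosine(image_emb, template_emb)

def ensemble_similarity(image_emb, description_embs):
    """Average similarity over an ensemble of text descriptions."""
    return sum(cosine(image_emb, t) for t in description_embs) / len(description_embs)
```

An attack would search for a small image perturbation that lowers `ensemble_similarity` for the true class. Averaging over descriptions means a perturbation that only moves the image away from one template direction is not enough to drive the objective down.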

[CV-62] Prototype-driven fusion of pathology and spatial transcriptomics for interpretable survival prediction

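[Quick Read]: This paper addresses the lack of interpretable yet discriminative joint modeling methods for fusing multimodal spatial omics with pathology images, in particular how to combine the complementary spatial signals of paired Whole Slide Images (WSIs) and Spatial Transcriptomics (ST) data to improve prognosis prediction as such cohorts scale to population level. The key to the solution is the PathoSpatial framework, which uses a multi-level experts architecture with task-guided prototype learning to adaptively coordinate unsupervised within-modality discovery and supervised cross-modal aggregation, substantially strengthening interpretability while maintaining discriminative ability. Post-hoc prototype interpretation and molecular risk decomposition allow quantitative, biologically interpretable identification of prognostic factors.

Link: https://arxiv.org/abs/2602.12441
Authors: Lihe Liu, Xiaoxi Pan, Yinyin Yuan, Lulu Shang
Affiliations: MD Anderson Cancer Center; The Institute for Data Science in Oncology (IDSO)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract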

Abstract:Whole slide images (WSIs) enable weakly supervised prognostic modeling via multiple instance learning (MIL). Spatial transcriptomics (ST) preserves in situ gene expression, providing a spatial molecular context that complements morphology. As paired WSI-ST cohorts scale to population level, leveraging their complementary spatial signals for prognosis becomes crucial; however, principled cross-modal fusion strategies remain limited for this paradigm. To this end, we introduce PathoSpatial, an interpretable end-to-end framework integrating co-registered WSIs and ST to learn spatially informed prognostic representations. PathoSpatial uses task-guided prototype learning within a multi-level experts architecture, adaptively orchestrating unsupervised within-modality discovery with supervised cross-modal aggregation. By design, PathoSpatial substantially strengthens interpretability while maintaining discriminative ability. We evaluate PathoSpatial on a triple-negative breast cancer cohort with paired ST and WSIs. PathoSpatial delivers strong and consistent performance across five survival endpoints, achieving superior or comparable performance to leading unimodal and multimodal methods. PathoSpatial inherently enables post-hoc prototype interpretation and molecular risk decomposition, providing quantitative, biologically grounded explanations, highlighting candidate prognostic factors. We present PathoSpatial as a proof-of-concept for scalable and interpretable multimodal learning for spatial omics-pathology fusion.

[CV-63] MiDAS: A Multimodal Data Acquisition System and Dataset for Robot-Assisted Minimally Invasive Surgery

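[Quick Read]: This paper addresses the difficulty of acquiring multimodal data in Robot-assisted Minimally Invasive Surgery (RMIS) research caused by the lack of access to proprietary robot telemetry. The key to the solution is MiDAS, an open-source, platform-agnostic system for time-synchronized, non-invasive multimodal data acquisition, covering electromagnetic and RGB-D hand tracking, foot pedal sensing, and surgical video capture, without relying on proprietary robot interfaces. Validated on the Raven-II and da Vinci Xi platforms, the system yields external sensing signals that correlate closely with internal robot kinematics and gesture recognition performance comparable to proprietary telemetry, supporting reproducible and standardized data for RMIS research.

Link: https://arxiv.org/abs/2602.12407
Authors: Keshara Weerasinghe, Seyed Hamid Reza Roodabeh, Andrew Hawkins(MD), Zhaomeng Zhang, Zachary Schrader, Homa Alemzadeh
Affiliations: University of Virginia
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 29 pages, 17 figures

Click to view abstract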

Abstract:Background: Robot-assisted minimally invasive surgery (RMIS) research increasingly relies on multimodal data, yet access to proprietary robot telemetry remains a major barrier. We introduce MiDAS, an open-source, platform-agnostic system enabling time-synchronized, non-invasive multimodal data acquisition across surgical robotic platforms. Methods: MiDAS integrates electromagnetic and RGB-D hand tracking, foot pedal sensing, and surgical video capturing without requiring proprietary robot interfaces. We validated MiDAS on the open-source Raven-II and the clinical da Vinci Xi by collecting multimodal datasets of peg transfer and hernia repair suturing tasks performed by surgical residents. Correlation analysis and downstream gesture recognition experiments were conducted. Results: External hand and foot sensing closely approximated internal robot kinematics and non-invasive motion signals achieved gesture recognition performance comparable to proprietary telemetry. Conclusion: MiDAS enables reproducible multimodal RMIS data collection and is released with annotated datasets, including the first multimodal dataset capturing hernia repair suturing on high-fidelity simulation models.

[CV-64] MonoLoss: A Training Objective for Interpretable Monosemantic Representations

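[Quick Read]: This paper addresses the difficulty Sparse Autoencoders (SAEs) have during training in decomposing polysemantic neural representations into interpretable monosemantic features. The monosemanticity metrics that existing methods rely on require pairwise comparisons across all dataset samples, so the computational cost grows quadratically and severely limits training and evaluation efficiency. The key to the solution is a single-pass algorithm that exactly computes the recently proposed MonoScore metric, reducing the cost from quadratic to linear (up to a 1200x evaluation speedup and a 159x training speedup on OpenImagesV7, with only about 4% per-epoch overhead). Building on this efficient metric, the authors introduce the Monosemanticity Loss (MonoLoss), a loss term that plugs directly into training and explicitly encourages semantically consistent activation patterns, improving the quality and class purity of the monosemantic features learned by SAEs. Its effectiveness is validated across several pretrained models (CLIP, SigLIP2, ViT) and different SAE architectures.

Link: https://arxiv.org/abs/2602.12403
Authors: Ali Nasiri-Sarvi, Anh Tien Nguyen, Hassan Rivaz, Dimitris Samaras, Mahdi S. Hosseini
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Click to view abstract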

Abstract:Sparse autoencoders (SAEs) decompose polysemantic neural representations, where neurons respond to multiple unrelated concepts, into monosemantic features that capture single, interpretable concepts. However, standard training objectives only weakly encourage this decomposition, and existing monosemanticity metrics require pairwise comparisons across all dataset samples, making them inefficient during training and evaluation. We study a recent MonoScore metric and derive a single-pass algorithm that computes exactly the same quantity, but with a cost that grows linearly, rather than quadratically, with the number of dataset images. On OpenImagesV7, we achieve up to a 1200x wall-clock speedup in evaluation and 159x during training, while adding only ~4% per-epoch overhead. This allows us to treat MonoScore as a training signal: we introduce the Monosemanticity Loss (MonoLoss), a plug-in objective that directly rewards semantically consistent activations for learning interpretable monosemantic representations. Across SAEs trained on CLIP, SigLIP2, and pretrained ViT features, using BatchTopK, TopK, and JumpReLU SAEs, MonoLoss increases MonoScore for most latents. MonoLoss also consistently improves class purity (the fraction of a latent’s activating images belonging to its dominant class) across all encoder and SAE combinations, with the largest gain raising baseline purity from 0.152 to 0.723. Used as an auxiliary regularizer during ResNet-50 and CLIP-ViT-B/32 finetuning, MonoLoss yields up to 0.6% accuracy gains on ImageNet-1K and monosemantic activating patterns on standard benchmark datasets. The code is publicly available at this https URL.
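
To see how a pairwise metric can collapse to a single pass, consider the sketch below. The abstract does not give the MonoScore formula, so the code assumes a generic form: an activation-weighted mean of dot-product similarities over all pairs i ≠ j. Under that assumption the double sum equals ‖Σ aᵢeᵢ‖² minus the diagonal terms, which needs only running sums. The function names are hypothetical and the paper's exact definition and algorithm may differ.

```python
def pairwise_score(acts, embs):
    """Quadratic-time reference: weighted mean similarity over all pairs i != j."""
    num = den = 0.0
    n = len(acts)
    for i in range(n):
        for j in range(n):
            if i != j:
                w = acts[i] * acts[j]
                num += w * sum(x * y for x, y in zip(embs[i], embs[j]))
                den += w
    return num / den

def single_pass_score(acts, embs):
    """Linear-time equivalent that keeps only running sums over the samples."""
    dim = len(embs[0])
    weighted_sum = [0.0] * dim   # sum_i a_i * e_i
    diag = 0.0                   # sum_i a_i^2 * ||e_i||^2
    act_sum = act_sq = 0.0       # sum_i a_i and sum_i a_i^2
    for a, e in zip(acts, embs):
        for k in range(dim):
            weighted_sum[k] += a * e[k]
        diag += a * a * sum(x * x for x in e)
        act_sum += a
        act_sq += a * a
    num = sum(x * x for x in weighted_sum) - diag   # removes the i == j terms
    den = act_sum * act_sum - act_sq
    return num / den
```

Both functions return the same value up to floating-point error, but the second touches each sample once, which is the kind of saving that makes such a score usable as a training signal.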

[CV-65] ZeroDiff: Substantial Unseen Visual-semantic Correlation in Zero-shot Learning

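[Quick Read]: This paper targets spurious visual-semantic correlations in Zero-shot Learning (ZSL) caused by scarce training samples, as well as the disconnect from real test-sample features caused by the unadaptive, fully noised generators used in existing generative methods. The core of the solution is ZeroDiff++, a diffusion-based generative framework. During training, it strengthens visual-semantic correlations for seen and unseen classes through diffusion augmentation, supervised contrastive representations, and multi-view discriminators with Wasserstein mutual learning. At test time, it introduces Diffusion-based Test time Adaptation (DiffTTA) and Diffusion-based Test time Generation (DiffGen), which use pseudo-label reconstruction and tracing of the denoising path to produce partially synthesized features that connect to real data while remaining semantically consistent, mitigating data scarcity and improving robustness.

Link: https://arxiv.org/abs/2602.12401
Authors: Zihan Ye, Shreyank N Gowda, Kaile Du, Weijian Luo, Ling Shao
Affiliations: UCAS-Terminus AI Lab, University of Chinese Academy of Sciences; University of Nottingham; Southeast University; hi-lab of Xiaohongshu Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Click to view abstract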

Abstract:Zero-shot Learning (ZSL) enables classifiers to recognize classes unseen during training, commonly via generative two-stage methods: (1) learn visual-semantic correlations from seen classes; (2) synthesize unseen class features from semantics to train classifiers. In this paper, we identify spurious visual-semantic correlations in existing generative ZSL, worsened by scarce seen class samples, and introduce two metrics to quantify spuriousness for seen and unseen classes. Furthermore, we point out a more critical bottleneck: existing unadaptive fully noised generators produce features disconnected from real test samples, which also leads to spurious correlations. To enhance the visual-semantic correlations on both seen and unseen classes, we propose ZeroDiff++, a diffusion-based generative framework. In training, ZeroDiff++ uses (i) diffusion augmentation to produce diverse noised samples, (ii) supervised contrastive (SC) representations for instance-level semantics, and (iii) multi-view discriminators with Wasserstein mutual learning to assess generated features. At generation time, we introduce (iv) Diffusion-based Test time Adaptation (DiffTTA) to adapt the generator using pseudo-label reconstruction, and (v) Diffusion-based Test time Generation (DiffGen) to trace the diffusion denoising path and produce partially synthesized features that connect real and generated data, further mitigating data scarcity. Extensive experiments on three ZSL benchmarks demonstrate that ZeroDiff++ not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Code will be made available.

[CV-66] What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

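[Quick read]: This paper addresses the unclear mechanism by which reinforcement learning (RL) improves visual reasoning during the post-training of vision-language models, in particular the difficulty of isolating which skills RL improves relative to supervised fine-tuning (SFT). End-to-end benchmark results conflate multiple factors, so gains cannot be attributed to specific capabilities. The key to the solution is a Frankenstein-style analysis framework with three core steps: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability testing via model merging. The framework shows that RL mainly induces a consistent inference-time shift in mid-to-late Transformer layers, and that the refinements in these layers are both transferable (they carry over to other models via merging) and necessary (freezing them degrades performance). RL's core contribution is therefore not a uniform improvement in visual perception but a systematic refinement of vision-to-reasoning alignment and reasoning ability.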

Link: https://arxiv.org/abs/2602.12395
Authors: Xirui Li, Ming Li, Tianyi Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge the gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability test via model merging. Instead, RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL’s reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.

[CV-67] Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models CVPR2024

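[Quick read]: This paper addresses insufficient precision of user spatial control in image editing, specifically in interactive point-based editing with diffusion models. The key to the solution is optimizing a single diffusion latent at an intermediate diffusion timestep, combined with identity-preserving fine-tuning and spatial regularization, to achieve precise spatial manipulation. Experiments show that the method improves editing accuracy under specific hyperparameter configurations, but that performance is sensitive to the optimized timestep and to the feature level used for motion supervision.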

Link: https://arxiv.org/abs/2602.12393
Authors: Ali Subhan, Ashir Raza
Affiliations: University of Ljubljana
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 8 figures. Reproducibility study of DragDiffusion (CVPR 2024). Submitted to TMLR Reproducibility Challenge. Code available on GitHub

Abstract:DragDiffusion is a diffusion-based method for interactive point-based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single diffusion latent at an intermediate timestep, together with identity-preserving fine-tuning and spatial regularization. This work presents a reproducibility study of DragDiffusion using the authors’ released implementation and the DragBench benchmark. We reproduce the main ablation studies on diffusion timestep selection, LoRA-based fine-tuning, mask regularization strength, and UNet feature supervision, and observe close agreement with the qualitative and quantitative trends reported in the original work. At the same time, our experiments show that performance is sensitive to a small number of hyperparameter assumptions, particularly the optimized timestep and the feature level used for motion supervision, while other components admit broader operating ranges. We further evaluate a multi-timestep latent optimization variant and find that it does not improve spatial accuracy while substantially increasing computational cost. Overall, our findings support the central claims of DragDiffusion while clarifying the conditions under which they are reliably reproducible. Code is available at this https URL.

[CV-68] Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues

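[Quick read]: This paper addresses the limited generalization of models for synthetic image detection (SID), in particular the unstable performance of existing CLIP-based detectors in practical settings on images from high-quality diffusion models. The key to the solution is SynthCLIC, a debiased paired dataset of high-quality synthetic images from recent diffusion models and their real counterparts, combined with an interpretable linear head and a text-grounded concept model to systematically analyze which semantic cues in CLIP features drive detection decisions. The study finds that current CLIP-based detectors rely mainly on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering) rather than overt generator-specific artifacts, which reveals the underlying reason their generalization varies across generative architectures. This underlines the need for continual model updates and broader training coverage to advance more universal, robust SID methods.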

Link: https://arxiv.org/abs/2602.12381
Authors: Marco Willi, Melanie Mathys, Michael Graber
Affiliations: University of Applied Sciences FHNW; Institute for Data Science I4DS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 figures; 23 pages

Abstract: Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs; unfortunately, SID methods often struggle to generalize to novel generative models and often perform poorly in practical settings. CLIP, a foundational vision-language model which yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet, the underlying relevant cues embedded in CLIP features remain unknown. It is unclear whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, both of which would render them useless in practical settings or on generative models of high quality. We introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias in SID. Using an interpretable linear head with de-correlated activations and a text-grounded concept model, we analyze what CLIP-based detectors learn. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on our high-quality diffusion dataset SynthCLIC, and generalization across generator families drops to as low as 0.37 mAP. We find that the detectors primarily rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering), rather than overt generator-specific artifacts. CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.

[CV-69] TFT-ACB-XML: Decision-Level Integration of Customized Temporal Fusion Transformer and Attention-BiLSTM with XGBoost Meta-Learner for BTC Price Forecasting

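[Quick read]: This paper addresses the non-linearity, high volatility, and temporal irregularity of Bitcoin (BTC) price forecasting, while overcoming the limited interpretability and cross-market-condition generalization of existing deep learning models. The key to the solution is a hybrid stacked-generalization framework (TFT-ACB-XML) that integrates two customized base learners in parallel: a Temporal Fusion Transformer (TFT) built on variable selection networks and interpretable single-head attention, which captures long-range dependencies and global temporal dynamics, and an Attention-Customized Bidirectional Long Short-Term Memory network (ACB) with a new attention mechanism, which extracts short-term sequential dependencies. An error-reciprocal weighting strategy then adaptively weights the two outputs, and the weighted results are concatenated into a feature vector and fed to an XGBoost regressor, which models non-linear residuals and improves the final prediction accuracy.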

Link: https://arxiv.org/abs/2602.12380
Authors: Raiz Ud Din (1), Saddam Hussain Khan (2)
Affiliations: (1) Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan; (2) Interdisciplinary Research Center for Smart Mobility and Logistics, King Fahad University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 41 pages, 15 Figures, 12 Tables

Abstract:Accurate forecasting of Bitcoin (BTC) has always been a challenge because decentralized markets are non-linear, highly volatile, and have temporal irregularities. Existing deep learning models often struggle with interpretability and generalization across diverse market conditions. This research presents a hybrid stacked-generalization framework, TFT-ACB-XML, for BTC closing price prediction. The framework integrates two parallel base learners: a customized Temporal Fusion Transformer (TFT) and an Attention-Customized Bidirectional Long Short-Term Memory network (ACB), followed by an XGBoost regressor as the meta-learner. The customized TFT model handles long-range dependencies and global temporal dynamics via variable selection networks and interpretable single-head attention. The ACB module uses a new attention mechanism alongside the customized BiLSTM to capture short-term sequential dependencies. Predictions from both customized TFT and ACB are weighted through an error-reciprocal weighting strategy. These weights are derived from validation performance, where a model showing lower prediction error receives a higher weight. Finally, the framework concatenates these weighted outputs into a feature vector and feeds the vector to an XGBoost regressor, which captures non-linear residuals and produces the final BTC closing price prediction. Empirical validation using BTC data from October 1, 2014, to January 5, 2026, shows improved performance of the proposed framework compared to recent Deep Learning and Transformer baseline models. The results show a MAPE of 0.65%, an MAE of 198.15, and an RMSE of 258.30 for one-step-ahead out-of-sample under a walk-forward evaluation on the test block. The evaluation period spans the 2024 BTC halving and the spot ETFs (exchange-traded funds) period, which coincide with major liquidity and volatility shifts.
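
The error-reciprocal weighting step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract only states that a lower validation error yields a higher weight, so normalizing the reciprocals to sum to 1 is an assumption, and the function names and numbers are hypothetical.

```python
from typing import List, Sequence


def error_reciprocal_weights(val_errors: Sequence[float]) -> List[float]:
    """Weights proportional to 1/error; normalizing to sum to 1 is an assumption."""
    if any(e <= 0 for e in val_errors):
        raise ValueError("validation errors must be positive")
    inverse = [1.0 / e for e in val_errors]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_meta_features(preds: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Scale each base-learner prediction by its weight and concatenate the results."""
    return [p * w for p, w in zip(preds, weights)]


# Hypothetical validation errors (e.g. MAE) for the TFT and ACB base learners.
weights = error_reciprocal_weights([200.0, 300.0])
# Hypothetical one-step-ahead predictions from the two base learners.
features = weighted_meta_features([64000.0, 64500.0], weights)
```

With these illustrative errors the weights come out at roughly 0.6 and 0.4, so the more accurate learner dominates. In the framework described above, a vector like `features` would then be passed to the XGBoost meta-learner.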

[CV-70] LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

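[Quick read]: This paper addresses the underdevelopment of unified motion-language generation and understanding models, in particular two limitations of existing methods: fine-tuning large language models (LLMs) on paired motion-text data tends to cause catastrophic forgetting of linguistic capabilities, and converting motion into discrete representations via quantization introduces substantial jitter artifacts. The key to the solution is LLaMo, a unified framework whose core innovation is extending a pretrained LLM with a modality-specific Mixture-of-Transformers (MoT) architecture, which preserves the original language understanding while enabling scalable multimodal adaptation. The framework also encodes human motion into a causal continuous latent space and maintains the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, which supports real-time streaming motion generation (30 FPS) and markedly improves the fidelity of text-to-motion generation and motion-to-text captioning, especially in zero-shot motion generation.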

Link: https://arxiv.org/abs/2602.12370
Authors: Zekun Li, Sizhe An, Chengcheng Tang, Chuan Guo, Ivan Shugurov, Linguang Zhang, Amy Zhao, Srinath Sridhar, Lingling Tao, Abhay Mittal
Affiliations: Brown University; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.

[CV-71] Thermal Imaging for Contactless Cardiorespiratory and Sudomotor Response Monitoring

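[Quick read]: This paper addresses contactless estimation of physiological signals from thermal infrared imaging, in particular electrodermal activity (EDA), which conventional visible-light methods cannot access, while also improving the accuracy of contactless heart rate (HR) and breathing rate (BR) measurement. The key to the solution is a signal-processing pipeline that first tracks facial anatomical regions and applies spatial aggregation, then separates slow sudomotor trends from faster cardiorespiratory components. For HR it applies an orthogonal matrix image transformation (OMIT) decomposition across multiple regions of interest (ROIs); for BR it averages nasal and cheek signals before spectral peak detection. The method is validated on the SIMULATOR STUDY 1 dataset for EDA, HR, and BR estimation, providing baseline performance and design guidance for thermal biosignal extraction.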

Link: https://arxiv.org/abs/2602.12361
Authors: Constantino Álvarez Casado, Mohammad Rahman, Sasan Sharifipour, Nhi Nguyen, Manuel Lage Cañellas, Xiaoting Wu, Miguel Bordallo López
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 6 figures, 3 tables, 22 references, 1 equation, conference

Abstract: Thermal infrared imaging captures skin temperature changes driven by autonomic regulation and can potentially provide contactless estimation of electrodermal activity (EDA), heart rate (HR), and breathing rate (BR). While visible-light methods address HR and BR, they cannot access EDA, a standard marker of sympathetic activation. This paper characterizes the extraction of these three biosignals from facial thermal video using a signal-processing pipeline that tracks anatomical regions, applies spatial aggregation, and separates slow sudomotor trends from faster cardiorespiratory components. For HR, we apply an orthogonal matrix image transformation (OMIT) decomposition across multiple facial regions of interest (ROIs), and for BR we average nasal and cheek signals before spectral peak detection. We evaluate 288 EDA configurations and the HR/BR pipeline on 31 sessions from the public SIMULATOR STUDY 1 (SIM1) driver monitoring dataset. The best fixed EDA configuration (nose region, exponential moving average) reaches a mean absolute correlation of 0.40 ± 0.23 against palm EDA, with individual sessions reaching 0.89. BR estimation achieves a mean absolute error of 3.1 ± 1.1 bpm, while HR estimation yields 13.8 ± 7.5 bpm MAE, limited by the low camera frame rate (7.5 Hz). We report signal polarity alternation across sessions, short thermodynamic latency for well-tracked signals, and condition-dependent and demographic effects on extraction quality. These results provide baseline performance bounds and design guidance for thermal contactless biosignal estimation.
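
The breathing-rate step (average the nasal and cheek signals, then detect the spectral peak) can be sketched roughly as below. This is an illustrative sketch, not the authors' pipeline: the 0.1 to 0.7 Hz search band, the Hann window, and the function name are assumptions; only the 7.5 Hz frame rate comes from the abstract.

```python
import numpy as np


def breathing_rate_bpm(nasal, cheek, fs, band=(0.1, 0.7)):
    """Estimate breathing rate (breaths/min) from two ROI temperature traces.

    band is an assumed plausible breathing range in Hz, not a value from the paper.
    """
    signal = (np.asarray(nasal, dtype=float) + np.asarray(cheek, dtype=float)) / 2.0
    signal = signal - signal.mean()                 # remove the DC offset
    window = np.hanning(signal.size)                # reduce spectral leakage
    power = np.abs(np.fft.rfft(signal * window)) ** 2
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    if not in_band.any():
        raise ValueError("signal too short for the requested band")
    peak_freq = freqs[in_band][np.argmax(power[in_band])]
    return 60.0 * peak_freq


# Synthetic 60 s traces at the 7.5 Hz frame rate quoted in the abstract.
fs = 7.5
t = np.arange(450) / fs
nasal = np.sin(2 * np.pi * 0.25 * t)                # 0.25 Hz = 15 breaths/min
cheek = 0.5 * np.sin(2 * np.pi * 0.25 * t)
rate = breathing_rate_bpm(nasal, cheek, fs)
```

Restricting the peak search to a breathing band keeps slow sudomotor drift and faster cardiac content from being picked up as the respiratory peak.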

[CV-72] LongNav-R1: Horizon-Adaptive Multi-Turn RL for Long-Horizon VLA Navigation

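[Quick read]: This paper addresses decision coherence and sample efficiency of Visual-Language-Action (VLA) models in long-horizon navigation, in particular the inability of the existing single-turn paradigm to model the causal effects of historical interactions and long-term future consequences. The key to the solution is LongNav-R1, an end-to-end multi-turn reinforcement learning (RL) framework that models navigation decisions as a continuous multi-turn conversation between the VLA policy and the embodied environment, which supports reasoning over historical interactions and sequential future outcomes. It also introduces Horizon-Adaptive Policy Optimization, which explicitly accounts for varying trajectory lengths during advantage estimation to achieve more accurate temporal credit assignment, improving behavioral diversity and robustness on long-horizon tasks.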

Link: https://arxiv.org/abs/2602.12351
Authors: Yue Hu, Avery Xi, Qixin Xiao, Seth Isaacson, Henry X. Liu, Ram Vasudevan, Maani Ghaffari
Affiliations: University of Michigan
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: VLA, Navigation

Abstract: This paper develops LongNav-R1, an end-to-end multi-turn reinforcement learning (RL) framework designed to optimize Visual-Language-Action (VLA) models for long-horizon navigation. Unlike the existing single-turn paradigm, LongNav-R1 reformulates the navigation decision process as a continuous multi-turn conversation between the VLA policy and the embodied environment. This multi-turn RL framework offers two distinct advantages: i) it enables the agent to reason about the causal effects of historical interactions and sequential future outcomes; and ii) it allows the model to learn directly from online interactions, fostering diverse trajectory generation and avoiding the behavioral rigidity often imposed by human demonstrations. Furthermore, we introduce Horizon-Adaptive Policy Optimization. This mechanism explicitly accounts for varying horizon lengths during advantage estimation, facilitating accurate temporal credit assignment over extended sequences. Consequently, the agent develops diverse navigation behaviors and resists collapse during long-horizon tasks. Experiments on object navigation benchmarks validate the framework's efficacy: with 4,000 rollout trajectories, LongNav-R1 boosts the Qwen3-VL-2B success rate from 64.3% to 73.0%. These results demonstrate superior sample efficiency and significantly outperform state-of-the-art methods. The model's generalizability and robustness are further validated by its zero-shot performance in long-horizon real-world navigation settings. All source code will be open-sourced upon publication.

[CV-73] LatentAM: Real-Time Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning

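[Quick read]: This paper addresses the scalability and real-time performance of feature representations for open-vocabulary robotic perception, namely how to build high-fidelity, semantically rich latent feature maps from streaming RGB-D observations while supporting plug-and-play integration with different vision-language models (VLMs). The key to the solution is the LatentAM framework, which uses online dictionary learning for model-agnostic, pretraining-free latent feature mapping: each Gaussian primitive is associated with a compact query vector, and approximate VLM embeddings are generated with an attention mechanism over a learnable dictionary. It also introduces trust-region-regularized optimization to adapt to evolving scene semantics, and a voxel-hashing-based global-local map management scheme that keeps GPU memory bounded while running in real time (12-35 FPS) over long trajectories.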

Link: https://arxiv.org/abs/2602.12314
Authors: Junwoon Lee, Yulun Tian
Affiliations: University of Michigan
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures

Abstract:We present LatentAM, an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Instead of distilling high-dimensional Vision-Language Model (VLM) embeddings using model-specific decoders, LatentAM proposes an online dictionary learning approach that is both model-agnostic and pretraining-free, enabling plug-and-play integration with different VLMs at test time. Specifically, our approach associates each Gaussian primitive with a compact query vector that can be converted into approximate VLM embeddings using an attention mechanism with a learnable dictionary. The dictionary is initialized efficiently from streaming observations and optimized online to adapt to evolving scene semantics under trust-region regularization. To scale to long trajectories and large environments, we further propose an efficient map management strategy based on voxel hashing, where optimization is restricted to an active local map on the GPU, while the global map is stored and indexed on the CPU to maintain bounded GPU memory usage. Experiments on public benchmarks and a large-scale custom dataset demonstrate that LatentAM attains significantly better feature reconstruction fidelity compared to state-of-the-art methods, while achieving near-real-time speed (12-35 FPS) on the evaluated datasets. Our project page is at: this https URL
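
The idea of turning a compact per-Gaussian query into an approximate VLM embedding through attention over a dictionary can be sketched as follows. The abstract does not give the exact formulation, so the key/value split of the dictionary, the scaled dot-product form, and all names and shapes here are assumptions for illustration.

```python
import numpy as np


def decode_embeddings(queries, dict_keys, dict_values):
    """Softmax attention of per-Gaussian queries over a shared dictionary.

    queries:     (N, d_q) compact vectors, one per Gaussian primitive
    dict_keys:   (K, d_q) dictionary keys (assumed structure)
    dict_values: (K, d_e) dictionary atoms in the VLM embedding space
    returns:     (N, d_e) approximate VLM embeddings
    """
    queries = np.asarray(queries, dtype=float)
    dict_keys = np.asarray(dict_keys, dtype=float)
    dict_values = np.asarray(dict_values, dtype=float)
    scores = queries @ dict_keys.T / np.sqrt(queries.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ dict_values
```

Under this sketch the map stores only the small `queries` per primitive, while the high-dimensional embedding space lives in the shared dictionary. Swapping the VLM would then mean swapping or re-learning `dict_values`, which is one way to read the model-agnostic claim.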

[CV-74] Language-Guided Invariance Probing of Vision-Language Models

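[Quick read]: This paper addresses the evaluation of how robust current vision-language models (VLMs) are to controlled linguistic perturbations, in particular whether they respond reliably to meaning-preserving paraphrases and meaning-changing semantic flips. Conventional retrieval metrics fail to reveal behavioral differences under subtle linguistic changes, leaving a blind spot in assessing a model's true semantic understanding. The key to the solution is Language-Guided Invariance Probing (LGIP), a benchmark that automatically constructs meaning-preserving paraphrases and rule-based semantic flips (modifying object category, color, or count) on MS COCO, and quantifies the invariance error, semantic sensitivity gap, and positive-rate statistic of models in image-text matching, yielding a model-agnostic diagnostic of the linguistic robustness of VLMs.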

Link: https://arxiv.org/abs/2511.13494
Authors: Jae Joong Lee
Affiliations: Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.

[CV-75] Represent Micro-Doppler Signature in Orders

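[Quick read]: This paper addresses the difficulty of recognizing human activities with through-the-wall radar (TWR) in non-line-of-sight sensing in complex environments, caused by the low distinctiveness of micro-Doppler signatures between similar activities such as gun carrying and normal walking. Conventional methods take time-frequency spectrograms as input, whose high dimensionality makes model training and inference inefficient. The key to the solution is a Chebyshev-polynomial-based time-frequency feature representation, the Chebyshev-time map, which maps time-frequency spectrum slices into a robust coefficient space via orthogonal Chebyshev polynomial decomposition. This preserves the multi-order morphological detail of the spectrum while substantially compressing the input dimensions, improving computational efficiency while maintaining recognition accuracy.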

Link: https://arxiv.org/abs/2602.12985
Authors: Weicheng Gao
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures, 5 tables

Abstract:Non-line-of-sight sensing of human activities in complex environments is enabled by multiple-input multiple-output through-the-wall radar (TWR). However, the distinctiveness of micro-Doppler signature between similar indoor human activities such as gun carrying and normal walking is minimal, while the large scale of input images required for effective identification utilizing time-frequency spectrograms creates challenges for model training and inference efficiency. To address this issue, the Chebyshev-time map is proposed in this paper, which is a method characterizing micro-Doppler signature using polynomial orders. The parametric kinematic models for human motion and the TWR echo model are first established. Then, a time-frequency feature representation method based on orthogonal Chebyshev polynomial decomposition is proposed. The kinematic envelopes of the torso and limbs are extracted, and the time-frequency spectrum slices are mapped into a robust Chebyshev-time coefficient space, preserving the multi-order morphological detail information of time-frequency spectrum. Numerical simulations and experiments are conducted to verify the effectiveness of the proposed method, which demonstrates the capability to characterize armed and unarmed indoor human activities while effectively compressing the scale of the time-frequency spectrum to achieve a balance between recognition accuracy and input data dimensions. The open-source code of this paper can be found in: this https URL.
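
To make the dimensionality reduction concrete, the sketch below fits each time slice of a spectrogram with a low-order Chebyshev series and keeps only the coefficients. It is a simplified illustration under stated assumptions: the paper first extracts kinematic envelopes of the torso and limbs, which is omitted here, and the function name, the plain least-squares fit, and the chosen order are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def chebyshev_time_map(spectrogram, order):
    """Fit every time slice of a (n_freq, n_time) spectrogram with Chebyshev coefficients.

    Returns an (order + 1, n_time) coefficient map.
    """
    spec = np.asarray(spectrogram, dtype=float)
    x = np.linspace(-1.0, 1.0, spec.shape[0])   # frequency axis mapped to [-1, 1]
    return C.chebfit(x, spec, order)            # least-squares fit of all columns at once
```

For a spectrogram with 256 frequency bins, an order-15 fit would shrink each slice from 256 values to 16 coefficients, which is the kind of input compression the abstract describes.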

[CV-76] Statistical Opportunities in Neuroimaging

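[Quick read]: This paper addresses the statistical challenges in neuroimaging research that arise from the multiscale complexity of the brain and the high dimensionality of the data, including measurement noise, motion artifacts, substantial inter-subject and inter-site variability, and the sheer scale of modern studies. The key to the solution is closer collaboration between statistics, neuroscience, and clinical medicine: developing more advanced data modeling methods and statistical tools to improve understanding of brain development, the adult and aging brain, neurodegenerative and neuropsychiatric disorders, and brain encoding and decoding, and thereby translating neuroimaging findings into precise diagnostics, mechanistic insight, and personalized treatment.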

Link: https://arxiv.org/abs/2602.12974
Authors: Jian Kang, Thomas Nichols, Lexin Li, Martin A. Lindquist, Hongtu Zhu
Affiliations: University of Michigan; University of Oxford; University of California, Berkeley; Johns Hopkins University; University of North Carolina at Chapel Hill
Categories: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
Comments: 33 pages, 3 figures

Abstract:Neuroimaging has profoundly enhanced our understanding of the human brain by characterizing its structure, function, and connectivity through modalities like MRI, fMRI, EEG, and PET. These technologies have enabled major breakthroughs across the lifespan, from early brain development to neurodegenerative and neuropsychiatric disorders. Despite these advances, the brain is a complex, multiscale system, and neuroimaging measurements are correspondingly high-dimensional. This creates major statistical challenges, including measurement noise, motion-related artifacts, substantial inter-subject and site/scanner variability, and the sheer scale of modern studies. This paper explores statistical opportunities and challenges in neuroimaging across four key areas: (i) brain development from birth to age 20, (ii) the adult and aging brain, (iii) neurodegeneration and neuropsychiatric disorders, and (iv) brain encoding and decoding. After a quick tutorial on major imaging technologies, we review cutting-edge studies, underscore data and modeling challenges, and highlight research opportunities for statisticians. We conclude by emphasizing that close collaboration among statisticians, neuroscientists, and clinicians is essential for translating neuroimaging advances into improved diagnostics, deeper mechanistic insight, and more personalized treatments.

[CV-77] Dual-Phase Cross-Modal Contrastive Learning for CMR-Guided ECG Representations for Cardiovascular Disease Assessment

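[Quick read]: This paper addresses the fact that the electrocardiogram (ECG) does not directly reflect cardiac structure and mechanical function in clinical practice, which limits its use for comprehensive cardiac phenotyping. Existing methods mostly learn from 2D cardiac magnetic resonance imaging (CMR) data and cannot fully capture 3D cardiac anatomy and its dynamics over the cardiac cycle. The key to the solution is a contrastive learning framework that jointly aligns ECG signals with 3D CMR volumes at end-diastole (ED) and end-systole (ES), building a finer-grained representation mapping in a shared latent space. A dual-phase contrastive loss anchors each ECG sample to the 3D CMR representations of both cardiac phases, enabling flexible disentanglement of structural and functional properties and improving the extraction of functional parameters from ECG by 9.2%, which offers a low-cost, scalable route to cardiac phenotype inference.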

Link: https://arxiv.org/abs/2602.12883
Authors: Laura Alvarez-Florez, Angel Bujalance-Gomez, Femke Raijmakers, Samuel Ruiperez-Campillo, Maarten Z. H. Kolk, Jesse Wiers, Julia Vogt, Erik J. Bekkers, Ivana Išgum, Fleur V. Y. Tjong
Affiliations: Amsterdam University Medical Center, The Netherlands; University of Amsterdam, The Netherlands; ETH Zurich, Switzerland; Mayo Clinic, Rochester, United States of America
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted at SPIE Medical Imaging 2026 Conference

Abstract: Cardiac magnetic resonance imaging (CMR) offers detailed evaluation of cardiac structure and function, but its limited accessibility restricts use to selected patient populations. In contrast, the electrocardiogram (ECG) is ubiquitous and inexpensive, and provides rich information on cardiac electrical activity and rhythm, yet offers limited insight into underlying cardiac structure and mechanical function. To address this, we introduce a contrastive learning framework that improves the extraction of clinically relevant cardiac phenotypes from ECG by learning from paired ECG-CMR data. Our approach aligns ECG representations with 3D CMR volumes at end-diastole (ED) and end-systole (ES), with a dual-phase contrastive loss to anchor each ECG jointly with both cardiac phases in a shared latent space. Unlike prior methods limited to 2D CMR representations with or without a temporal component, our framework models 3D anatomy at both ED and ES phases as distinct latent representations, enabling flexible disentanglement of structural and functional cardiac properties. Using over 34,000 ECG-CMR pairs from the UK Biobank, we demonstrate improved extraction of image-derived phenotypes from ECG, particularly for functional parameters (↑9.2%), while improvements in clinical outcome prediction remained modest (↑0.7%). This strategy could enable scalable and cost-effective extraction of image-derived traits from ECG. The code for this research is publicly available.
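
One plausible reading of the dual-phase contrastive loss is a sum of two InfoNCE terms, one pairing each ECG embedding with its ED volume embedding and one with its ES volume embedding. The sketch below follows that reading. The symmetric InfoNCE form, the equal weighting of the two terms, the temperature value, and all function names are assumptions, since the abstract does not specify them.

```python
import numpy as np


def _log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE between row-aligned embedding batches a and b of shape (B, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    a_to_b = -np.diag(_log_softmax(logits, axis=1)).mean()
    b_to_a = -np.diag(_log_softmax(logits, axis=0)).mean()
    return 0.5 * (a_to_b + b_to_a)


def dual_phase_loss(ecg, cmr_ed, cmr_es, temperature=0.1):
    """Anchor each ECG embedding to both its ED and its ES CMR embedding."""
    return info_nce(ecg, cmr_ed, temperature) + info_nce(ecg, cmr_es, temperature)
```

Keeping ED and ES as separate targets, rather than fusing them before the loss, matches the abstract's point that the two phases are modeled as distinct latent representations.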

[CV-78] 3DLAND: 3D Lesion Abdominal Anomaly Localization Dataset

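[Quick read]: This paper addresses the fact that existing abdominal CT datasets commonly lack 3D annotations, have limited multi-organ coverage, and associate lesions with organs imprecisely, which hinders robust representation learning and clinical applications. The key to the solution is 3DLAND, a large-scale benchmark dataset of over 6,000 contrast-enhanced CT volumes and over 20,000 high-fidelity 3D lesion annotations explicitly linked to seven abdominal organs (liver, kidneys, pancreas, spleen, stomach, gallbladder, and others). Annotations are generated efficiently and accurately by a three-phase automated pipeline (automated spatial reasoning, prompt-optimized 2D segmentation, and memory-guided 3D propagation) and validated by expert radiologists, with surface Dice scores exceeding 0.75, providing a scalable evaluation platform for anomaly detection, localization, and cross-organ transfer learning.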

Link: https://arxiv.org/abs/2602.12820
Authors: Mehran Advand, Zahra Dehghanian, Navid Faraji, Reza Barati, Seyed Amir Ahmad Safavi-Naini, Hamid R. Rabiee
Affiliations: Sharif University of Technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing medical imaging datasets for abdominal CT often lack three-dimensional annotations, multi-organ coverage, or precise lesion-to-organ associations, hindering robust representation learning and clinical applications. To address this gap, we introduce 3DLAND, a large-scale benchmark dataset comprising over 6,000 contrast-enhanced CT volumes with over 20,000 high-fidelity 3D lesion annotations linked to seven abdominal organs: liver, kidneys, pancreas, spleen, stomach, and gallbladder. Our streamlined three-phase pipeline integrates automated spatial reasoning, prompt-optimized 2D segmentation, and memory-guided 3D propagation, validated by expert radiologists with surface dice scores exceeding 0.75. By providing diverse lesion types and patient demographics, 3DLAND enables scalable evaluation of anomaly detection, localization, and cross-organ transfer learning for medical AI. Our dataset establishes a new benchmark for evaluating organ-aware 3D segmentation models, paving the way for advancements in healthcare-oriented AI. To facilitate reproducibility and further research, the 3DLAND dataset and implementation code are publicly available at this https URL.

[CV-79] VineetVC: Adaptive Video Conferencing Under Severe Bandwidth Constraints Using Audio-Driven Talking-Head Reconstruction

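[Quick read]: This paper addresses the loss of stability in real-time video conferencing when bandwidth is severely depleted on consumer and constrained networks, which manifests as saturated encoder rate control, rising packet loss, falling frame rates, and markedly higher end-to-end latency. The key to the solution is an adaptive conferencing system that combines WebRTC media delivery with an audio-driven talking-head reconstruction pathway and uses telemetry-driven mode switching for dynamic optimization. Its core components are a WebSocket signaling service, an optional SFU (Selective Forwarding Unit) for multi-party transmission, a browser client that supports real-time WebRTC statistics extraction and CSV telemetry export, and an AI REST service that generates a synthesized MP4 from a reference face image and recorded audio. The browser can replace its local camera track with this low-bandwidth synthesized stream (median 32.80 kbps), easing network pressure while maintaining call quality.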

Link: https://arxiv.org/abs/2602.12758
Authors: Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Hemendra Kumar Pandey, Amitabha Das, Sarbajit Pal
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Intense bandwidth depletion within consumer and constrained networks has the potential to undermine the stability of real-time video conferencing: encoder rate management becomes saturated, packet loss escalates, frame rates deteriorate, and end-to-end latency significantly increases. This work delineates an adaptive conferencing system that integrates WebRTC media delivery with a supplementary audio-driven talking-head reconstruction pathway and telemetry-driven mode regulation. The system consists of a WebSocket signaling service, an optional SFU for multi-party transmission, a browser client capable of real-time WebRTC statistics extraction and CSV telemetry export, and an AI REST service that processes a reference face image and recorded audio to produce a synthesized MP4; the browser can substitute its outbound camera track with the synthesized stream with a median bandwidth of 32.80 kbps. The solution incorporates a bandwidth-mode switching strategy and a client-side mode-state logger.

[CV-80] Lung nodule classification on CT scan patches using 3D convolutional neural networks

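[Quick read]: This paper addresses the difficulty of detecting and classifying lung nodules for early lung cancer diagnosis, especially given the large volume of CT studies, multiple nodules, and small nodules that complicate visual assessment. The key to the solution is three methodological improvements: (1) an advanced CT cropping strategy that focuses on the target nodule and reduces computational cost; (2) target filtering techniques that remove noisy labels; and (3) novel data augmentation methods that improve model robustness. Together these let the resulting classification subsystem operate across different acquisition protocols, scanner types, and upstream segmentation or detection models. On the LIDC-IDRI dataset it achieves a Macro ROC AUC of 0.9176 and a Macro F1-score of 0.7658 for the multiclass task, and a Binary ROC AUC of 0.9383 and a Binary F1-score of 0.8668 for the binary task, outperforming previously reported approaches and reaching state-of-the-art performance on this task.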

Link: https://arxiv.org/abs/2602.12750
Authors: Volodymyr Sydorskyi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments:

Abstract:Lung cancer remains one of the most common and deadliest forms of cancer worldwide. The likelihood of successful treatment depends strongly on the stage at which the disease is diagnosed. Therefore, early detection of lung cancer represents a critical medical challenge. However, this task poses significant difficulties for thoracic radiologists due to the large number of studies to review, the presence of multiple nodules within the lungs, and the small size of many nodules, which complicates visual assessment. Consequently, the development of automated systems that incorporate highly accurate and computationally efficient lung nodule detection and classification modules is essential. This study introduces three methodological improvements for lung nodule classification: (1) an advanced CT scan cropping strategy that focuses the model on the target nodule while reducing computational cost; (2) target filtering techniques for removing noisy labels; (3) novel augmentation methods to improve model robustness. The integration of these techniques enables the development of a robust classification subsystem within a comprehensive Clinical Decision Support System for lung cancer detection, capable of operating across diverse acquisition protocols, scanner types, and upstream models (segmentation or detection). The multiclass model achieved a Macro ROC AUC of 0.9176 and a Macro F1-score of 0.7658, while the binary model reached a Binary ROC AUC of 0.9383 and a Binary F1-score of 0.8668 on the LIDC-IDRI dataset. These results outperform several previously reported approaches and demonstrate state-of-the-art performance for this task.
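
The abstract does not describe the cropping strategy in detail, so the following is only a generic sketch of nodule-centered 3D patch extraction, the basic operation such a strategy builds on. The function name, the padding value of -1000 (roughly air in Hounsfield units), and the patch size are assumptions.

```python
import numpy as np


def crop_patch(volume, center, size, pad_value=-1000.0):
    """Cut a fixed-size 3D patch centered on a nodule, padding beyond the volume edge."""
    size = np.asarray(size, dtype=int)
    start = np.asarray(center, dtype=int) - size // 2
    stop = start + size
    patch = np.full(tuple(int(s) for s in size), pad_value, dtype=volume.dtype)
    src_lo = np.maximum(start, 0)
    src_hi = np.minimum(stop, volume.shape)
    if np.any(src_hi <= src_lo):
        return patch                               # requested patch lies outside the volume
    dst_lo = src_lo - start
    dst_hi = dst_lo + (src_hi - src_lo)
    src = tuple(slice(int(lo), int(hi)) for lo, hi in zip(src_lo, src_hi))
    dst = tuple(slice(int(lo), int(hi)) for lo, hi in zip(dst_lo, dst_hi))
    patch[dst] = volume[src]
    return patch
```

Feeding a classifier a small patch around the nodule instead of the full scan is what cuts the computational cost, and padding at the borders keeps the input shape fixed for nodules near the edge of the volume.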

[CV-81] Conference Proceedings of the Inaugural Conference of the International Society for Tractography (IST 2025 Bordeaux)

Quick read: These proceedings respond to limited cross-disciplinary collaboration in tractography and to the gap between methodological innovation and clinical application. The inaugural conference of the International Society for Tractography (IST) brought together experts in neuroanatomy, diffusion MRI, and clinical translation to connect methodological advances with scientific questions, supporting applications of tractography to neurological and psychiatric disorders, deep brain stimulation targeting, and brain development research.

Link: https://arxiv.org/abs/2602.12410
Authors: Flavio Dell Acqua,Maxime Descoteaux,Graham Little,Laurent Petit,Dogu Baran Aydogan,Stephanie Forkel,Alexander Leemans,Simona Schiavi,Michel Thiebaut de Schotten
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: Proceedings of the Inaugural Conference of the International Society for Tractography (IST Conference 2025). Held at the Institut des Maladies Neurodégénératives in Bordeaux, France, October 13-16, 2025. Society website: this http URL

Abstract:This collection comprises the abstracts presented during poster, power pitch and oral sessions at the Inaugural Conference of the International Society for Tractography (IST Conference 2025), held in Bordeaux, France, from October 13-16, 2025. The conference was designed to foster meaningful exchange and collaboration between disparate fields. The overall focus was on advancing research, innovation, and community in the common fields of interest: neuroanatomy, tractography methods and scientific/clinical applications of tractography. The included abstracts cover the latest advancements in tractography, Diffusion MRI, and related fields, including new work on neurological and psychiatric disorders, deep brain stimulation targeting, and brain development. This landmark event brought together world-leading experts to discuss critical challenges and chart the future direction of the field.

[CV-82] Quantum walk inspired JPEG compression of images

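Quick read: This paper targets a limitation of conventional JPEG compression: the quantization table (Qtable) is fixed and adapts poorly to different image content, which constrains the trade-off between compression efficiency and reconstruction quality. The proposed remedy is a quantum-walk-inspired adaptive quantization framework built on Quantum Walk Inspired Optimization (QWIO), which searches a continuous parameter space of frequency-band scaling factors for an optimized Qtable under a unified rate-distortion objective that jointly weighs reconstruction fidelity and compression efficiency. The method leaves decoder compatibility untouched and remains JPEG compliant; on MNIST, CIFAR10, and ImageNet subsets it reports average PSNR gains of 3 to 6 dB, along with better structural preservation of edges, contours, and luminance transitions.

Link: https://arxiv.org/abs/2602.12306
Authors: Abhishek Verma,Sahil Tomar,Sandeep Kumar
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Information Theory (cs.IT)
Comments: 8 pages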

Abstract:This work proposes a quantum-inspired adaptive quantization framework that enhances classical JPEG compression by introducing a learned, optimized Qtable derived using a Quantum Walk Inspired Optimization (QWIO) search strategy. The optimizer searches a continuous parameter space of frequency-band scaling factors under a unified rate-distortion objective that jointly considers reconstruction fidelity and compression efficiency. The proposed framework is evaluated on MNIST, CIFAR10, and ImageNet subsets, using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Bits Per Pixel (BPP), and error heatmap visual analysis as evaluation metrics. Experimental results show average gains ranging from 3 to 6 dB PSNR, along with better structural preservation of edges, contours, and luminance transitions, without modifying decoder compatibility. The structure remains JPEG compliant and can be implemented using accessible scientific packages, making it ideal for deployment and practical research use.
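As a rough illustration of what "frequency-band scaling factors" applied to a Qtable could look like, here is a minimal sketch. The base table is the standard JPEG luminance table; the band assignment (anti-diagonals grouped into equal-width bands), the function names, and the `band_scales` parameter are assumptions made for illustration and are not taken from the paper. The QWIO search itself and the rate-distortion objective are not shown; only the table construction and the PSNR metric are.

```python
import numpy as np

# Standard JPEG luminance quantization table (ITU-T T.81, Annex K).
BASE_QTABLE = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
])


def scale_qtable(base, band_scales):
    """Multiply each table entry by the scale factor of its frequency band.

    Assumption: the band of entry (u, v) is derived from its anti-diagonal
    u + v (0..14), split into len(band_scales) equal-width groups.
    """
    n_bands = len(band_scales)
    u, v = np.indices(base.shape)
    band = np.minimum((u + v) * n_bands // 15, n_bands - 1)
    scaled = np.rint(base * np.asarray(band_scales, dtype=float)[band])
    # Baseline JPEG stores 8-bit table entries, so keep values in 1..255.
    return np.clip(scaled, 1, 255).astype(np.int64)


def psnr(original, reconstructed, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

An optimizer would propose `band_scales`, encode with the resulting table, and score the result with a combination of distortion (for example PSNR) and bit rate; because only the table changes, any standard JPEG decoder can still read the output.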

Artificial Intelligence

[AI-0] Optimal Take-off under Fuzzy Clearances

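Quick read: This paper addresses obstacle avoidance for unmanned aircraft under uncertainty, motivated by the limitations of classical optimal control under uncertainty and by the need for interpretable decisions in safety-critical aviation systems. It proposes a hybrid architecture combining optimal control with a Fuzzy Rule Based System (FRBS): a three-stage Takagi-Sugeno-Kang (TSK) fuzzy layer modulates constraint radii, urgency levels, and activation decisions based on regulatory separation minima and airworthiness guidelines from FAA and EASA. The fuzzy-derived clearances enter as soft constraints in an optimal control problem solved with the FALCON toolbox and IPOPT, giving adaptive constraint handling while limiting unnecessary recomputation.

Link: https://arxiv.org/abs/2602.13166
Authors: Hugo Henry,Arthur Tsai,Kelly Cohen
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 12 figures, conference paper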

Abstract:This paper presents a hybrid obstacle avoidance architecture that integrates Optimal Control under clearance with a Fuzzy Rule Based System (FRBS) to enable adaptive constraint handling for unmanned aircraft. Motivated by the limitations of classical optimal control under uncertainty and the need for interpretable decision making in safety-critical aviation systems, we design a three-stage Takagi-Sugeno-Kang fuzzy layer that modulates constraint radii, urgency levels, and activation decisions based on regulatory separation minima and airworthiness guidelines from FAA and EASA. These fuzzy-derived clearances are then incorporated as soft constraints into an optimal control problem solved using the FALCON toolbox and IPOPT. The framework aims to reduce unnecessary recomputations by selectively activating obstacle avoidance updates while maintaining compliance with aviation procedures. A proof-of-concept implementation using a simplified aircraft model demonstrates that the approach can generate optimal trajectories with computation times of 2,3 seconds per iteration in a single-threaded MATLAB environment, suggesting feasibility for near-real-time applications. However, our experiments revealed a critical software incompatibility in the latest versions of FALCON and IPOPT, in which the Lagrangian penalty term remained identically zero, preventing proper constraint enforcement. This behavior was consistent across scenarios and indicates a solver-toolbox regression rather than a modeling flaw. Future work includes validating this effect by reverting to earlier software versions, optimizing the fuzzy membership functions using evolutionary methods, and extending the system to higher-fidelity aircraft models and stochastic obstacle environments.
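To make the idea of a TSK fuzzy layer that outputs a constraint radius more concrete, the following is a minimal single-stage, zero-order TSK sketch. Everything specific in it is hypothetical: the two inputs, the membership breakpoints, the four rules, and the radii in metres are invented for illustration and do not come from the paper, the FAA, or EASA. The paper's layer has three stages and also produces urgency levels and activation decisions, which are not modelled here.

```python
def ramp_down(x, a, b):
    """Membership that is 1 below a and falls linearly to 0 at b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def ramp_up(x, a, b):
    """Complement of ramp_down: 0 below a, rising linearly to 1 at b."""
    return 1.0 - ramp_down(x, a, b)


def tsk_clearance_radius(distance_m, closing_speed_mps):
    """Zero-order TSK inference: weighted average of constant consequents.

    All breakpoints and radii below are hypothetical placeholder values.
    """
    near = ramp_down(distance_m, 200.0, 1000.0)
    far = ramp_up(distance_m, 200.0, 1000.0)
    slow = ramp_down(closing_speed_mps, 5.0, 30.0)
    fast = ramp_up(closing_speed_mps, 5.0, 30.0)

    # Each rule: (firing strength via product t-norm, constant radius in m).
    rules = [
        (near * fast, 300.0),
        (near * slow, 200.0),
        (far * fast, 150.0),
        (far * slow, 100.0),
    ]
    total = sum(w for w, _ in rules)  # equals 1 for these complementary sets
    return sum(w * z for w, z in rules) / total
```

The returned radius would then be handed to the optimal control problem as the size of a soft constraint region; because every rule is a readable if-then statement, the resulting clearance can be traced back to the rules that fired.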

[AI-1] In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach AAAI

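Quick read: This paper addresses the poor adaptability and efficiency of incident response systems facing rapidly evolving cyberattacks; prior reinforcement learning approaches rely on handcrafted simulators and discard useful semantics in raw system logs and alerts. The key idea is to use the pre-trained security knowledge and in-context learning of a large language model (LLM) to build a lightweight agent that integrates perception, reasoning, planning, and action. Without explicit modeling, the agent generates response strategies end to end, and it refines its attack conjecture and response plan by comparing LLM-simulated outcomes with actual observations, achieving in-context adaptation.

Link: https://arxiv.org/abs/2602.13156
Authors: Yiran Gao,Kim Hammar,Tao Li
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 2026 AAAI Summer Symposium on Human-Aware AI Agents for the Cyber Battlefield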

Abstract:Rapidly evolving cyberattacks demand incident response systems that can autonomously learn and adapt to changing threats. Prior work has extensively explored the reinforcement learning approach, which involves learning response strategies through extensive simulation of the incident. While this approach can be effective, it requires handcrafted modeling of the simulator and suppresses useful semantics from raw system logs and alerts. To address these limitations, we propose to leverage large language models’ (LLM) pre-trained security knowledge and in-context learning to create an end-to-end agentic solution for incident response planning. Specifically, our agent integrates four functionalities, perception, reasoning, planning, and action, into one lightweight LLM (14b model). Through fine-tuning and chain-of-thought reasoning, our LLM agent is capable of processing system logs and inferring the underlying network state (perception), updating its conjecture of attack models (reasoning), simulating consequences under different response strategies (planning), and generating an effective response (action). By comparing LLM-simulated outcomes with actual observations, the LLM agent repeatedly refines its attack conjecture and corresponding response, thereby demonstrating in-context adaptation. Our agentic approach is free of modeling and can run on commodity hardware. When evaluated on incident logs reported in the literature, our agent achieves recovery up to 23% faster than those of frontier LLMs.

[AI-2] Constrained Assumption-Based Argumentation Frameworks KR AAMAS2026

Quick read: This paper addresses a limit on the expressiveness of Assumption-based Argumentation (ABA): standard frameworks support only ground (variable-free) arguments and attacks built from propositional atoms, and so cannot handle settings that involve constrained variables. It proposes Constrained Assumption-based Argumentation (CABA), in which components, arguments, and attacks may contain constrained variables ranging over possibly infinite domains, and defines non-ground semantics for CABA in terms of several notions of non-ground attack. The new semantics conservatively generalise standard ABA semantics while extending ABA's representational power.

Link: https://arxiv.org/abs/2602.13135
Authors: Emanuele De Angelis(1),Fabio Fioravanti(2),Maria Chiara Meo(2),Alberto Pettorossi(3),Maurizio Proietti(1),Francesca Toni(4) ((1) CNR-IASI, Rome, Italy, (2) DEc, University ‘G. d’Annunzio’, Chieti-Pescara, Italy, (3) DICII, University of Rome ‘Tor Vergata’, Italy, (4) Imperial, London, UK)
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Extended version with proofs and additional results of the full paper accepted at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026). DOI: this https URL

Abstract:Assumption-based Argumentation (ABA) is a well-established form of structured argumentation. ABA frameworks with an underlying atomic language are widely studied, but their applicability is limited by a representational restriction to ground (variable-free) arguments and attacks built from propositional atoms. In this paper, we lift this restriction and propose a novel notion of constrained ABA (CABA), whose components, as well as arguments built from them, may include constrained variables, ranging over possibly infinite domains. We define non-ground semantics for CABA, in terms of various notions of non-ground attacks. We show that the new semantics conservatively generalise standard ABA semantics.

[AI-3] Which Algorithms Can Graph Neural Networks Learn?

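Quick read: This paper studies when graph neural networks (GNNs) that learn discrete algorithms generalize, in particular whether standard message-passing graph neural networks (MPNNs) trained on a finite set of small instances can extrapolate to inputs of arbitrary size. Existing work is either largely empirical without formal guarantees or concerned solely with expressivity. The paper proposes a general theoretical framework characterizing sufficient conditions under which MPNNs learn an algorithm from small instances and provably approximate its behavior on inputs of any size. The framework covers single-source shortest paths, minimum spanning trees, and general dynamic programming problems such as 0-1 knapsack. Impossibility results show which tasks standard MPNNs cannot learn, and more expressive MPNN-like architectures are derived to overcome those limits.

Link: https://arxiv.org/abs/2602.13106
Authors: Solveig Wittig,Antonis Vasileiou,Robert R. Nerem,Timo Stoll,Floris Geerts,Yusu Wang,Christopher Morris
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Neural and Evolutionary Computing (cs.NE)
Comments: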

Abstract:In recent years, there has been growing interest in understanding neural architectures’ ability to learn to execute discrete algorithms, a line of work often referred to as neural algorithmic reasoning. The goal is to integrate algorithmic reasoning capabilities into larger neural pipelines. Many such architectures are based on (message-passing) graph neural networks (MPNNs), owing to their permutation equivariance and ability to deal with sparsity and variable-sized inputs. However, existing work is either largely empirical and lacks formal guarantees or it focuses solely on expressivity, leaving open the question of when and how such architectures generalize beyond a finite training set. In this work, we propose a general theoretical framework that characterizes the sufficient conditions under which MPNNs can learn an algorithm from a training set of small instances and provably approximate its behavior on inputs of arbitrary size. Our framework applies to a broad class of algorithms, including single-source shortest paths, minimum spanning trees, and general dynamic programming problems, such as the 0-1 knapsack problem. In addition, we establish impossibility results for a wide range of algorithmic tasks, showing that standard MPNNs cannot learn them, and we derive more expressive MPNN-like architectures that overcome these limitations. Finally, we refine our analysis for the Bellman-Ford algorithm, yielding a substantially smaller required training set and significantly extending the recent work of Nerem et al. [2025] by allowing for a differentiable regularization loss. Empirical results largely support our theoretical findings.
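The link between Bellman-Ford and message passing can be made concrete with a short sketch: each round, every edge sends the message "sender's distance plus edge weight", and each node aggregates its incoming messages with `min`. This is the classical algorithm written in message-passing style, not a learned MPNN and not code from the paper; the function name and the edge-list input format are choices made for this example, and negative-cycle detection is left out.

```python
import math


def bellman_ford_message_passing(num_nodes, edges, source):
    """Single-source shortest paths as rounds of min-aggregated messages.

    edges: list of (u, v, weight) for a directed edge u -> v.
    Returns a list of distances from source (math.inf if unreachable).
    """
    dist = [math.inf] * num_nodes
    dist[source] = 0.0
    for _ in range(num_nodes - 1):
        # Message on edge (u, v): dist[u] + weight.
        # Aggregation at v: minimum over all incoming messages.
        incoming = [math.inf] * num_nodes
        for u, v, w in edges:
            incoming[v] = min(incoming[v], dist[u] + w)
        # Update: keep the better of the old state and the aggregate.
        new_dist = [min(d, m) for d, m in zip(dist, incoming)]
        if new_dist == dist:  # converged early
            break
        dist = new_dist
    return dist
```

An MPNN that is to imitate this procedure has to realise the same three pieces (message, aggregation, update) with learned functions, which is why min-style aggregation and the number of rounds matter when asking whether training on small graphs carries over to larger ones.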

[AI-4] How cyborg propaganda reshapes collective action

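Quick read: This paper examines the collapsing boundary between automated bot-driven influence operations and genuine grassroots activism. It describes an emerging form of political manipulation, "cyborg propaganda", that combines partisan coordination apps with artificial intelligence: large numbers of verified human users are paired with adaptive algorithmic automation in a closed loop, which lets campaigns evade liability frameworks designed for automated botnets. The proposed response is a research agenda for distinguishing organic from coordinated information diffusion, plus governance frameworks for the regulatory challenges of AI-assisted collective expression. Central to the analysis is how algorithmic direction may turn citizens from democratic participants into "cognitive proxies", reshaping power and discourse in the digital public square.

Link: https://arxiv.org/abs/2602.13088
Authors: Jonas R. Kunst,Kinga Bierwiaczonek,Meeyoung Cha,Omid V. Ebrahimi,Marc Fawcett-Atkinson,Asbjørn Følstad,Anton Gollwitzer,Nils Köbis,Gary Marcus,Jon Roozenbeek,Daniel Thilo Schroeder,Jay J. Van Bavel,Sander van der Linden,Rory White,Live Leonhardsen Wilhelmsen
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 9 pages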

Abstract:The distinction between genuine grassroots activism and automated influence operations is collapsing. While policy debates focus on bot farms, a distinct threat to democracy is emerging via partisan coordination apps and artificial intelligence, which we term ‘cyborg propaganda.’ This architecture combines large numbers of verified humans with adaptive algorithmic automation, enabling a closed-loop system. AI tools monitor online sentiment to optimize directives and generate personalized content for users to post online. Cyborg propaganda thereby exploits a critical legal shield: by relying on verified citizens to ratify and disseminate messages, these campaigns operate in a regulatory gray zone, evading liability frameworks designed for automated botnets. We explore the collective action paradox of this technology: does it democratize power by ‘unionizing’ influence (pooling the reach of dispersed citizens to overcome the algorithmic invisibility of isolated voices), or does it reduce citizens to ‘cognitive proxies’ of a central directive? We argue that cyborg propaganda fundamentally alters the digital public square, shifting political discourse from a democratic contest of individual ideas to a battle of algorithmic campaigns. We outline a research agenda to distinguish organic from coordinated information diffusion and propose governance frameworks to address the regulatory challenges of AI-assisted collective expression.

[AI-5] EXCODER: EXplainable Classification Of DiscretE time series Representations PAKDD2026

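Quick read: This paper addresses the lack of explainability of deep learning models for time series classification, where Explainable AI (XAI) methods are hampered by the high dimensionality and noise of raw time series. The key step is to transform time series into discrete latent representations, using Vector Quantized Variational Autoencoders (VQ-VAE) or Discrete Variational Autoencoders (DVAE), which reduces redundancy and focuses on the most informative patterns. The transformation preserves the characteristics needed for classification and makes XAI explanations more concise, structured, and faithful. The paper also proposes Similar Subsequence Accuracy (SSA), a metric that quantifies whether the salient subsequences identified by XAI methods align with the label distribution in the training data.

Link: https://arxiv.org/abs/2602.13087
Authors: Yannik Hahn,Antonin Königsfeld,Hasan Tercan,Tobias Meisen
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at PAKDD 2026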

Abstract:Deep learning has significantly improved time series classification, yet the lack of explainability in these models remains a major challenge. While Explainable AI (XAI) techniques aim to make model decisions more transparent, their effectiveness is often hindered by the high dimensionality and noise present in raw time series data. In this work, we investigate whether transforming time series into discrete latent representations, using methods such as Vector Quantized Variational Autoencoders (VQ-VAE) and Discrete Variational Autoencoders (DVAE), not only preserves but enhances explainability by reducing redundancy and focusing on the most informative patterns. We show that applying XAI methods to these compressed representations leads to concise and structured explanations that maintain faithfulness without sacrificing classification performance. Additionally, we propose Similar Subsequence Accuracy (SSA), a novel metric that quantitatively assesses the alignment between XAI-identified salient subsequences and the label distribution in the training data. SSA provides a systematic way to validate whether the features highlighted by XAI methods are truly representative of the learned classification patterns. Our findings demonstrate that discrete latent representations not only retain the essential characteristics needed for classification but also offer a pathway to more compact, interpretable, and computationally efficient explanations in time series analysis.

[AI-6] Bus-Conditioned Zero-Shot Trajectory Generation via Task Arithmetic

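Quick read: This paper addresses how to generate mobility trajectories that reflect a target city when no real trajectory data from that city is available; existing generation methods usually assume at least some real target-city data. It introduces a new problem setting, bus-conditioned zero-shot trajectory generation, and proposes MobTA, the first approach to bring task arithmetic into trajectory generation. MobTA models the parameter shift between bus-timetable-based trajectory generation and mobility trajectory generation in the source city, then transfers that shift to the target city through arithmetic on task vectors, producing target-city trajectories without any real target-city mobility data.

Link: https://arxiv.org/abs/2602.13071
Authors: Shuai Liu,Ning Cao,Yile Chen,Yue Jiang,Gao Cong
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: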

Abstract:Mobility trajectory data provide essential support for smart city applications. However, such data are often difficult to obtain. Meanwhile, most existing trajectory generation methods implicitly assume that at least a subset of real mobility data from the target city is available, which limits their applicability in data-inaccessible scenarios. In this work, we propose a new problem setting, called bus-conditioned zero-shot trajectory generation, where no mobility trajectories from a target city are accessible. The generation process relies solely on source city mobility data and publicly available bus timetables from both cities. Under this setting, we propose MobTA, the first approach to introduce task arithmetic into trajectory generation. MobTA models the parameter shift from bus-timetable-based trajectory generation to mobility trajectory generation in the source city, and applies this shift to the target city through arithmetic operations on task vectors. This enables trajectory generation that reflects target-city mobility patterns without requiring any real mobility data from it. Furthermore, we theoretically analyze MobTA’s stability across base and instruction-tuned LLMs. Extensive experiments show that MobTA significantly outperforms existing methods, and achieves performance close to models finetuned using target city mobility trajectories.
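The task-vector idea can be sketched on toy parameter dictionaries. The form below (target bus model plus a scaled difference between the source mobility model and the source bus model) is one plausible reading of the abstract; the function name, the `scale` coefficient, and the dictionary-of-arrays representation are assumptions for illustration, and MobTA's actual procedure on LLM weights may differ.

```python
import numpy as np


def apply_task_shift(source_bus, source_mobility, target_bus, scale=1.0):
    """Transfer a parameter shift learned in the source city to the target.

    Each argument maps parameter names to arrays of identical shapes:
      source_bus      - model tuned on source-city bus-timetable trajectories
      source_mobility - model tuned on source-city real mobility trajectories
      target_bus      - model tuned on target-city bus-timetable trajectories
    The task vector is (source_mobility - source_bus); `scale` is a
    hypothetical coefficient controlling how much of it is applied.
    """
    return {
        name: target_bus[name]
        + scale * (source_mobility[name] - source_bus[name])
        for name in target_bus
    }
```

With `scale=0.0` the function returns the target bus model unchanged, which gives a simple baseline for checking how much the transferred shift contributes.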

[AI-7] Diverging Flows: Detecting Extrapolations in Conditional Generation

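Quick read: This paper addresses "silent failures" of Flow Matching (FM) models in safety-critical settings caused by an extrapolation hazard: because of smoothness biases, flow models produce plausible outputs even for off-manifold conditions, so invalid predictions cannot be told apart from valid ones. The proposed Diverging Flows approach structurally enforces inefficient transport for off-manifold inputs, so a single model performs conditional generation and native extrapolation detection together, combining high-fidelity prediction with reliable detection.

Link: https://arxiv.org/abs/2602.13061
Authors: Constantinos Tsakonas,Serena Ivaldi,Jean-Baptiste Mouret
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 19 pages, 8 figures, 2 algorithms, 8 tables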

Abstract:The ability of Flow Matching (FM) to model complex conditional distributions has established it as the state-of-the-art for prediction tasks (e.g., robotics, weather forecasting). However, deployment in safety-critical settings is hindered by a critical extrapolation hazard: driven by smoothness biases, flow models yield plausible outputs even for off-manifold conditions, resulting in silent failures indistinguishable from valid predictions. In this work, we introduce Diverging Flows, a novel approach that enables a single model to simultaneously perform conditional generation and native extrapolation detection by structurally enforcing inefficient transport for off-manifold inputs. We evaluate our method on synthetic manifolds, cross-domain style transfer, and weather temperature forecasting, demonstrating that it achieves effective detection of extrapolations without compromising predictive fidelity or inference latency. These results establish Diverging Flows as a robust solution for trustworthy flow models, paving the way for reliable deployment in domains such as medicine, robotics, and climate science.

[AI-8] Geometric Manifold Rectification for Imbalanced Learning

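Quick read: This paper addresses imbalanced classification, especially on tabular data with noise and overlapping class boundaries, where traditional undersampling techniques such as Edited Nearest Neighbours (ENN) use symmetric cleaning rules and uniform voting, fail to capture local manifold structure, and often remove informative minority samples. The proposed GMR (Geometric Manifold Rectification) framework makes two contributions: (1) geometric confidence estimation using inverse-distance weighted kNN voting with an adaptive distance metric to capture local reliability; and (2) asymmetric cleaning that is strict on majority samples while conservatively protecting minority samples through a safe-guarding cap on minority removal. This targets the blurred decision boundary caused by topological intrusion of the majority class.

Link: https://arxiv.org/abs/2602.13045
Authors: Xubin Wang,Qing Li,Weijia Jia
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: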

Abstract:Imbalanced classification presents a formidable challenge in machine learning, particularly when tabular datasets are plagued by noise and overlapping class boundaries. From a geometric perspective, the core difficulty lies in the topological intrusion of the majority class into the minority manifold, which obscures the true decision boundary. Traditional undersampling techniques, such as Edited Nearest Neighbours (ENN), typically employ symmetric cleaning rules and uniform voting, failing to capture the local manifold structure and often inadvertently removing informative minority samples. In this paper, we propose GMR (Geometric Manifold Rectification), a novel framework designed to robustly handle imbalanced structured data by exploiting local geometric priors. GMR makes two contributions: (1) Geometric confidence estimation that uses inverse-distance weighted kNN voting with an adaptive distance metric to capture local reliability; and (2) asymmetric cleaning that is strict on majority samples while conservatively protecting minority samples via a safe-guarding cap on minority removal. Extensive experiments on multiple benchmark datasets show that GMR is competitive with strong sampling baselines.
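The two ingredients named in the abstract, inverse-distance weighted kNN voting and asymmetric cleaning with a cap on minority removal, can be sketched as follows. This is an illustration under stated assumptions rather than GMR itself: it uses plain Euclidean distance instead of the paper's adaptive metric, and the thresholds, the cap value, and the function names are hypothetical.

```python
import numpy as np


def knn_confidence(X, y, k=5, eps=1e-12):
    """Inverse-distance weighted share of same-label neighbours per sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    idx = np.argsort(dists, axis=1)[:, :k]     # indices of the k nearest
    weights = 1.0 / (np.take_along_axis(dists, idx, axis=1) + eps)
    same = y[idx] == y[:, None]
    return (weights * same).sum(axis=1) / weights.sum(axis=1)


def asymmetric_clean(X, y, minority_label, k=5,
                     majority_thresh=0.5, minority_thresh=0.2,
                     minority_cap=0.1):
    """Return a boolean keep-mask.

    Majority samples are dropped whenever confidence < majority_thresh.
    Minority samples are dropped only below minority_thresh, lowest
    confidence first, and never more than minority_cap of the class.
    All three numbers are placeholder values.
    """
    y = np.asarray(y)
    conf = knn_confidence(X, y, k)
    is_min = y == minority_label
    drop = (~is_min) & (conf < majority_thresh)
    candidates = np.flatnonzero(is_min & (conf < minority_thresh))
    max_drop = int(minority_cap * is_min.sum())
    candidates = candidates[np.argsort(conf[candidates])][:max_drop]
    drop[candidates] = True
    return ~drop
```

The pairwise distance matrix makes this sketch quadratic in the number of samples, which is fine for a demonstration but would call for a neighbour index on larger tabular datasets.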

[AI-9] Prior-Guided Symbolic Regression: Towards Scientific Consistency in Equation Discovery

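Quick read: This paper addresses the "Pseudo-Equation Trap" in Symbolic Regression (SR): equations that fit observations well yet remain inconsistent with fundamental scientific principles. It proposes PG-SR, a framework built on a three-stage pipeline (warm-up, evolution, refinement). A prior constraint checker encodes domain priors as executable constraint programs, and during the evolution stage a Prior Annealing Constrained Evaluation (PACE) mechanism progressively steers the search toward scientifically consistent regions. The authors prove that the method reduces the Rademacher complexity of the hypothesis space, which gives tighter generalization bounds and a guarantee against pseudo-equations. Experiments show it outperforms state-of-the-art baselines across diverse domains and is robust to varying prior quality, noisy data, and data scarcity.

Link: https://arxiv.org/abs/2602.13021
Authors: Jing Xiao,Xinhai Chen,Jiaming Peng,Qinglin Wang,Menghan Jia,Zhiquan Lai,Guangping Yu,Dongsheng Li,Tiejun Li,Jie Liu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: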

Abstract:Symbolic Regression (SR) aims to discover interpretable equations from observational data, with the potential to reveal underlying principles behind natural phenomena. However, existing approaches often fall into the Pseudo-Equation Trap: producing equations that fit observations well but remain inconsistent with fundamental scientific principles. A key reason is that these approaches are dominated by empirical risk minimization, lacking explicit constraints to ensure scientific consistency. To bridge this gap, we propose PG-SR, a prior-guided SR framework built upon a three-stage pipeline consisting of warm-up, evolution, and refinement. Throughout the pipeline, PG-SR introduces a prior constraint checker that explicitly encodes domain priors as executable constraint programs, and employs a Prior Annealing Constrained Evaluation (PACE) mechanism during the evolution stage to progressively steer discovery toward scientifically consistent regions. Theoretically, we prove that PG-SR reduces the Rademacher complexity of the hypothesis space, yielding tighter generalization bounds and establishing a guarantee against pseudo-equations. Experimentally, PG-SR outperforms state-of-the-art baselines across diverse domains, maintaining robustness to varying prior quality, noisy data, and data scarcity.

[AI-10] Synaptic Activation and Dual Liquid Dynamics for Interpretable Bio-Inspired Models

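Quick read: This paper seeks to clarify the structural and functional differences among bio-inspired models and to improve the interpretability and accuracy of recurrent neural network (RNN) policies. It presents a unified framework and shows that liquid-capacitance-extended models yield interpretable behavior even in dense, all-to-all RNN policies. Incorporating chemical synapses improves interpretability, and combining chemical synapses with synaptic activation yields the most accurate and interpretable RNN models. Evaluation uses a challenging lane-keeping control task with metrics including turn-weighted validation loss, correlation between neural activity and road trajectory, saliency maps of the networks' attention, and the robustness of those maps measured by the structural similarity index.

Link: https://arxiv.org/abs/2602.13017
Authors: Mónika Farsang,Radu Grosu
Affiliations: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: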

Abstract:In this paper, we present a unified framework for various bio-inspired models to better understand their structural and functional differences. We show that liquid-capacitance-extended models lead to interpretable behavior even in dense, all-to-all recurrent neural network (RNN) policies. We further demonstrate that incorporating chemical synapses improves interpretability and that combining chemical synapses with synaptic activation yields the most accurate and interpretable RNN models. To assess the accuracy and interpretability of these RNN policies, we consider the challenging lane-keeping control task and evaluate performance across multiple metrics, including turn-weighted validation loss, neural activity during driving, absolute correlation between neural activity and road trajectory, saliency maps of the networks’ attention, and the robustness of their saliency maps measured by the structural similarity index.

[AI-11] Learning Native Continuation for Action Chunking Flow Policies

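Quick read: This paper addresses discontinuities at chunk boundaries during action-chunked execution, and the spurious multimodal switching and trajectories that are not intrinsically smooth, both of which arise because Real-Time Chunking (RTC) operates outside the policy. It proposes Legato, a training-time continuation method. Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information; it reshapes the learned flow dynamics so that denoising stays consistent between training and inference under per-step guidance; and it uses randomized schedule conditions during training to support varying inference delays and controllable smoothness. Experiments show smoother trajectories and less spurious multimodal switching, leading to less hesitation and shorter task completion time.

Link: https://arxiv.org/abs/2602.12978
Authors: Yufeng Liu,Hang Yu,Juntu Zhao,Bocheng Li,Di Zhang,Mingzhu Li,Wenxuan Wu,Yingdong Hu,Junyuan Xie,Junliang Guo,Dequan Wang,Yang Gao
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL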

Abstract:Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule condition during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.

[AI-12] Drift-Aware Variational Autoencoder-based Anomaly Detection with Two-level Ensembling

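Quick read: This paper addresses anomaly detection in streaming data where labels are unavailable, particularly in nonstationary environments where concept drift degrades model performance over time. It proposes VAE++ESDD, which combines incremental learning, to adapt to changes in the data distribution, with two-level ensembling: an ensemble of Variational AutoEncoders (VAEs) for anomaly prediction and an ensemble of statistical concept drift detectors.

Link: https://arxiv.org/abs/2602.12976
Authors: Jin Li,Kleanthis Malialis,Christos G. Panayiotou,Marios M. Polycarpou
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: accepted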

Abstract:In today’s digital world, the generation of vast amounts of streaming data in various domains has become ubiquitous. However, many of these data are unlabeled, making it challenging to identify events, particularly anomalies. This task becomes even more formidable in nonstationary environments where model performance can deteriorate over time due to concept drift. To address these challenges, this paper presents a novel method, VAE++ESDD, which employs incremental learning and two-level ensembling: an ensemble of Variational AutoEncoders (VAEs) for anomaly prediction, along with an ensemble of concept drift detectors. Each drift detector utilizes a statistical-based concept drift mechanism. To evaluate the effectiveness of VAE++ESDD, we conduct a comprehensive experimental study using real-world and synthetic datasets characterized by severely or extremely low anomalous rates and various drift characteristics. Our study reveals that the proposed method significantly outperforms both strong baselines and state-of-the-art methods.

[AI-13] Extending confidence calibration to generalised measures of variation

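Quick read: This paper addresses a limitation of existing calibration metrics for machine learning classifiers: the Expected Calibration Error (ECE) assesses calibration only of the maximum probability, or confidence, and does not reflect the full probability distribution. It proposes the Variation Calibration Error (VCE), which extends the ECE approach from confidence calibration to the calibration of any measure of variation, such as the Shannon entropy. On synthetic predictions that are perfectly calibrated by design, the VCE approaches zero as the number of samples increases, whereas another entropy-based metric, the Uncertainty Calibration Error (UCE), does not.

Link: https://arxiv.org/abs/2602.12975
Authors: Andrew Thompson,Vivek Desai
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: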

Abstract:We propose the Variation Calibration Error (VCE) metric for assessing the calibration of machine learning classifiers. The metric can be viewed as an extension of the well-known Expected Calibration Error (ECE) which assesses the calibration of the maximum probability or confidence. Other ways of measuring the variation of a probability distribution exist which have the advantage of taking into account the full probability distribution, for example the Shannon entropy. We show how the ECE approach can be extended from assessing confidence calibration to assessing the calibration of any metric of variation. We present numerical examples upon synthetic predictions which are perfectly calibrated by design, demonstrating that, in this scenario, the VCE has the desired property of approaching zero as the number of data samples increases, in contrast to another entropy-based calibration metric (the UCE) which has been proposed in the literature.
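For readers unfamiliar with the starting point, here is a minimal sketch of the standard binned Expected Calibration Error that the paper extends, together with the Shannon entropy as one example of a variation measure that uses the full predicted distribution. The VCE itself is not implemented, since its definition is not given in the abstract; the equal-width binning scheme and function names are choices made for this example.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    probs:  array of shape (N, C) with predicted class probabilities.
    labels: array of shape (N,) with integer class labels.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    # Equal-width bins over (0, 1], right-inclusive.
    bin_ids = np.clip(np.ceil(confidence * n_bins).astype(int) - 1,
                      0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap
    return ece


def shannon_entropy(probs, eps=1e-12):
    """Per-sample Shannon entropy (in nats) of predicted distributions."""
    probs = np.asarray(probs, dtype=float)
    return -(probs * np.log(np.clip(probs, eps, 1.0))).sum(axis=1)
```

ECE looks only at `probs.max(axis=1)`, so two predictions with the same top probability but very different tails are treated identically; an entropy-style measure distinguishes them, which is the gap the abstract points to.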

[AI-14] Information-theoretic analysis of world models in optimal reward maximizers

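Quick read: This paper asks whether successful behaviour in AI requires an internal representation of the environment (an "implicit world model") and how much information an optimal policy carries about that environment. Using information theory, it quantifies the mutual information between the optimal policy and the environment. For a Controlled Markov Process (CMP) with $n$ states and $m$ actions, it proves that a policy that is optimal for any non-constant reward function conveys exactly $n \log m$ bits about the environment. The result holds for finite-horizon, infinite-horizon discounted, and time-averaged objectives, establishing a lower bound on the implicit world model needed for optimal decisions.

Link: https://arxiv.org/abs/2602.12963
Authors: Alfred Harwood,Jose Faustino,Alex Altair
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 28 pages, 0 figures. Not submitted to any conference yet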

Abstract:An important question in the field of AI is the extent to which successful behaviour requires an internal representation of the world. In this work, we quantify the amount of information an optimal policy provides about the underlying environment. We consider a Controlled Markov Process (CMP) with $n$ states and $m$ actions, assuming a uniform prior over the space of possible transition dynamics. We prove that observing a deterministic policy that is optimal for any non-constant reward function then conveys exactly $n \log m$ bits of information about the environment. Specifically, we show that the mutual information between the environment and the optimal policy is $n \log m$ bits. This bound holds across a broad class of objectives, including finite-horizon, infinite-horizon discounted, and time-averaged reward maximization. These findings provide a precise information-theoretic lower bound on the "implicit world model" necessary for optimality.
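A short counting argument shows why $n \log m$ is the natural scale for this result. The notation below ($E$ for the environment, $\pi$ for the observed policy, $\Pi_{\text{det}}$ for the set of deterministic policies) is introduced here for illustration; the argument only establishes $n \log m$ as a ceiling and is not the paper's proof that the ceiling is attained.

```latex
\begin{align*}
|\Pi_{\text{det}}| &= m^{n}
  && \text{(one of $m$ actions chosen in each of $n$ states)} \\
H(\pi) &\le \log |\Pi_{\text{det}}| = n \log m
  && \text{(entropy is maximised by the uniform distribution)} \\
I(E;\pi) &= H(\pi) - H(\pi \mid E) \le n \log m
  && \text{(since } H(\pi \mid E) \ge 0\text{)}
\end{align*}
% Worked example with base-2 logarithms: n = 4, m = 3 gives
% 4 \log_2 3 \approx 6.34 bits.
```

Read this way, the abstract's claim of exactly $n \log m$ bits says that observing the optimal policy is as informative about the environment as any deterministic policy over $n$ states and $m$ actions could possibly be.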

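[AI-15] TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design

[Quick Read]: This paper addresses two challenges in deploying Transformer-based Large Language Models (LLMs) on resource-constrained devices: first, LLM parameter counts are growing quickly while parameter reuse stays low, which makes on-device inference inefficient; second, existing NPU architectures struggle to deliver efficient and accurate end-to-end execution with limited compute and memory. The key to the solution is TriGen, a new NPU architecture built through software-hardware co-design. Its core contributions are: (1) low-precision computation using microscaling (MX), which opens additional optimization opportunities while preserving accuracy and resolves the errors that low precision introduces; (2) a fast and accurate Look-Up Table (LUT) that replaces dedicated hardware for nonlinear operations, so that linear and nonlinear operations are optimized jointly, improving performance and reducing hardware cost; (3) scheduling techniques that take practical hardware constraints into account and maximize compute utilization even when on-chip memory is limited. Experiments show that TriGen achieves an average 2.73x speedup and 52% less memory transfer than the baseline NPU design, with negligible accuracy loss.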

Link: https://arxiv.org/abs/2602.12962
Authors: Jonghun Lee,Junghoon Lee,Hyeonjin Kim,Seoho Jeon,Jisup Yoon,Hyunbin Park,Meejeong Park,Heonjae Ha
Affiliations: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 13 pages, 14 figures

Abstract:Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based large language models (LLMs) have become dominant, with rapidly increasing model sizes but low degree of parameter reuse compared to conventional CNNs, making end-to-end execution on resource-limited devices extremely challenging. To address these challenges, we propose TriGen, a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design. Firstly, TriGen adopts low-precision computation using microscaling (MX) to enable additional optimization opportunities while preserving accuracy, and resolves the issues that arise by employing such precision. Secondly, to jointly optimize both nonlinear and linear operations, TriGen eliminates the need for specialized hardware for essential nonlinear operations by using fast and accurate LUT, thereby maximizing performance gains and reducing hardware-cost in on-device environments, and finally, by taking practical hardware constraints into account, further employs scheduling techniques to maximize computational utilization even under limited on-chip memory capacity. We evaluate the performance of TriGen on various LLMs and show that TriGen achieves an average 2.73x performance speedup and 52% less memory transfer over the baseline NPU design with negligible accuracy loss.
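
To illustrate the general idea of replacing a dedicated nonlinear unit with a look-up table, here is a minimal sketch of a table with linear interpolation. TriGen's actual LUT design is not described in the abstract, so everything here is an assumption: the function names `build_lut` and `lut_eval`, the choice of sigmoid as the nonlinearity, the input range of [-8, 8], and the 256-entry table size.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_lut(fn, lo, hi, size):
    """Sample fn at `size` evenly spaced points on [lo, hi]."""
    step = (hi - lo) / (size - 1)
    table = [fn(lo + i * step) for i in range(size)]
    return table, step


def lut_eval(table, lo, hi, step, x):
    """Approximate fn(x) by linear interpolation between table entries.

    Inputs outside [lo, hi] are clamped to the nearest endpoint.
    """
    x = min(max(x, lo), hi)
    pos = (x - lo) / step
    i = min(int(pos), len(table) - 2)  # keep i + 1 inside the table
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])
```

A call such as `table, step = build_lut(sigmoid, -8.0, 8.0, 256)` followed by `lut_eval(table, -8.0, 8.0, step, 0.3)` trades one table read and one multiply-add for the exponential; the table size sets the balance between memory and approximation error.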

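[AI-16] BrowseComp-V3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents

[Quick Read]: This paper addresses the limitations of existing benchmarks for deep search by Multimodal Large Language Models (MLLMs) in open-world environments, namely limited task complexity, evidence accessibility, and evaluation granularity. The key to the solution is a new benchmark, BrowseComp-$V^3$, with 300 carefully curated, challenging questions across diverse domains. It emphasizes multi-level, cross-modal multi-hop reasoning and requires all supporting evidence to be publicly searchable, which ensures fairness and reproducibility. It also introduces an expert-validated, subgoal-driven process evaluation mechanism for fine-grained analysis of intermediate reasoning behavior, and pairs it with OmniSeeker, a unified framework that integrates multiple web search and visual perception tools, to characterize the boundaries of model capability systematically. Experiments show that even state-of-the-art models reach only 36% accuracy on the benchmark, exposing significant bottlenecks in multimodal information integration and fine-grained perception.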

Link: https://arxiv.org/abs/2602.12876
Authors: Huanyao Zhang,Jiepeng Zhou,Bo Li,Bowen Zhou,Yanzhe Dan,Haishan Lu,Zhiyong Cao,Jiaoyang Chen,Yuqian Han,Zinan Sheng,Zhengwei Tao,Hao Liang,Jialong Wu,Yang Shi,Yuanpeng He,Jiaye Lin,Qintong Zhang,Guochen Yan,Runhao Zhao,Zhengpin Li,Xiaohan Yu,Lang Mei,Chong Chen,Wentao Zhang,Bin Cui
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. However, existing benchmarks for multimodal browsing remain limited in task complexity, evidence accessibility, and evaluation granularity, hindering comprehensive and reproducible assessments of deep search capabilities. To address these limitations, we introduce BrowseComp-$V^3$, a novel benchmark consisting of 300 carefully curated and challenging questions spanning diverse domains. The benchmark emphasizes deep, multi-level, and cross-modal multi-hop reasoning, where critical evidence is interleaved across textual and visual modalities within and across web pages. All supporting evidence is strictly required to be publicly searchable, ensuring fairness and reproducibility. Beyond final-answer accuracy, we incorporate an expert-validated, subgoal-driven process evaluation mechanism that enables fine-grained analysis of intermediate reasoning behaviors and systematic characterization of capability boundaries. In addition, we propose OmniSeeker, a unified multimodal browsing agent framework integrating diverse web search and visual perception tools. Comprehensive experiments demonstrate that even state-of-the-art models achieve only 36% accuracy on our benchmark, revealing critical bottlenecks in multimodal information integration and fine-grained perception. Our results highlight a fundamental gap between current model capabilities and robust multimodal deep search in real-world settings.

[AI-17] A Microservice-Based Platform for Sustainable and Intelligent SLO Fulfilment and Service Management

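[Quick Read]: This paper addresses the difficulty of meeting Service Level Objectives (SLOs) such as performance and sustainability at the same time for applications built on the Microservices Architecture (MSA) and deployed to the Computing Continuum (CC), and in particular how services can be reconfigured at runtime to fulfill SLOs when CC providers must comply with the privacy requirements of application developers. The key to the solution is the Carbon-Aware SLO and Control plAtform (CASCA), an open-source MSA-based platform. Through a modular, extensible, and privacy-preserving design, it lets CC providers adjust microservice configurations dynamically without exposing developers' sensitive information, improving SLO fulfillment while maintaining privacy.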

Link: https://arxiv.org/abs/2602.12875
Authors: Juan Luis Herrera,Daniel Wang,Schahram Dustdar
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:The Microservices Architecture (MSA) design pattern has become a staple for modern applications, allowing functionalities to be divided across fine-grained microservices, fostering reusability, distribution, and interoperability. As MSA-based applications are deployed to the Computing Continuum (CC), meeting their Service Level Objectives (SLOs) becomes a challenge. Trading off performance and sustainability SLOs is especially challenging. This challenge can be addressed with intelligent decision systems, able to reconfigure the services during runtime to meet the SLOs. However, developing these agents while adhering to the MSA pattern is complex, especially because CC providers, who have key know-how and information to fulfill these SLOs, must comply with the privacy requirements of application developers. This work presents the Carbon-Aware SLO and Control plAtform (CASCA), an open-source MSA-based platform that allows CC providers to reconfigure services and fulfill their SLOs while maintaining the privacy of developers. CASCA is architected to be highly reusable, distributable, and easy to use, extend, and modify. CASCA has been evaluated in a real CC testbed for a media streaming service, where decision systems implemented in Bash, Rust, and Python successfully reconfigured the service, unaffected by upholding privacy.

[AI-18] WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning

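[Quick Read]: This paper addresses the low search efficiency of current web-agent-based deep research systems on complex information-seeking tasks. Existing open-source models often rely on long tool-call trajectories, cyclic reasoning, and exploration of unproductive branches, which wastes resources. The key to the solution is the WebClipper framework, which models the agent's search process as a graph and casts trajectory optimization as a minimum-necessary Directed Acyclic Graph (DAG) mining problem, pruning redundant steps while keeping the core reasoning path. Continued training on the pruned trajectories lets the agent evolve toward more efficient search patterns, reducing tool-call rounds by about 20% while improving accuracy.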

Link: https://arxiv.org/abs/2602.12852
Authors: Junjie Wang,Zequn Xie,Dan Yang,Jie Feng,Yue Shen,Duolin Sun,Meixiu Long,Yihan Jiao,Zhehao Tan,Jian Wang,Peng Wei,Jinjie Gu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Work in Progress

Abstract:Deep Research systems based on web agents have shown strong potential in solving complex information-seeking tasks, yet their search efficiency remains underexplored. We observe that many state-of-the-art open-source web agents rely on long tool-call trajectories with cyclic reasoning loops and exploration of unproductive branches. To address this, we propose WebClipper, a framework that compresses web agent trajectories via graph-based pruning. Concretely, we model the agent’s search process as a state graph and cast trajectory optimization as a minimum-necessary Directed Acyclic Graph (DAG) mining problem, yielding pruned trajectories that preserve essential reasoning while eliminating redundant steps. Continued training on these refined trajectories enables the agent to evolve toward more efficient search patterns and reduces tool-call rounds by about 20% while improving accuracy. Furthermore, we introduce a new metric called F-AE Score to measure the model’s overall performance in balancing accuracy and efficiency. Experiments demonstrate that WebClipper compresses tool-call rounds under excellent performance, providing practical insight into balancing effectiveness and efficiency in web agent design.
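
The intuition behind pruning a trajectory on a state graph can be shown with a much simpler stand-in than the paper's minimum-necessary DAG mining, whose details the abstract does not give. The sketch below treats each step of a trajectory as a hashable state label, builds the graph of observed transitions, and keeps only a shortest path from the first state to the last, which drops loops and dead-end branches. The function name `prune_trajectory` and the breadth-first shortest-path criterion are assumptions for illustration only.

```python
from collections import deque


def prune_trajectory(states):
    """Return a shortest start-to-end path through the observed transitions.

    states: the visited state labels in order, e.g. ["q", "a", "b", "a", "ans"].
    This is a simplified stand-in, not WebClipper's actual algorithm.
    """
    if not states:
        return []

    # Build the directed graph of transitions seen in the trajectory.
    graph = {}
    for src, dst in zip(states, states[1:]):
        neighbours = graph.setdefault(src, [])
        if dst not in neighbours:
            neighbours.append(dst)

    start, goal = states[0], states[-1]
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)

    # Walk back from the goal; it is always reachable because the
    # original trajectory is itself a path from start to goal.
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]
```

A real system would need a notion of which steps are necessary for the final answer (for example, which retrieved evidence the answer depends on), which a pure shortest-path criterion does not capture.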

[AI-19] Chimera: Neuro-Symbolic Attention Primitives for Trustworthy Dataplane Intelligence

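[Quick Read]: This paper addresses the hardware resource limits and unpredictable behavior that arise when deploying expressive neural network models on programmable dataplanes, with the goal of line-rate, low-latency traffic analysis. The key to the solution is Chimera, a neuro-symbolic framework that maps attention-oriented neural computations and symbolic constraints onto dataplane primitives. It combines a kernelized, linearized attention approximation, a two-layer key-selection hierarchy, and a cascade fusion mechanism to enforce hard symbolic guarantees while preserving neural expressivity. Together with a hardware-aware mapping protocol and a two-timescale update scheme, it achieves stable line-rate inference within the resource budget of commodity programmable switches.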

Link: https://arxiv.org/abs/2602.12851
Authors: Rong Fu,Wenxin Zhang,Xiaowen Ma,Kun Liu,Wangyu Wu,Ziyu Kong,Jia Yee Tan,Tailong Luo,Xianda Li,Zeli Su,Youjin Wang,Yongtai Liu,Simon Fong
Affiliations: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 23 pages, 11 figures

Abstract:Deploying expressive learning models directly on programmable dataplanes promises line-rate, low-latency traffic analysis but remains hindered by strict hardware constraints and the need for predictable, auditable behavior. Chimera introduces a principled framework that maps attention-oriented neural computations and symbolic constraints onto dataplane primitives, enabling trustworthy inference within the match-action pipeline. Chimera combines a kernelized, linearized attention approximation with a two-layer key-selection hierarchy and a cascade fusion mechanism that enforces hard symbolic guarantees while preserving neural expressivity. The design includes a hardware-aware mapping protocol and a two-timescale update scheme that together permit stable, line-rate operation under realistic dataplane budgets. The paper presents the Chimera architecture, a hardware mapping strategy, and empirical evidence showing that neuro-symbolic attention primitives can achieve high-fidelity inference within the resource envelope of commodity programmable switches.

[AI-20] Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models

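[Quick Read]: This paper addresses the "Normalization Squeeze" that arises when aligning large language models with Reinforcement Learning with Verifiable Rewards (RLVR): the interplay between mode-seeking policy gradients and finite sampling systematically suppresses low-probability but correct reasoning paths, driving them to statistical extinction. The key to the solution is Amortized Reasoning Tree Search (ARTS), which decouples generation from verification. It introduces a Flow Matching objective that uses the verifier to estimate the conservation of probability flow, enabling robust navigation through sparse, high-entropy search spaces where traditional discriminative objectives fail. Without modifying the parameters of the generative model, the method substantially improves long-tail reasoning, reaching 74.6% BoN@16 accuracy on the MATH-500 benchmark and recovering meaningful performance on the long-tail subset where RLVR fails.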

Link: https://arxiv.org/abs/2602.12846
Authors: Zesheng Hong,Jiadong Yu,Hui Pan
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has established itself as the dominant paradigm for instilling rigorous reasoning capabilities in Large Language Models. While effective at amplifying dominant behaviors, we identify a critical pathology in this alignment process: the systematic suppression of valid but rare (low-likelihood under the base model distribution) reasoning paths. We theoretically characterize this phenomenon as a “Normalization Squeeze,” where the interplay between mode-seeking policy gradients and finite sampling acts as a high-pass likelihood filter, driving the probability of rare correct traces to statistical extinction. To counteract this collapse without discarding the base model’s latent diversity, we propose Amortized Reasoning Tree Search (ARTS). Unlike standard approaches that force internalization via parameter updates, ARTS prioritizes deliberation by decoupling generation from verification. We introduce a Flow Matching objective that repurposes the verifier to estimate the conservation of probability flow, enabling robust navigation through sparse, high-entropy search spaces where traditional discriminative objectives fail. Extensive experiments on the MATH-500 benchmark demonstrate that ARTS achieves a performance of 74.6% (BoN@16), effectively matching fully fine-tuned policies (74.7%) without modifying the generative backbone. Crucially, on the long-tail subset where coupled RL optimization collapses to 0% pass@k, ARTS uniquely recovers significant performance, suggesting that disentangling verification from generation offers a more robust pathway for solving complex reasoning tasks.

[AI-21] FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching

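[Quick Read]: This paper addresses the optimization difficulty that iterative generative policies (such as diffusion models and flow matching) face under Maximum Entropy Reinforcement Learning in continuous control, where action log-densities are not directly accessible. The key to the solution is a likelihood-free framework, Field Least-Energy Actor-Critic (FLAC), which regulates policy stochasticity by penalizing the kinetic energy of the velocity field. The core insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform), so that the maximum-entropy principle emerges naturally as staying close to the high-entropy reference while optimizing return, with no need to compute action densities explicitly. In this framework, kinetic energy serves as a physically grounded measure of deviation from the reference: minimizing path-space energy bounds the deviation of the terminal action distribution. From this the paper derives an energy-regularized policy iteration scheme and an off-policy implementation that tunes the kinetic energy automatically through a Lagrangian dual mechanism.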

Link: https://arxiv.org/abs/2602.12829
Authors: Lei Lv,Yunfei Li,Yu Luo,Fuchun Sun,Xiao Ma
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field. Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to a high-entropy reference while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution. Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy via a Lagrangian dual mechanism. Empirically, FLAC achieves superior or comparable performance on high-dimensional benchmarks relative to strong baselines, while avoiding explicit density estimation.

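[AI-22] GRAIL: Geometry-Aware Retrieval-Augmented Inference with LLMs over Hyperbolic Representations of Patient Trajectories

[Quick Read]: This paper addresses the challenges of predicting future clinical events from longitudinal Electronic Health Records (EHRs): sparse multi-type clinical events, the complex hierarchical structure of medical vocabularies, and the tendency of Large Language Models (LLMs) to hallucinate when reasoning over long structured histories. The key to the solution is the GRAIL framework, which builds a unified clinical graph that combines deterministic coding-system hierarchies with data-driven temporal associations and embeds it in hyperbolic space to capture hierarchical relations. Each visit is represented as a probabilistic Central Event that denoises sparse observations. At inference time, structure-aware retrieval produces a set of future events consistent with hierarchical and temporal progression, and an LLM can optionally be used as a constrained reranker to refine the ranking, yielding more accurate and semantically consistent multi-type next-visit predictions.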

Link: https://arxiv.org/abs/2602.12828
Authors: Zhan Qu,Michael Färber
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Predicting future clinical events from longitudinal electronic health records (EHRs) is challenging due to sparse multi-type clinical events, hierarchical medical vocabularies, and the tendency of large language models (LLMs) to hallucinate when reasoning over long structured histories. We study next-visit event prediction, which aims to forecast a patient’s upcoming clinical events based on prior visits. We propose GRAIL, a framework that models longitudinal EHRs using structured geometric representations and structure-aware retrieval. GRAIL constructs a unified clinical graph by combining deterministic coding-system hierarchies with data-driven temporal associations across event types, embeds this graph in hyperbolic space, and summarizes each visit as a probabilistic Central Event that denoises sparse observations. At inference time, GRAIL retrieves a structured set of clinically plausible future events aligned with hierarchical and temporal progression, and optionally refines their ranking using an LLM as a constrained inference-time reranker. Experiments on MIMIC-IV show that GRAIL consistently improves multi-type next-visit prediction and yields more hierarchy-consistent forecasts.
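
Hyperbolic embeddings suit hierarchies because distances grow rapidly toward the boundary of the space, leaving room for tree-like branching. The abstract does not say which model of hyperbolic space GRAIL uses; the sketch below assumes the Poincaré ball, where the distance between points $u$ and $v$ is $d(u, v) = \operatorname{arcosh}\!\left(1 + \frac{2\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right)$. The function name is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    u, v: equal-length sequences of floats with Euclidean norm < 1.
    The choice of the Poincare model is an assumption, not taken from the paper.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)
```

In this model a point at Euclidean radius 0.5 lies at hyperbolic distance ln 3 from the origin, and the distance diverges as the radius approaches 1, which is what lets deep levels of a coding hierarchy spread out near the boundary.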

[AI-23] Can Neural Networks Provide Latent Embeddings for Telemetry-Aware Greedy Routing?

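[Quick Read]: This paper addresses the slow response of traditional routing algorithms to traffic surges and their inability to model the complex dependency between network state and routing. Existing machine-learning-based routing methods can capture the nonlinear relationship between network state and routing decisions, but they sacrifice explainability because they rely on black-box neural modules. The key to the solution is the Placer algorithm, which uses Message Passing Networks to map network states to latent node embeddings. These embeddings enable fast greedy next-hop routing without directly solving the all-pairs shortest paths problem, and they can be visualized to show how specific network events shape routing decisions.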

Link: https://arxiv.org/abs/2602.12798
Authors: Andreas Boltres,Niklas Freymuth,Gerhard Neumann
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Telemetry-Aware routing promises to increase efficacy and responsiveness to traffic surges in computer networks. Recent research leverages Machine Learning to deal with the complex dependency between network state and routing, but sacrifices explainability of routing decisions due to the black-box nature of the proposed neural routing modules. We propose Placer, a novel algorithm using Message Passing Networks to transform network states into latent node embeddings. These embeddings facilitate quick greedy next-hop routing without directly solving the all-pairs shortest paths problem, and let us visualize how certain network events shape routing decisions.

[AI-24] ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training

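[Quick Read]: This paper addresses a core challenge in post-training large foundation Vision-Language-Action (VLA) systems with online Reinforcement Learning (RL) in real-world settings: how to evaluate the behavior quality of the current high-capacity policy. Prior methods adopt conservative on-policy estimation for stability, which avoids direct evaluation of the current policy and limits learning efficiency. The key to the solution is ALOE (Action-Level Off-Policy Evaluation), whose central idea is chunking-based Temporal-Difference (TD) bootstrapping: value estimation shifts from predicting the final task outcome to evaluating individual action chunks, which gives precise credit assignment under sparse rewards and supports stable policy improvement. Experiments show that ALOE improves learning efficiency on several real-world manipulation tasks without compromising execution speed, indicating that off-policy RL can be applied reliably to VLA post-training.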

Link: https://arxiv.org/abs/2602.12691
Authors: Rushuai Yang,Hecheng Wang,Chiming Liu,Xiaohan Yan,Yunlong Wang,Xuan Du,Shuoyu Yue,Yongcheng Liu,Chuheng Zhang,Lizhe Qi,Yi Chen,Wei Shan,Maoqing Yao
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides learning signals to guide VLA learning from experience. In practice, the value function is estimated from trajectory fragments collected from different data sources, including historical policies and intermittent human interventions. Estimating the value function of current behavior quality from the mixture data is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids direct evaluation of the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences instead of predicting final task outcomes. This design improves effective credit assignment to critical action chunks under sparse rewards and supports stable policy improvement. We evaluate our method on three real-world manipulation tasks, including smartphone packing as a high-precision task, laundry folding as a long-horizon deformable-object task, and bimanual pick-and-place involving multi-object perception. Across all tasks, ALOE improves learning efficiency without compromising execution speed, showing that off-policy RL can be reintroduced in a reliable manner for real-world VLA post-training. Videos and additional materials are available at our project website.
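
The abstract does not spell out ALOE's update rule. As a rough illustration of what bootstrapping over an action chunk can look like, the sketch below computes a standard multi-step TD target for a chunk of $H$ actions: $y = \sum_{k=0}^{H-1} \gamma^k r_{t+k} + \gamma^H V(s_{t+H})$. Treating the chunk as the unit of bootstrapping in this way, as well as the function and parameter names, are assumptions and may differ from the paper's formulation.

```python
def chunk_td_target(rewards, bootstrap_value, gamma=0.99, terminal=False):
    """Multi-step TD target over one action chunk (illustrative sketch).

    rewards:         per-step rewards collected while executing the chunk.
    bootstrap_value: value estimate for the state reached after the chunk.
    terminal:        if True, the episode ended inside the chunk, so no
                     bootstrap term is added.
    """
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward
    if not terminal:
        target += (gamma ** len(rewards)) * bootstrap_value
    return target
```

Because the target depends on the value of the next state rather than on the final task outcome, reward information can propagate back chunk by chunk, which is the kind of credit assignment under sparse rewards the abstract refers to.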

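[AI-25] Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty

[Quick Read]: This paper addresses teacher overconfidence in knowledge distillation caused by conventional cross-entropy training: the teacher's output distribution becomes sharply peaked and fails to model uncertainty properly, so the student cannot learn the inter-class relationships and uncertainty structure carried by "dark knowledge". The problem is most pronounced in high-cardinality tasks, where subtle differences among classes matter most for guiding a compact student. The key to the solution is Calibrated Uncertainty Distillation (CUD), which revisits distillation from a distributional perspective: it encourages the teacher to express uncertainty where it is informative and has the student learn from calibrated targets rather than copying the teacher's overconfidence. CUD shapes the teacher's predictive distribution directly before transfer, balancing accuracy and calibration, so that the student benefits from confident signals on easy cases and structured uncertainty on hard ones, improving robustness and reliability under distribution shift and on long-tail inputs.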

Link: https://arxiv.org/abs/2602.12687
Authors: Jeonghyun Kim,SooKyung Kim,Richeng Xuan,Hyunsoo Cho
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The core of knowledge distillation lies in transferring the teacher's rich 'dark knowledge', the subtle probabilistic patterns that reveal how classes are related and the distribution of uncertainties. While this idea is well established, teachers trained with conventional cross-entropy often fail to preserve such signals. Their distributions collapse into sharp, overconfident peaks that appear decisive but are in fact brittle, offering little beyond the hard label or subtly hindering representation-level transfer. This overconfidence is especially problematic in high-cardinality tasks, where the nuances among many plausible classes matter most for guiding a compact student. Moreover, such brittle targets reduce robustness under distribution shift, leaving students vulnerable to miscalibration in real-world conditions. To address this limitation, we revisit distillation from a distributional perspective and propose Calibrated Uncertainty Distillation (CUD), a framework designed to make dark knowledge more faithfully accessible. Instead of uncritically adopting the teacher's overconfidence, CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from targets that are calibrated rather than sharpened certainty. By directly shaping the teacher's predictive distribution before transfer, our approach balances accuracy and calibration, allowing students to benefit from both confident signals on easy cases and structured uncertainty on hard ones. Across diverse benchmarks, CUD yields students that are not only more accurate, but also more calibrated under shift and more reliable on ambiguous, long-tail inputs.

[AI-26] SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

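[Quick Read]: This paper addresses the lack of a standard way to evaluate Skills for Large Language Model (LLM) agents at inference time, that is, how to quantify how much Skills actually improve agent performance. The key to the solution is SkillsBench, a benchmark of 86 tasks across 11 domains, each paired with curated Skills and a deterministic verifier and evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. Results show that curated Skills raise the average pass rate by 16.2 percentage points, but the effect varies widely by domain (from +4.5pp for Software Engineering to +51.9pp for Healthcare), and some tasks get worse when Skills are added. Self-generated Skills provide no benefit on average, indicating that models cannot reliably author the procedural knowledge they benefit from, which highlights the importance of well-structured and precisely targeted Skills.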

Link: https://arxiv.org/abs/2602.12670
Authors: Xiangyi Li,Wenbo Chen,Yimin Liu,Shenghan Zheng,Xiaokun Chen,Yifeng He,Yubo Li,Bingran You,Haotian Shen,Jiankai Sun,Shuyi Wang,Qunhong Zeng,Di Wang,Xuandong Zhao,Yuanli Wang,Roey Ben Chaim,Zonglin Di,Yipeng Gao,Junwei He,Yizhuo He,Liqiang Jing,Luyang Kong,Xin Lan,Jiachen Li,Songlin Li,Yijiang Li,Yueqian Lin,Xinyi Liu,Xuanqing Liu,Haoran Lyu,Ze Ma,Bowei Wang,Runhui Wang,Tianyu Wang,Wengao Ye,Yue Zhang,Hanwen Xing,Yiqi Xue,Steven Dillmann,Han-chung Lee
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points(pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2–3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.

[AI-27] Evaluating Robustness of Reasoning Models on Parameterized Logical Problems

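[Quick Read]: This paper addresses the limited understanding of how sensitive Large Language Models (LLMs) are to structure in logical reasoning tasks. Standard SAT-style benchmarks conflate surface features (length, wording, clause order) with the structural factors that determine satisfiability, which makes it hard to assess a model's real reasoning ability. The key to the solution is a diagnostic 2-SAT benchmark built from parameterized generators of structured 2-CNF formulas. By controlling interpretable structural dimensions (contradiction-cycle size, fraction of free variables, planted backbones, late bridge clauses, symmetry and redundancy), it isolates distinct reasoning competencies and failure modes. Experiments show that even with surface statistics held fixed, LLM performance changes sharply under targeted structural interventions, revealing a brittleness to structure in current models and offering a new way to evaluate and improve LLM logical reasoning systematically.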

Link: https://arxiv.org/abs/2602.12665
Authors: Naïm Es-sebbani,Esteban Marquer,Yakoub Salhi,Zied Bouraoui
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2–CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes: (i) contradiction-cycle UNSAT cores with controllable size and imbalance, (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity, (iii) planted backbones that modulate propagation, (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and (v) symmetry/duplication variants that test abstraction under renaming and redundant structure. We evaluate LLM-based reasoners on decision accuracy and assignment validity, and quantify robustness under semantics-preserving perturbations such as clause reordering, filler clauses, and variable renaming. Across models, we observe sharp performance transitions under targeted structural interventions even when surface statistics are held fixed, revealing brittleness regimes that are invisible to aggregate SAT accuracy.
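
The abstract notes that 2-SAT satisfiability is characterized by the implication graph. The classical statement is that a 2-CNF formula is unsatisfiable exactly when some variable and its negation lie in the same strongly connected component. The sketch below is a textbook solver based on Kosaraju's algorithm, included to make that characterization concrete; it is not the paper's benchmark generator. Literals are encoded as non-zero integers (`3` for x3, `-3` for its negation), and the function name is hypothetical.

```python
def solve_2sat(n_vars, clauses):
    """Return a satisfying assignment (list of bools) or None if UNSAT.

    clauses: iterable of (a, b) literal pairs meaning (a OR b).
    """
    def node(lit):
        # Variable v maps to nodes 2(v-1) (positive) and 2(v-1)+1 (negative).
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    n = 2 * n_vars
    graph = [[] for _ in range(n)]
    rgraph = [[] for _ in range(n)]
    for a, b in clauses:
        # (a OR b) gives implications NOT a -> b and NOT b -> a.
        for src, dst in ((node(a) ^ 1, node(b)), (node(b) ^ 1, node(a))):
            graph[src].append(dst)
            rgraph[dst].append(src)

    # Pass 1: iterative DFS on the graph, recording finish order.
    order, seen = [], [False] * n
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, 0)]
        while stack:
            v, i = stack.pop()
            if i < len(graph[v]):
                stack.append((v, i + 1))
                w = graph[v][i]
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, 0))
            else:
                order.append(v)

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    # Components are numbered in topological order of the implication graph.
    comp = [-1] * n
    label = 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start] = label
        stack = [start]
        while stack:
            v = stack.pop()
            for w in rgraph[v]:
                if comp[w] == -1:
                    comp[w] = label
                    stack.append(w)
        label += 1

    assignment = []
    for v in range(n_vars):
        if comp[2 * v] == comp[2 * v + 1]:
            return None  # x and NOT x imply each other: contradiction cycle
        # Set x true when x comes later than NOT x in topological order.
        assignment.append(comp[2 * v] > comp[2 * v + 1])
    return assignment
```

The four clauses over two variables (x1 or x2), (not x1 or x2), (x1 or not x2), (not x1 or not x2) form a small contradiction cycle of the kind the benchmark's first generator family controls, and the solver returns `None` for them.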

[AI-28] PMG: Parameterized Motion Generator for Human-like Locomotion Control

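[Quick Read]: This paper addresses practical challenges in humanoid locomotion control: existing whole-body reference-guided methods are hard to adapt to higher-level command interfaces and diverse task contexts because they depend on large, high-quality datasets, are brittle across speed and pose regimes, and are sensitive to robot-specific calibration. The key to the solution is the Parameterized Motion Generator (PMG), which is grounded in an analysis of human motion structure and synthesizes reference trajectories from only a compact set of parameterized motion data, combined with high-dimensional control commands for real-time motion generation. Together with an imitation-learning pipeline and an optimization-based sim-to-real motor parameter identification module, it produces natural, human-like, and verifiable locomotion, offering a practical and experimentally validated path to humanoid control.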

Link: https://arxiv.org/abs/2602.12656
Authors: Chenxi Han,Yuheng Min,Zihao Huang,Ao Hong,Hang Liu,Yi Cheng,Houde Liu
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 2026 IEEE International Conference on Robotics and Automation

Abstract:Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain. In particular, while low-level motion tracking and trajectory-following controllers are mature, whole-body reference-guided methods are difficult to adapt to higher-level command interfaces and diverse task contexts: they require large, high-quality datasets, are brittle across speed and pose regimes, and are sensitive to robot-specific calibration. To address these limitations, we propose the Parameterized Motion Generator (PMG), a real-time motion generator grounded in an analysis of human motion structure that synthesizes reference trajectories using only a compact set of parameterized motion data together with high-dimensional control commands. Combined with an imitation-learning pipeline and an optimization-based sim-to-real motor parameter identification module, we validate the complete approach on our humanoid prototype ZERITH Z1 and show that, within a single integrated system, PMG produces natural, human-like locomotion, responds precisely to high-dimensional control inputs, including VR-based teleoperation, and enables efficient, verifiable sim-to-real transfer. Together, these results establish a practical, experimentally validated pathway toward natural and deployable humanoid control.

[AI-29] Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics AAMAS2026

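[Quick Read]: This paper addresses the tension in reinforcement learning between model-free methods, which are efficient but weaker in representation, and model-based approaches, which offer stronger representations but incur planning overhead. The key to the solution is Unified Latent Dynamics (ULD), which embeds state-action pairs into a latent space where the true value function is approximately linear, combining the efficiency of model-free methods with the representational strengths of model-based approaches without adding planning overhead. The method uses synchronized updates of the encoder, value, and policy networks, together with auxiliary losses for short-horizon predictive dynamics and reward-scale normalization, to learn stably across domains. On multiple benchmarks (Gym, DeepMind Control, and Atari) it matches or exceeds specialized model-free and general model-based baselines, indicating that value-aligned latent representations alone can deliver the adaptability and sample efficiency traditionally attributed to full model-based planning.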

Link: https://arxiv.org/abs/2602.12643
Authors: Jashaswimalya Acharjee,Balaraman Ravindran
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 13 pages. Accepted at AAMAS 2026

Abstract:We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm that unifies the efficiency of model-free methods with the representational strengths of model-based approaches, without incurring planning overhead. By embedding state-action pairs into a latent space in which the true value function is approximately linear, our method supports a single set of hyperparameters across diverse domains – from continuous control with low-dimensional and pixel inputs to high-dimensional Atari games. We prove that, under mild conditions, the fixed point of our embedding-based temporal-difference updates coincides with that of a corresponding linear model-based value expansion, and we derive explicit error bounds relating embedding fidelity to value approximation quality. In practice, ULD employs synchronized updates of encoder, value, and policy networks, auxiliary losses for short-horizon predictive dynamics, and reward-scale normalization to ensure stable learning under sparse rewards. Evaluated on 80 environments spanning Gym locomotion, DeepMind Control (proprioceptive and visual), and Atari, our approach matches or exceeds the performance of specialized model-free and general model-based baselines – achieving cross-domain competence with minimal tuning and a fraction of the parameter footprint. These results indicate that value-aligned latent representations alone can deliver the adaptability and sample efficiency traditionally attributed to full model-based planning.

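[AI-30] TensorCommitments: A Lightweight Verifiable Inference for Language Models

[Quick Read]: This paper addresses the lack of verifiability when Large Language Models (LLMs) run inference in the cloud: how can the service (the prover) convince the client (the verifier) that the inference was not maliciously tampered with, without the client rerunning the whole model? Existing cryptographic approaches are too slow at LLM scale, while non-cryptographic ones require a strong verifier GPU, which limits their practicality. The paper proposes TensorCommitments (TCs), a tensor-native proof-of-inference scheme. Its core idea is to bind the LLM inference to a commitment, an irreversible tag that breaks under tampering, and to organize these commitments in multivariate Terkle Trees for efficient and secure verification. Experiments show that for LLaMA2, TCs add only 0.97% prover time and 0.12% verifier time, while improving robustness to tailored LLM attacks by up to 48% over the best prior work that requires a verifier GPU.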

Link: https://arxiv.org/abs/2602.12630
Authors: Oguzhan Baser,Elahe Sadeghi,Eric Wang,David Ribeiro Alves,Sam Kazemian,Hong Kang,Sandeep P. Chinchali,Sriram Vishwanath
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 23 pages, 8 figures, under review

Abstract:Most large language models (LLMs) run on external clouds: users send a prompt, pay for inference, and must trust that the remote GPU executes the LLM without any adversarial tampering. We critically ask how to achieve verifiable LLM inference, where a prover (the service) must convince a verifier (the client) that an inference was run correctly without rerunning the LLM. Existing cryptographic works are too slow at the LLM scale, while non-cryptographic ones require a strong verifier GPU. We propose TensorCommitments (TCs), a tensor-native proof-of-inference scheme. TC binds the LLM inference to a commitment, an irreversible tag that breaks under tampering, organized in our multivariate Terkle Trees. For LLaMA2, TC adds only 0.97% prover and 0.12% verifier time over inference while improving robustness to tailored LLM attacks by up to 48% over the best prior work requiring a verifier GPU.

[AI-31] GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

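【Quick read】: This paper addresses a mismatch in existing reinforcement learning (RL) methods for geolocation: their reliance on AI-generated chain-of-thought (CoT) data and training strategies conflicts with the characteristics of geographic tasks. The key to the solution is GeoSeek, a dataset of CoT data annotated by geographic experts and professional players, together with two new rewards: a geo-similarity reward and a consistency reward assessed by a consistency agent. These steer the model toward correct answers from a geographic perspective while preserving the integrity and consistency of its reasoning.

Link: https://arxiv.org/abs/2602.12617
Authors: Modi Jin, Yiming Zhang, Boyuan Sun, Dingwen Zhang, MingMing Cheng, Qibin Hou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: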

Abstract:This paper presents GeoAgent, a model capable of reasoning in close alignment with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because of their reliance on AI-generated chain-of-thought (CoT) data and training strategies, which conflict with geographic characteristics. To address these issues, we first introduce GeoSeek, a new geolocation dataset comprising CoT data annotated by geographic experts and professional players. We further thoroughly explore the inherent characteristics of geographic tasks and propose a geo-similarity reward and a consistency reward assessed by a consistency agent to assist training. This encourages the model to converge towards correct answers from a geographic perspective while ensuring the integrity and consistency of its reasoning process. Experimental results show that GeoAgent outperforms existing methods and a series of general VLLMs across multiple granularities, while generating reasoning that closely aligns with humans.
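The abstract does not define the geo-similarity reward, so the following is only a sketch of one plausible distance-based ingredient: a reward that decays with great-circle distance between the predicted and true coordinates. The exponential form, the `scale_km` value, and the function names are assumptions, not the paper's formulation.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against rounding slightly above 1.0 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_reward(pred, truth, scale_km=750.0):
    """Hypothetical reward in (0, 1]: 1.0 for an exact hit, decaying with distance."""
    return math.exp(-haversine_km(*pred, *truth) / scale_km)


paris = (48.8566, 2.3522)
lyon = (45.7640, 4.8357)
tokyo = (35.6762, 139.6503)
print(distance_reward(paris, paris))                                  # 1.0
print(distance_reward(paris, lyon) > distance_reward(paris, tokyo))   # True
```

A smooth reward of this kind gives partial credit to a guess in the right region, unlike an exact-match reward that treats a neighbouring city and another continent as equally wrong.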

[AI-32] Power Interpretable Causal ODE Networks: A Unified Model for Explainable Anomaly Detection and Root Cause Analysis in Power Systems

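【Quick read】: This paper addresses the lack of interpretability of time-series anomaly detection models in complex networks such as power systems: existing machine learning methods typically operate as black boxes that output only a binary anomaly decision and cannot identify the anomaly type, locate its origin, or describe its shape. The key to the solution is a unified, causality-informed architecture, Power Interpretable Causality Ordinary Differential Equation (PICODE) Networks. By combining causal reasoning with ordinary differential equation modeling, it achieves competitive anomaly detection while jointly providing root cause localization, anomaly type classification, and anomaly shape characterization, improving interpretability and reducing reliance on labeled data and external causal graphs.

Link: https://arxiv.org/abs/2602.12592
Authors: Yue Sun, Likai Wang, Rick S. Blum, Parv Venkitasubramaniam
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: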

Abstract:Anomaly detection and root cause analysis (RCA) are critical for ensuring the safety and resilience of cyber-physical systems such as power grids. However, existing machine learning models for time series anomaly detection often operate as black boxes, offering only binary outputs with no explanation of, for example, the anomaly's type or origin. To address this challenge, we propose Power Interpretable Causality Ordinary Differential Equation (PICODE) Networks, a unified, causality-informed architecture that jointly performs anomaly detection and explains why an input is flagged as anomalous, including root cause localization, anomaly type classification, and anomaly shape characterization. Experimental results in power systems demonstrate that PICODE achieves competitive detection performance while offering improved interpretability and reduced reliance on labeled data or external causal graphs. We provide theoretical results demonstrating the alignment between the shape of anomaly functions and the changes in the weights of the extracted causal graphs.

[AI-33] Can I Have Your Order? Monte-Carlo Tree Search for Slot Filling Ordering in Diffusion Language Models

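【Quick read】: This paper addresses the high output variance and unstable performance of plan-and-infill decoding in Masked Diffusion Models (MDMs), which is highly sensitive to the order in which slots are filled. The key to the solution is to model slot selection as a decision process and optimize the infilling order with Monte Carlo Tree Search (MCTS): look-ahead simulations evaluate partial completions before commitment, systematically exploring the combinatorial space of generation orders. Experiments show an average improvement of 3.2% over autoregressive baselines and 8.0% over baseline plan-and-infill, with notable gains of 19.5% on MBPP and 4.9% on MATH500. The analysis further indicates that although McDiffuSE predominantly follows sequential ordering, non-sequential generation is essential for maximizing performance, and that larger exploration constants, rather than more simulations, are needed to overcome model confidence biases and discover effective orderings.

Link: https://arxiv.org/abs/2602.12586
Authors: Joshua Ong Jun Leang, Yu Zhao, Mihaela Cătălina Stoian, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, preprint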

Abstract:While plan-and-infill decoding in Masked Diffusion Models (MDMs) shows promise for mathematical and code reasoning, performance remains highly sensitive to slot infilling order, often yielding substantial output variance. We introduce McDiffuSE, a framework that formulates slot selection as decision making and optimises infilling orders through Monte Carlo Tree Search (MCTS). McDiffuSE uses look-ahead simulations to evaluate partial completions before commitment, systematically exploring the combinatorial space of generation orders. Experiments show an average improvement of 3.2% over autoregressive baselines and 8.0% over baseline plan-and-infill, with notable gains of 19.5% on MBPP and 4.9% on MATH500. Our analysis reveals that while McDiffuSE predominantly follows sequential ordering, incorporating non-sequential generation is essential for maximising performance. We observe that larger exploration constants, rather than increased simulations, are necessary to overcome model confidence biases and discover effective orderings. These findings establish MCTS-based planning as an effective approach for enhancing generation quality in MDMs.
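The role of the exploration constant can be illustrated with the standard UCT selection rule. This is a sketch only: the abstract does not give McDiffuSE's exact selection formula, and the slot statistics and names below (`uct_score`, `select_slot`, `children`) are hypothetical.

```python
import math


def uct_score(total_value, visits, parent_visits, c):
    """Standard UCT: mean value plus an exploration bonus scaled by the constant c."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_slot(children, c):
    """Pick the candidate slot with the highest UCT score."""
    parent_visits = sum(ch["visits"] for ch in children.values())
    return max(
        children,
        key=lambda k: uct_score(children[k]["value"], children[k]["visits"], parent_visits, c),
    )


# Hypothetical statistics: slot 0 is the confident, sequential choice.
children = {
    0: {"value": 8.0, "visits": 10},  # mean 0.80, well explored
    3: {"value": 1.4, "visits": 2},   # mean 0.70, barely explored
}
print(select_slot(children, c=0.1))  # 0
print(select_slot(children, c=1.0))  # 3
```

With a small constant the search keeps returning to the slot it already trusts; a larger constant makes the under-visited, non-sequential slot worth trying, which is consistent with the abstract's observation about exploration constants.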

[AI-34] VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

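【Quick read】: This paper addresses the scalability limits that reliance on external verifiers imposes on reinforcement learning, and in particular the training collapse that standard policy optimization methods such as Group Relative Policy Optimization suffer in verifier-free settings because of destructive gradient variance. The key to the solution is Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that builds a curriculum from the model's own confidence. By prioritizing high-confidence samples, it controls action-level and problem-level variance, yielding stable training without an external verifier and consistently outperforming verifier-independent baselines across six benchmarks.

Link: https://arxiv.org/abs/2602.12579
Authors: Xin-Qiang Cai, Masashi Sugiyama
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: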

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing the reasoning of Large Language Models (LLMs), yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model’s intrinsic confidence to construct a curriculum independent from external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-independent baselines across six challenging benchmarks with/without verifiers.
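One simple way to picture a confidence-ordered curriculum is to train on only the most confident samples early and widen the retained share over time. The sketch below does exactly that; the linear schedule, the `confidence` field, and the function names are assumptions for illustration, since the abstract does not specify how VI-CuRL measures confidence or schedules the curriculum.

```python
def keep_fraction(step, total_steps, start=0.2, end=1.0):
    """Linearly grow the retained fraction from `start` to `end` (assumed schedule)."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * t


def select_confident(samples, step, total_steps):
    """Keep the most confident samples; the retained share widens with `step`."""
    k = max(1, round(keep_fraction(step, total_steps) * len(samples)))
    ranked = sorted(samples, key=lambda s: s["confidence"], reverse=True)
    return ranked[:k]


# Hypothetical batch; "confidence" could be, e.g., a mean token probability.
batch = [
    {"id": "a", "confidence": 0.91},
    {"id": "b", "confidence": 0.35},
    {"id": "c", "confidence": 0.78},
    {"id": "d", "confidence": 0.52},
    {"id": "e", "confidence": 0.12},
]
print([s["id"] for s in select_confident(batch, step=0, total_steps=100)])   # ['a']
print([s["id"] for s in select_confident(batch, step=50, total_steps=100)])  # ['a', 'c', 'd']
```

Restricting early updates to confident samples trades some bias for lower variance; relaxing the restriction later reduces that bias, which matches the bias-variance framing in the abstract.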

[AI-35] Monte Carlo Tree Search with Reasoning Path Refinement for Small Language Models in Conversational Text-to-NoSQL

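【Quick read】: This paper addresses the lack of multi-turn dialogue context modeling in natural-language-to-NoSQL query generation: existing methods focus on single-turn interactions and cannot handle the continuing conversations between users and systems found in real-world use. The key to the solution is Stage-MCTS, a framework that formulates query generation as a search problem. It uses Monte Carlo Tree Search (MCTS) guided by a rule-based reward to produce stepwise reasoning data, then applies progressive supervised fine-tuning (SFT) and self-training so that small language models (SLMs) acquire NoSQL-specific reasoning capabilities, improving query accuracy in multi-turn settings while keeping the model lightweight.

Link: https://arxiv.org/abs/2602.12574
Authors: Xubang Xiong, Raymond Chi-Wing Wong, Yuanfeng Song
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: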

Abstract:NoSQL databases have been widely adopted in big data analytics, geospatial applications, and healthcare services, due to their flexibility and scalability. However, querying NoSQL databases requires specialized technical expertise, creating a high barrier for users. While recent studies have explored the text-to-NoSQL problem, they primarily focus on single-turn interactions, ignoring the conversational nature of real-world queries. To bridge this gap, we introduce the Conversational Text-to-NoSQL task, which generates NoSQL queries given a natural language question, a NoSQL database, and the dialogue history. To address this task, we propose Stage-MCTS, a framework that endows small language models (SLMs) with NoSQL-specific reasoning capabilities by formulating query generation as a search problem. The framework employs Monte Carlo Tree Search (MCTS) guided by a rule-based reward to produce stepwise reasoning data, followed by progressive supervised fine-tuning (SFT) and self-training strategies. We further construct CoNoSQL, a cross-domain dataset with over 2,000 dialogues and 150 databases, to support evaluation. Experiments demonstrate that our approach outperforms state-of-the-art large reasoning models, improving execution value match (EVM) accuracy by up to 7.93%.

[AI-36] To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

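【Quick read】: This paper addresses how to combine Reinforcement Learning with Verifiable Rewards (RLVR) across multiple domains (such as math, coding, science, and instruction following) to obtain a general expert-level model. The key to the solution is a systematic comparison of the two mainstream training paradigms: mixed multi-task RLVR, and separate RLVR followed by model merging. Extensive qualitative and quantitative experiments show that RLVR across domains exhibits few mutual interferences and that reasoning-intensive domains are mutually synergistic. The paper further analyzes the mechanisms behind these gains from the perspectives of weight space geometry, model prediction behavior, and information constraints. The project is named M2RL.

Link: https://arxiv.org/abs/2602.12566
Authors: Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li, Yehui Tang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: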

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). RLVR can achieve expert-level performance in specific domains such as coding or math. When a general multi-domain expert-level model is required, the interplay of RLVR across different domains must be considered carefully. Current state-of-the-art models mainly employ two training paradigms for multi-domain RLVR: mixed multi-task RLVR and separate RLVR followed by model merging. However, most prior work does not provide a detailed comparison and analysis of these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, and instruction following) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits few mutual interferences, and that reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of mutual gains from the perspectives of weight space geometry, model prediction behavior, and information constraints. The project is named M2RL, which stands for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and the homepage is at this https URL

[AI-37] SD-MoE: Spectral Decomposition for Effective Expert Specialization

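【Quick read】: This paper addresses the failure of expert specialization in Mixture-of-Experts (MoE) architectures, which limits effective capacity and model performance: some experts become functionally similar while others degenerate into de facto shared experts, undermining the efficient scaling that conditional computation is meant to provide. The paper traces this to three causes: (1) experts share highly overlapping dominant spectral components in their parameters; (2) dominant gradient subspaces are strongly aligned across experts, driven by ubiquitous low-rank structure in human corpora; and (3) gating mechanisms preferentially route inputs along these dominant directions, further suppressing specialization. The proposed Spectral-Decoupled MoE (SD-MoE) decomposes both parameters and gradients in the spectral space to break this redundancy and promote effective specialization. It adds minimal computation and integrates into existing MoE architectures, including Qwen and DeepSeek.

Link: https://arxiv.org/abs/2602.12556
Authors: Ruijun Huang, Fang Dong, Xin Zhang, Hengjie Cao, Zhendong Huang, Anrui Chen, Jixian Zhou, Mengyi Chen, Yifeng Yang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Chun Zhang, Li Shang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: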

Abstract:Mixture-of-Experts (MoE) architectures scale Large Language Models via expert specialization induced by conditional computation. In practice, however, expert specialization often fails: some experts become functionally similar, while others function as de facto shared experts, limiting the effective capacity and model performance. In this work, we analyze parameter and gradient spaces from a spectral perspective and uncover that (1) experts share highly overlapping dominant spectral components in their parameters, (2) dominant gradient subspaces are strongly aligned across experts, driven by ubiquitous low-rank structure in human corpora, and (3) gating mechanisms preferentially route inputs along these dominant directions, further limiting specialization. To address this, we propose Spectral-Decoupled MoE (SD-MoE), which decomposes both parameters and gradients in the spectral space. SD-MoE improves performance across downstream tasks, enables effective expert specialization, incurs minimal additional computation, and can be seamlessly integrated into a wide range of existing MoE architectures, including Qwen and DeepSeek.

[AI-38] Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

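【Quick read】: This paper addresses the challenge of generating high-quality training data for web agents, and in particular trajectory evaluation: quantifying how much progress a trajectory made toward task completion. Conventional approaches make little use of partially successful trajectories, so much of the collected data goes unused. The key to the solution is a constraint-based evaluation framework that assesses progress at a fine granularity, making partially successful trajectories usable and significantly expanding the pool of training data. On the newly proposed BookingArena benchmark, the distilled student model outperforms open-source approaches on complex booking tasks and matches or exceeds commercial systems while being significantly smaller.

Link: https://arxiv.org/abs/2602.12544
Authors: Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Creighton Glasscock, Honglak Lee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: COLM 2025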

Abstract:We present a scalable pipeline for automatically generating high-quality training data for web agents. In particular, a major challenge in identifying high-quality training instances is trajectory evaluation - quantifying how much progress was made towards task completion. We introduce a novel constraint-based evaluation framework that provides fine-grained assessment of progress towards task completion. This enables us to leverage partially successful trajectories, which significantly expands the amount of usable training data. We evaluate our method on a new benchmark we propose called BookingArena, which consists of complex booking tasks across 20 popular websites, and demonstrate that our distilled student model outperforms open-source approaches and matches or exceeds commercial systems, while being a significantly smaller model. Our work addresses the challenge of efficiently creating diverse, realistic web interaction datasets and provides a systematic evaluation methodology for complex structured web tasks.
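A constraint-based progress score can be sketched as the fraction of task constraints that a trajectory's final state satisfies. The example below is illustrative only: the constraint names, the state dictionary, and `progress_score` are hypothetical, and the abstract does not say how the paper's framework weights or checks constraints.

```python
def progress_score(constraints, final_state):
    """Return (fraction of constraints satisfied, names of the satisfied ones)."""
    if not constraints:
        raise ValueError("task must define at least one constraint")
    satisfied = [name for name, check in constraints.items() if check(final_state)]
    return len(satisfied) / len(constraints), satisfied


# Hypothetical booking task: each constraint is a predicate over the final page state.
constraints = {
    "city": lambda s: s.get("city") == "Seattle",
    "guests": lambda s: s.get("guests") == 2,
    "check_in": lambda s: s.get("check_in") == "2026-03-01",
    "max_price": lambda s: s.get("price", float("inf")) <= 200,
}
state = {"city": "Seattle", "guests": 2, "check_in": "2026-03-02", "price": 180}
score, met = progress_score(constraints, state)
print(score, met)  # 0.75 ['city', 'guests', 'max_price']
```

A binary success check would discard this trajectory because the check-in date is wrong; a graded score keeps it available as a partially successful example.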

[AI-39] Exploring Accurate and Transparent Domain Adaptation in Predictive Healthcare via Concept-Grounded Orthogonal Inference

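【Quick read】: This paper addresses the performance degradation that deep learning models for clinical event prediction on electronic health records (EHR) suffer under shifts in data distribution. Existing domain adaptation (DA) methods can mitigate such shifts, but their black-box nature falls short of the transparency, interpretability, and safety that clinical practice requires. The key to the solution is ExtraCare, which decomposes patient representations into invariant and covariant components, supervises both, and enforces their orthogonality during training. This preserves label information while explicitly exposing domain-specific variation, giving more accurate predictions than most feature alignment methods. It also provides human-understandable explanations by mapping sparse latent dimensions to medical concepts and quantifying their contributions via targeted ablations.

Link: https://arxiv.org/abs/2602.12542
Authors: Pengfei Hu, Chang Lu, Feifan Liu, Yue Ning
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: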

Abstract:Deep learning models for clinical event prediction on electronic health records (EHR) often suffer performance degradation when deployed under different data distributions. While domain adaptation (DA) methods can mitigate such shifts, their “black-box” nature prevents widespread adoption in clinical practice where transparency is essential for trust and safety. We propose ExtraCare to decompose patient representations into invariant and covariant components. By supervising these two components and enforcing their orthogonality during training, our model preserves label information while exposing domain-specific variation at the same time for more accurate predictions than most feature alignment models. More importantly, it offers human-understandable explanations by mapping sparse latent dimensions to medical concepts and quantifying their contributions via targeted ablations. ExtraCare is evaluated on two real-world EHR datasets across multiple domain partition settings, demonstrating superior performance along with enhanced transparency, as evidenced by its accurate predictions and explanations from extensive case studies.

[AI-40] Favia: Forensic Agent for Vulnerability-fix Identification and Analysis

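【Quick read】: This paper addresses the difficulty of identifying vulnerability-fixing commits in large software repositories. In realistic settings, where candidate commits are already security-relevant and highly similar, existing automated approaches, both traditional machine learning and large language model (LLM)-based, suffer from poor precision-recall trade-offs. The key to the solution is Favia, an agent-based forensic framework that combines scalable candidate ranking with deep, iterative semantic reasoning. An efficient ranking stage first narrows the search space; a ReAct-based LLM agent, given the pre-commit repository as its environment along with specialized tools, then localizes vulnerable components, navigates the codebase, and establishes causal alignment between code changes and vulnerability root causes, enabling robust identification of indirect, multi-file, and non-trivial fixes.

Link: https://arxiv.org/abs/2602.12500
Authors: André Storhaug, Jiamou Sun, Jingyue Li
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 44 pages, 12 figures, 5 tables, 3 listings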

Abstract:Identifying vulnerability-fixing commits corresponding to disclosed CVEs is essential for secure software maintenance but remains challenging at scale, as large repositories contain millions of commits of which only a small fraction address security issues. Existing automated approaches, including traditional machine learning techniques and recent large language model (LLM)-based methods, often suffer from poor precision-recall trade-offs. We find that these approaches are frequently evaluated on randomly sampled commits, which substantially underestimates real-world difficulty, where candidate commits are already security-relevant and highly similar. We propose Favia, a forensic, agent-based framework for vulnerability-fix identification that combines scalable candidate ranking with deep and iterative semantic reasoning. Favia first employs an efficient ranking stage to narrow the search space of commits. Each commit is then rigorously evaluated using a ReAct-based LLM agent. Given a pre-commit repository as its environment, along with specialized tools, the agent tries to localize vulnerable components, navigates the codebase, and establishes causal alignment between code changes and vulnerability root causes. This evidence-driven process enables robust identification of indirect, multi-file, and non-trivial fixes that elude single-pass or similarity-based methods. We evaluate Favia on CVEVC, a large-scale dataset we constructed that comprises over 8 million commits from 3,708 real-world repositories, and show that it consistently outperforms state-of-the-art traditional and LLM-based baselines under realistic candidate selection, achieving the strongest precision-recall trade-offs and highest F1-scores.

[AI-41] Designing RNAs with Language Models

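【Quick read】: This paper addresses RNA design, the task of finding a nucleotide sequence that folds into a target secondary structure, which is computationally challenging because the sequence space grows exponentially and there are many competing folds. Traditional approaches treat it as an optimization problem and rely on per-instance heuristics or constraint-based search, which limits efficiency and generalization. The key to the solution is to reframe RNA design as conditional sequence generation and to introduce a reusable neural approximator, an autoregressive language model (LM), that maps target structures directly to sequences. The model is first trained with supervision on random-induced structure-sequence pairs and then optimized for end-to-end metrics with reinforcement learning (RL); a method for selecting a small subset for RL greatly improves RL efficiency and quality. Across four datasets, the approach outperforms state-of-the-art systems on key metrics such as Boltzmann probability while being 1.7x faster, establishing conditional LM generation as a scalable, task-agnostic alternative for RNA design.

Link: https://arxiv.org/abs/2602.12470
Authors: Milan Gautam, Ning Dai, Tianshuo Zhou, Bowen Xie, David Mathews, Liang Huang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: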

Abstract:RNA design, the task of finding a sequence that folds into a target secondary structure, has broad biological and biomedical impact but remains computationally challenging due to the exponentially large sequence space and exponentially many competing folds. Traditional approaches treat it as an optimization problem, relying on per-instance heuristics or constraint-based search. We instead reframe RNA design as conditional sequence generation and introduce a reusable neural approximator, instantiated as an autoregressive language model (LM), that maps target structures directly to sequences. We first train our model in a supervised setting on random-induced structure-sequence pairs, and then use reinforcement learning (RL) to optimize end-to-end metrics. We also propose methods to select a small subset for RL that greatly improves RL efficiency and quality. Across four datasets, our approach outperforms state-of-the-art systems on key metrics such as Boltzmann probability while being 1.7x faster, establishing conditional LM generation as a scalable, task-agnostic alternative to per-instance optimization for RNA design. Our code and data are available at this https URL.
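For readers unfamiliar with the input to this task, a target secondary structure is commonly written in dot-bracket notation, where matching parentheses mark paired positions. The sketch below parses that notation and checks whether a candidate sequence places allowed base pairs at the paired positions. This is only a necessary condition, not a folding prediction, and the function names are chosen here for illustration rather than taken from the paper's code.

```python
def parse_pairs(structure):
    """Return the sorted list of base pairs (i, j) encoded by a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Watson-Crick pairs plus the G-U wobble pair.
ALLOWED = {"AU", "UA", "GC", "CG", "GU", "UG"}


def is_compatible(sequence, structure):
    """Check that every paired position holds an allowed base pair."""
    if len(sequence) != len(structure):
        return False
    return all(sequence[i] + sequence[j] in ALLOWED for i, j in parse_pairs(structure))


target = "((...))"
print(parse_pairs(target))               # [(0, 6), (1, 5)]
print(is_compatible("GCAAAGC", target))  # True
print(is_compatible("GCAAAGA", target))  # False
```

Many sequences pass this check for a given structure, and only some of them actually fold into it, which is why metrics such as Boltzmann probability require a folding model rather than a compatibility test.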

[AI-42] Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models AAMAS2026

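【Quick read】: This paper addresses the lack of provable safety guarantees when reinforcement learning (RL) is used in safety-critical applications, particularly for unknown, non-linear continuous dynamical systems. The key to the solution is a recovery-based shielding framework that pairs the RL agent with a backup policy (shield) and uses Gaussian process (GP) uncertainty quantification to predict potential violations of safety constraints, recovering to safe trajectories only when necessary. Experience gathered by the shielded agent is used to build the GP models, and the policy is optimized via internal model-based sampling, enabling unrestricted exploration and sample-efficient learning without compromising safety.

Link: https://arxiv.org/abs/2602.12444
Authors: Alexander W. Goodall, Francesco Belardinelli
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at AAMAS 2026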

Abstract:Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical applications. In this paper, we introduce a novel recovery-based shielding framework that enables safe RL with a provable safety lower bound for unknown and non-linear continuous dynamical systems. The proposed approach integrates a backup policy (shield) with the RL agent, leveraging Gaussian process (GP) based uncertainty quantification to predict potential violations of safety constraints, dynamically recovering to safe trajectories only when necessary. Experience gathered by the ‘shielded’ agent is used to construct the GP models, with policy optimization via internal model-based sampling - enabling unrestricted exploration and sample efficient learning, without compromising safety. Empirically our approach demonstrates strong performance and strict safety-compliance on a suite of continuous control environments.

[AI-43] CacheMind: From Miss Rates to Why – Natural-Language Trace-Grounded Reasoning for Cache Replacement ASPLOS2026

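【Quick read】: This paper addresses the difficulty of optimizing cache replacement policies in CPU microarchitecture. Conventional approaches rely on hand-crafted heuristics, which limits cache performance, and analyzing cache data requires manually filtering millions of trace entries, which is slow and non-interactive. The key to the solution is CacheMind, a conversational tool based on Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) that lets architects ask natural-language questions (for example, "Why is the memory access associated with PC X causing more evictions?") and receive trace-grounded, human-readable answers linked to program semantics. Its main contributions are: 1) two retrievers, SIEVE and RANGER, which reach 60% and 90% retrieval success respectively, compared with 10% for LlamaIndex; 2) CacheMindBench, the first verified benchmark suite for LLM-based reasoning about cache replacement, on which CacheMind reaches up to 89.33% accuracy on unseen trace-grounded questions; 3) concrete use cases, such as a bypassing use case that improves cache hit rate by 7.66% and speedup by 2.04%, showing practical value on non-trivial queries.

Link: https://arxiv.org/abs/2602.12422
Authors: Kaushal Mhapsekar, Azam Ghanbari, Bita Aslrousta, Samira Mirbagher-Ajorpaz
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 13 figures, ASPLOS 2026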

Abstract:Cache replacement remains a challenging problem in CPU microarchitecture, often addressed using hand-crafted heuristics, limiting cache performance. Cache data analysis requires parsing millions of trace entries with manual filtering, making the process slow and non-interactive. To address this, we introduce CacheMind, a conversational tool that uses Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to enable semantic reasoning over cache traces. For the first time, architects can ask natural language questions like, “Why is the memory access associated with PC X causing more evictions?”, and receive trace-grounded, human-readable answers linked to program semantics. To evaluate CacheMind, we present CacheMindBench, the first verified benchmark suite for LLM-based reasoning for the cache replacement problem. Using the SIEVE retriever, CacheMind achieves 66.67% on 75 unseen trace-grounded questions and 84.80% on 25 unseen policy-specific reasoning tasks; with RANGER, it achieves 89.33% and 64.80% on the same evaluations. Additionally, with RANGER, CacheMind achieves 100% accuracy on 4 out of 6 categories in the trace-grounded tier of CacheMindBench. Compared to LlamaIndex (10% retrieval success), SIEVE achieves 60% and RANGER achieves 90%, demonstrating that existing RAG systems are insufficient for precise, trace-grounded microarchitectural reasoning. We provide four concrete, actionable insights derived using CacheMind: among them, a bypassing use case improves cache hit rate by 7.66% and speedup by 2.04%, a software fix use case gives a speedup of 76%, and a Mockingjay replacement policy use case gives a speedup of 0.7%, showing the utility of CacheMind on non-trivial queries that require a natural-language interface.

[AI-44] Intent-Driven Smart Manufacturing Integrating Knowledge Graphs and Large Language Models

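【Quick read】: This paper addresses the complexity of human-machine interaction in smart manufacturing environments: how to translate high-level human intents efficiently and accurately into machine-executable actions. The core challenge is mapping natural language onto the semantics of manufacturing processes precisely enough to support intent-driven interaction in Manufacturing-as-a-Service (MaaS) ecosystems. The key to the solution is a unified framework that integrates instruction-tuned Large Language Models (LLMs) with ontology-aligned Knowledge Graphs (KGs) grounded in the ISA-95 standard. Natural-language intents are translated into structured JSON requirement models, which are semantically mapped to a Neo4j knowledge graph to ensure operational alignment with manufacturing processes, resources, and constraints. Experiments show 89.33% exact match accuracy and 97.27% overall accuracy, significantly outperforming zero-shot and 3-shot baselines and laying a foundation for scalable, explainable, and adaptive human-machine collaboration.

Link: https://arxiv.org/abs/2602.12419
Authors: Takoua Jradi, John Violos, Dimitrios Spatharakis, Lydia Mavraidi, Ioannis Dimolitsas, Aris Leivadeas, Symeon Papavassiliou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: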

Abstract:The increasing complexity of smart manufacturing environments demands interfaces that can translate high-level human intents into machine-executable actions. This paper presents a unified framework that integrates instruction-tuned Large Language Models (LLMs) with ontology-aligned Knowledge Graphs (KGs) to enable intent-driven interaction in Manufacturing-as-a-Service (MaaS) ecosystems. We fine-tune Mistral-7B-Instruct-V02 on a domain-specific dataset, enabling the translation of natural language intents into structured JSON requirement models. These models are semantically mapped to a Neo4j-based knowledge graph grounded in the ISA-95 standard, ensuring operational alignment with manufacturing processes, resources, and constraints. Our experimental results demonstrate significant performance gains over zero-shot and 3-shot baselines, achieving 89.33% exact match accuracy and 97.27% overall accuracy. This work lays the foundation for scalable, explainable, and adaptive human-machine

[AI-45] Soft Contamination Means Benchmarks Test Shallow Generalization

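【Quick read】: This paper addresses "soft contamination" of Large Language Model (LLM) training data: training sets that contain samples semantically identical or near-identical to benchmark test data but not matching as strings. Such contamination biases benchmark performance as an estimate of out-of-distribution (OOD) generalization. The key to the approach is to identify and quantify these semantic duplicates by embedding the training corpus and detecting samples semantically similar to benchmark test data. Experiments show that soft contamination is widespread, that including semantic duplicates in training improves benchmark performance, and that finetuning on duplicates also improves performance on truly held-out datapoints from the same benchmark, indicating that recent benchmark gains partly reflect the implicit absorption of test data rather than genuine capability improvements alone.

Link: https://arxiv.org/abs/2602.12413
Authors: Ari Spiesberger, Juan J. Vazquez, Nicky Pochinkov, Tomáš Gavenčiak, Peli Grietzer, Gavin Leech, Nandi Schoots
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: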

Abstract:If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching, which fails to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the Olmo3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when finetuning on duplicates of benchmark datapoints, performance also improves on truly-held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data and effective test data in growing training corpora.
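The blind spot of n-gram decontamination is easy to reproduce. The sketch below computes word-trigram overlap between a benchmark item and two training candidates; the example sentences, the whitespace tokenizer, and the function names are made up for illustration and are not the filters or data used in the paper.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate, benchmark, n=3):
    """Fraction of the benchmark item's n-grams that also occur in the candidate."""
    bench = ngrams(benchmark, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(candidate, n)) / len(bench)


benchmark = "Find the sum of all even numbers between 1 and 100"
verbatim = "Q: find the sum of all even numbers between 1 and 100"
paraphrase = "Add up every even integer from 1 through 100"

print(ngram_overlap(verbatim, benchmark))    # 1.0
print(ngram_overlap(paraphrase, benchmark))  # 0.0
```

The paraphrase asks the same question yet shares no trigram with the benchmark item, so a string-level filter would keep it in the training set. Detecting it requires comparing meaning, for example through embedding similarity as the abstract describes.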

[AI-46] AstRL: Analog and Mixed-Signal Circuit Synthesis with Deep Reinforcement Learning

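【Quick read】: This paper addresses the limited automation of analog and mixed-signal (AMS) integrated circuit design, in particular the lack of a generalized optimization method applicable across diverse circuit design spaces as design complexity continues to rise. The key to the solution is to cast circuit design as a graph generation problem and to propose AstRL, an AMS synthesis method driven by deep reinforcement learning (DRL). Using a policy-gradient approach within a simulator-embedded environment, it generates circuits directly optimized for user-specified targets, with behavioral cloning and discriminator-based similarity rewards providing expert-aligned, generalized generation. AstRL operates at the level of individual transistors, allowing expressive, fine-grained topology generation, while strong inductive biases in the action space and environment keep generation structurally consistent and valid. On three realistic design tasks it substantially improves on state-of-the-art baselines, with 100% of generated designs structurally correct and over 90% demonstrating the required functionality.

Link: https://arxiv.org/abs/2602.12402
Authors: Felicia B. Guo, Ken T. Ho, Andrei Vladimirescu, Borivoje Nikolic
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: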

Abstract:Analog and mixed-signal (AMS) integrated circuits (ICs) lie at the core of modern computing and communications systems. However, despite the continued rise in design complexity, advances in AMS automation remain limited. This reflects the central challenge in developing a generalized optimization method applicable across diverse circuit design spaces, many of which are distinct, constrained, and non-differentiable. To address this, our work casts circuit design as a graph generation problem and introduces a novel method of AMS synthesis driven by deep reinforcement learning (AstRL). Based on a policy-gradient approach, AstRL generates circuits directly optimized for user-specified targets within a simulator-embedded environment that provides ground-truth feedback during training. Through behavioral-cloning and discriminator-based similarity rewards, our method demonstrates, for the first time, an expert-aligned paradigm for generalized circuit generation validated in simulation. Importantly, the proposed approach operates at the level of individual transistors, enabling highly expressive, fine-grained topology generation. Strong inductive biases encoded in the action space and environment further drive structurally consistent and valid generation. Experimental results for three realistic design tasks illustrate substantial improvements in conventional design metrics over state-of-the-art baselines, with 100% of generated designs being structurally correct and over 90% demonstrating required functionality.

[AI-47] Rational Neural Networks have Expressivity Advantages

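【Quick read】: This paper addresses the trade-off between the expressivity of activation functions and parameter efficiency in neural networks, that is, how to substantially reduce parameter count while maintaining performance. The core of the solution is to use trainable low-degree rational functions as activations. Compared with conventional piecewise-linear or smooth activations (such as ReLU, ELU, and Mish), rational activations show stronger approximation power and higher parameter efficiency in both theory and practice. In theory, for an error target ε > 0, a rational-activation network can uniformly approximate a standard fixed-activation network on compact domains with only poly(log log(1/ε)) overhead in size, whereas the converse requires Ω(log(1/ε)) parameters in the worst case. In practice, rational activations integrate seamlessly into existing architectures and training pipelines and match or outperform fixed activations under identical conditions.

Link: https://arxiv.org/abs/2602.12390
Authors: Maosen Tang, Alex Townsend
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: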

Abstract:We study neural networks with trainable low-degree rational activation functions and show that they are more expressive and parameter-efficient than modern piecewise-linear and smooth activations such as ELU, LeakyReLU, LogSigmoid, PReLU, ReLU, SELU, CELU, Sigmoid, SiLU, Mish, Softplus, Tanh, Softmin, Softmax, and LogSoftmax. For an error target of $\varepsilon > 0$, we establish approximation-theoretic separations: Any network built from standard fixed activations can be uniformly approximated on compact domains by a rational-activation network with only $\mathrm{poly}(\log\log(1/\varepsilon))$ overhead in size, while the converse provably requires $\Omega(\log(1/\varepsilon))$ parameters in the worst case. This exponential gap persists at the level of full networks and extends to gated activations and transformer-style nonlinearities. In practice, rational activations integrate seamlessly into standard architectures and training pipelines, allowing rationals to match or outperform fixed activations under identical architectures and optimizers.
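A rational activation is a ratio of two low-degree polynomials whose coefficients are trained along with the network. The sketch below evaluates one such function in plain Python. The pole-free denominator `1 + |Q(x)|` is a common safeguard assumed here for illustration; the abstract does not state which parameterization the paper uses, and the coefficient values are arbitrary.

```python
def horner(coeffs, x):
    """Evaluate a polynomial with coefficients in ascending degree at x."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def rational(x, p, q):
    """Evaluate P(x) / (1 + |q[0]*x + q[1]*x**2 + ...|).

    `p` holds numerator coefficients from the constant term upward;
    `q` holds denominator coefficients starting at the linear term,
    so the denominator is always at least 1 and never vanishes.
    """
    return horner(p, x) / (1.0 + abs(x * horner(q, x)))


# Arbitrary coefficients giving x / (1 + x**2).
p, q = [0.0, 1.0], [0.0, 1.0]
print(rational(1.0, p, q))   # 0.5
print(rational(0.0, p, q))   # 0.0
print(rational(-2.0, p, q))  # -0.4
```

In a network, `p` and `q` would be trainable parameters, typically shared across the units of a layer, so the activation shape itself is learned.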

[AI-48] Why Deep Jacobian Spectra Separate: Depth-Induced Scaling and Singular-Vector Alignment

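[Quick read]: This paper tackles the question of why gradient-based training of deep neural networks exhibits strong implicit bias, a question made hard by the fact that tractable singular-value dynamics are typically available only for balanced deep linear models. The key to its approach is a new perspective built on two theoretically grounded and empirically testable signatures of deep Jacobians: depth-induced exponential scaling of the ordered singular values and strong spectral separation. Using a fixed-gates view of piecewise-linear networks, the Jacobian is modeled as a product of masked linear maps within a single activation region. The authors prove that Lyapunov exponents govern the top singular values at initialization, and give closed-form expressions and finite-depth corrections in a tractable masked model. They further show that strong spectral separation forces singular-vector alignment in matrix products, so intermediate Jacobians approximately share a singular basis. Together these results yield an approximation regime in which singular-value dynamics effectively decouple, reproducing classical deep-linear analyses without the usual balancing assumption. Experiments in fixed-gates settings validate the predicted scaling, alignment, and dynamics, supporting emergent low-rank Jacobian structure as a driver of implicit bias.

Link: https://arxiv.org/abs/2602.12384
Authors: Nathanaël Haas, François Gatine, Augustin M Cosse, Zied Bouraoui
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract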

Abstract:Understanding why gradient-based training in deep networks exhibits strong implicit bias remains challenging, in part because tractable singular-value dynamics are typically available only for balanced deep linear models. We propose an alternative route based on two theoretically grounded and empirically testable signatures of deep Jacobians: depth-induced exponential scaling of ordered singular values and strong spectral separation. Adopting a fixed-gates view of piecewise-linear networks, where Jacobians reduce to products of masked linear maps within a single activation region, we prove the existence of Lyapunov exponents governing the top singular values at initialization, give closed-form expressions in a tractable masked model, and quantify finite-depth corrections. We further show that sufficiently strong separation forces singular-vector alignment in matrix products, yielding an approximately shared singular basis for intermediate Jacobians. Together, these results motivate an approximation regime in which singular-value dynamics become effectively decoupled, mirroring classical balanced deep-linear analyses without requiring balancing. Experiments in fixed-gates settings validate the predicted scaling, alignment, and resulting dynamics, supporting a mechanistic account of emergent low-rank Jacobian structure as a driver of implicit bias.
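
The depth-induced exponential scaling can be illustrated numerically. The sketch below is not the paper's model: it assumes i.i.d. Gaussian weights and an independent random 0/1 gate per unit as a stand-in for a frozen activation pattern, and it returns a finite-depth estimate of the top Lyapunov exponent by tracking how fast the norm of a vector grows through the product. All parameter names are hypothetical.

```python
import math
import random


def top_lyapunov_exponent(depth, width, keep_prob, scale, seed):
    """Estimate (1/depth) * log ||J v|| for J = product of masked Gaussian layers.

    Each layer is D W: W has i.i.d. N(0, scale^2 / width) entries and D is a
    0/1 diagonal mask (an assumed stand-in for a fixed ReLU gate pattern).
    Raises ValueError in the unlikely event that a mask zeroes every unit.
    """
    rng = random.Random(seed)
    v = [1.0 / math.sqrt(width)] * width  # unit-norm start vector
    std = scale / math.sqrt(width)
    log_growth = 0.0
    for _ in range(depth):
        new_v = []
        for _row in range(width):
            gate = 1.0 if rng.random() < keep_prob else 0.0
            pre = sum(rng.gauss(0.0, std) * vj for vj in v)
            new_v.append(gate * pre)
        norm = math.sqrt(sum(x * x for x in new_v))
        log_growth += math.log(norm)  # accumulate growth, then renormalize
        v = [x / norm for x in new_v]
    return log_growth / depth
```

Renormalizing at every layer keeps the computation stable at large depth. Multiplying every weight by a constant c shifts the estimate by log c, which is the exponential-in-depth scaling of the top singular value in its simplest form.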

[AI-49] Value Bonuses using Ensemble Errors for Exploration in Reinforcement Learning

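[Quick read]: This paper addresses inefficient exploration in reinforcement learning (RL), in particular the lack of any incentive for first visits to a state-action pair in conventional exploration methods based on optimistic value estimates. Existing approaches typically raise the value bonus only after a higher reward bonus has been observed, so they cannot steer the agent toward state-action pairs it has not yet visited. The key to the solution is an algorithm called Value Bonuses with Ensemble errors (VBE). It maintains an ensemble of random action-value functions (RQFs) and uses their estimation errors to build value bonuses with first-visit optimism. The design ensures the bonus is high on a first visit and can decay to zero with repeated visits, which supports deep exploration. Experiments show that VBE outperforms Bootstrap DQN and the reward-bonus methods RND and ACB on classic exploration environments, and that it scales to more complex settings such as Atari.

Link: https://arxiv.org/abs/2602.12375
Authors: Abdul Wahab, Raksha Kumaraswamy, Martha White
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at Reinforcement Learning Conference (RLC) 2025

Click to view abstract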

Abstract:Optimistic value estimates provide one mechanism for directed exploration in reinforcement learning (RL). The agent acts greedily with respect to an estimate of the value plus what can be seen as a value bonus. The value bonus can be learned by estimating a value function on reward bonuses, propagating local uncertainties around rewards. However, this approach only increases the value bonus for an action retroactively, after seeing a higher reward bonus from that state and action. Such an approach does not encourage the agent to visit a state and action for the first time. In this work, we introduce an algorithm for exploration called Value Bonuses with Ensemble errors (VBE), that maintains an ensemble of random action-value functions (RQFs). VBE uses the errors in the estimation of these RQFs to design value bonuses that provide first-visit optimism and deep exploration. The key idea is to design the rewards for these RQFs in such a way that the value bonus can decrease to zero. We show that VBE outperforms Bootstrap DQN and two reward bonus approaches (RND and ACB) on several classic environments used to test exploration and provide demonstrative experiments that it can scale easily to more complex environments like Atari.

[AI-50] Policy4OOD: A Knowledge-Guided World Model for Policy Intervention Simulation against the Opioid Overdose Crisis

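[Quick read]: This paper addresses the difficulty of evaluating policy interventions in the US opioid crisis, where multiple policies interact within a dynamic system and a single intervention can inadvertently amplify another risk pathway. The core challenge is to support three capabilities: forecasting future outcomes, counterfactual reasoning about alternative past decisions, and optimization over candidate interventions. The key to the solution is Policy4OOD, a knowledge-guided spatio-temporal world model that jointly encodes policy knowledge graphs, state-level spatial dependencies, and socioeconomic time series into a policy-conditioned Transformer, unifying the three capabilities. Once trained, the model serves as a simulator: forecasting requires only a forward pass, counterfactual analysis substitutes alternative policy encodings in the historical sequence, and policy optimization applies Monte Carlo Tree Search. Experiments show that spatial dependencies and structured policy knowledge significantly improve forecasting accuracy, supporting the framework's potential for data-driven public health decision support.

Link: https://arxiv.org/abs/2602.12373
Authors: Yijun Ma, Zehong Wang, Weixiang Sun, Zheyuan Zhang, Kaiwen Shi, Nitesh Chawla, Yanfang Ye
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:

Click to view abstract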

Abstract: The opioid epidemic remains one of the most severe public health crises in the United States, yet evaluating policy interventions before implementation is difficult: multiple policies interact within a dynamic system where targeting one risk pathway may inadvertently amplify another. We argue that effective opioid policy evaluation requires three capabilities – forecasting future outcomes under current policies, counterfactual reasoning about alternative past decisions, and optimization over candidate interventions – and propose to unify them through world modeling. We introduce Policy4OOD, a knowledge-guided spatio-temporal world model that addresses three core challenges: what policies prescribe, where effects manifest, and when effects unfold. Policy4OOD jointly encodes policy knowledge graphs, state-level spatial dependencies, and socioeconomic time series into a policy-conditioned Transformer that forecasts future opioid outcomes. Once trained, the world model serves as a simulator: forecasting requires only a forward pass, counterfactual analysis substitutes alternative policy encodings in the historical sequence, and policy optimization employs Monte Carlo Tree Search over the learned simulator. To support this framework, we construct a state-level monthly dataset (2019–2024) integrating opioid mortality, socioeconomic indicators, and structured policy encodings. Experiments demonstrate that spatial dependencies and structured policy knowledge significantly improve forecasting accuracy, validating each architectural component and the potential of world modeling for data-driven public health decision support.

[AI-51] A Theoretical Framework for Adaptive Utility-Weighted Benchmarking

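[Quick read]: This paper addresses the limitations of conventional benchmarking for modern AI systems such as large language models: existing practice relies heavily on shared tasks, metrics, and leaderboards, which struggle to reflect the differing priorities of multiple stakeholders in complex sociotechnical settings. The key to the solution is a theoretical framework that treats benchmarking as a multilayer, adaptive network linking evaluation metrics, model components, and stakeholder groups through weighted interactions. Using conjoint-derived utilities and a human-in-the-loop update rule, it embeds human trade-offs into the benchmark structure so that benchmarks can evolve dynamically while remaining stable and interpretable. The framework recovers classical leaderboards as a special case and provides a theoretical basis for more context-aware evaluation protocols, pointing toward more accountable, human-aligned AI evaluation.

Link: https://arxiv.org/abs/2602.12356
Authors: Philip Waggoner
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, no figures, 40 equations

Click to view abstract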

Abstract:Benchmarking has long served as a foundational practice in machine learning and, increasingly, in modern AI systems such as large language models, where shared tasks, metrics, and leaderboards offer a common basis for measuring progress and comparing approaches. As AI systems are deployed in more varied and consequential settings, though, there is growing value in complementing these established practices with a more holistic conceptualization of what evaluation should represent. Of note, recognizing the sociotechnical contexts in which these systems operate invites an opportunity for a deeper view of how multiple stakeholders and their unique priorities might inform what we consider meaningful or desirable model behavior. This paper introduces a theoretical framework that reconceptualizes benchmarking as a multilayer, adaptive network linking evaluation metrics, model components, and stakeholder groups through weighted interactions. Using conjoint-derived utilities and a human-in-the-loop update rule, we formalize how human tradeoffs can be embedded into benchmark structure and how benchmarks can evolve dynamically while preserving stability and interpretability. The resulting formulation generalizes classical leaderboards as a special case and provides a foundation for building evaluation protocols that are more context aware, resulting in new robust tools for analyzing the structural properties of benchmarks, which opens a path toward more accountable and human-aligned evaluation.

[AI-52] Intrinsic Credit Assignment for Long Horizon Interaction

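[Quick read]: This paper addresses how agents can navigate and make decisions under uncertainty over long horizons. The core difficulty is that conventional reinforcement learning (RL) relies on outcome-based rewards, which give little usable credit assignment for intermediate steps and so limit performance in complex, highly uncertain settings. The key to the solution is $\Delta$Belief-RL, which uses the change in the probability the language model itself assigns to the target solution ($\Delta$Belief) as an intrinsic reward for intermediate actions. Trained on synthetic interaction data, the method teaches information-seeking behavior that consistently outperforms purely outcome-based rewards and generalizes to out-of-distribution applications such as customer service and personalization. Performance keeps improving as test-time interactions scale beyond the training horizon, with interaction efficiency increasing even on Pass@k metrics.

Link: https://arxiv.org/abs/2602.12342
Authors: Ilze Amanda Auzina, Joschka Strüber, Sergio Hernández-Gutiérrez, Shashwat Goel, Ameya Prabhu, Matthias Bethge
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 12 figures

Click to view abstract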

Abstract: How can we train agents to navigate uncertainty over long horizons? In this work, we propose $\Delta$Belief-RL, which leverages a language model’s own intrinsic beliefs to reward intermediate progress. Our method utilizes the change in the probability an agent assigns to the target solution for credit assignment. By training on synthetic interaction data, $\Delta$Belief-RL teaches information-seeking capabilities that consistently outperform purely outcome-based rewards for Reinforcement Learning, with improvements generalizing to out-of-distribution applications ranging from customer service to personalization. Notably, the performance continues to improve as we scale test-time interactions beyond the training horizon, with interaction-efficiency increasing even on Pass@k metrics. Overall, our work introduces a scalable training strategy for navigating uncertainty over a long-horizon, by enabling credit assignment to intermediate actions via intrinsic $\Delta$Belief rewards.
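
The credit-assignment idea can be sketched in a few lines of Python. This is an illustration under an assumption: the reward for a turn is taken as the raw difference in the probability assigned to the target solution before and after that turn. The abstract does not say whether the paper uses raw probabilities, log-probabilities, or some scaling, and the trajectory below is hypothetical.

```python
def delta_belief_rewards(beliefs):
    """Per-turn intrinsic rewards from a belief trajectory.

    beliefs[t] is the probability the agent assigns to the target solution
    after turn t, with beliefs[0] the prior before any interaction.
    """
    return [after - before for before, after in zip(beliefs, beliefs[1:])]


# Hypothetical trajectory: one uninformative question, then two useful ones.
beliefs = [0.10, 0.10, 0.35, 0.80]
rewards = delta_belief_rewards(beliefs)
```

Because the differences telescope, the per-turn rewards sum to the total change in belief, so each intermediate action receives a share of the overall progress instead of waiting for a single outcome reward at the end.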

[AI-53] ForeAct: Steering Your VLA with Efficient Visual Foresight Planning

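[Quick read]: This paper addresses the challenge Vision-Language-Action (VLA) models face in turning high-level language instructions into executable actions in open-world environments. The core problem is that VLA models struggle to reason about and execute multi-step tasks in complex, dynamic real-world scenes, especially when a task demands both semantic understanding and fine visuo-motor control. The key to the solution is Visual Foresight Planning (ForeAct), a general and efficient planner that guides the VLA step by step with an imagined future observation (foresight image) and subtask descriptions. Its main components are an efficient foresight image generator that produces a high-quality $640 \times 480$ future observation within 0.33 s on an H100 GPU, pretrained on multi-task, cross-embodiment data, and a vision-language model that reasons over the task and decomposes it into subtasks. The planner integrates with existing VLAs without architectural changes and raises the average task success rate to 87.4%, more than 40 percentage points above the baseline.

Link: https://arxiv.org/abs/2602.12322
Authors: Zhuoyang Zhang, Shang Yang, Qinghao Hu, Luke J. Huang, James Hou, Yufei Sun, Yao Lu, Song Han
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract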

Abstract: Vision-Language-Action (VLA) models convert high-level language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present Visual Foresight Planning (ForeAct), a general and efficient planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuo-motor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image generation module that predicts a high-quality $640 \times 480$ future observation from the current visual input and language instruction within only 0.33 s on an H100 GPU, together with a vision-language model that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on over 1 million multi-task, cross-embodiment episodes, enabling it to learn robust embodied dynamics. We evaluate our framework on a benchmark that consists of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4%, demonstrating a +40.9% absolute improvement over the $\pi_0$ baseline (46.5%) and a +30.3% absolute improvement over $\pi_0$ augmented with textual subtask guidance (57.1%).

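[AI-54] Perceptual Self-Reflection in Agentic Physics Simulation Code Generation

[Quick read]: This paper addresses the "oracle gap" in physics simulation code generation: conventional testing cannot detect code that is syntactically correct but physically wrong, so generated simulations can misbehave. The key to the solution is a multi-agent framework built on perceptual self-reflection. Rendered animation frames are fed to a vision-language model for visual validation, instead of inspecting code structure, which allows iterative refinement of physical accuracy. According to the paper, the majority of tested scenarios across several physics domains reach their target accuracy thresholds, supporting the claim that a visually driven self-correction loop outperforms single-shot generation.

Link: https://arxiv.org/abs/2602.12311
Authors: Prashant Shende, Bradley Camburn
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 15 pages, 2 figures, 2 tables. Introduces a multi-agent architecture for physics simulation code generation with perceptual self-reflection via vision-based validation. Includes qualitative evaluation across multiple physics domains

Click to view abstract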

Abstract: We present a multi-agent framework for generating physics simulation code from natural language descriptions, featuring a novel perceptual self-reflection mechanism for validation. The system employs four specialized agents: a natural language interpreter that converts user requests into physics-based descriptions; a technical requirements generator that produces scaled simulation parameters; a physics code generator with automated self-correction; and a physics validator that implements perceptual self-reflection. The key innovation is perceptual validation, which analyzes rendered animation frames using a vision-capable language model rather than inspecting code structure directly. This approach addresses the "oracle gap" where syntactically correct code produces physically incorrect behavior, a limitation that conventional testing cannot detect. We evaluate the system across seven domains including classical mechanics, fluid dynamics, thermodynamics, electromagnetics, wave physics, reaction-diffusion systems, and non-physics data visualization. The perceptual self-reflection architecture demonstrates substantial improvement over single-shot generation baselines, with the majority of tested scenarios achieving target physics accuracy thresholds. The system exhibits robust pipeline stability with consistent code self-correction capability, operating at approximately $0.20 per animation. These results validate our hypothesis that feeding visual simulation outputs back to a vision-language model for iterative refinement significantly outperforms single-shot code generation for physics simulation tasks and highlights the potential of agentic AI to support engineering workflows and physics data generation pipelines.

[AI-55] OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

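[Quick read]: Mainstream video customization methods can only generate identity-consistent videos from reference images and text prompts. This paper proposes a more challenging task, sync audio-video customization, which customizes video identity and audio timbre together: the generated video must preserve the identity in the reference image while the speech imitates the timbre of the reference audio, with the spoken content freely controlled by a user-provided text prompt. The key to the solution is the OmniCustom framework, with three main contributions: (1) separate reference identity and audio LoRA modules, operating through the self-attention layers of the base audio-video generation model, control identity and timbre; (2) a contrastive learning objective that treats flows predicted with reference conditions as positives and flows predicted without them as negatives, strengthening preservation of identity and timbre; (3) training on a purpose-built large-scale, high-quality audio-visual human dataset, enabling zero-shot generation jointly controlled by image identity, audio timbre, and text prompt.

Link: https://arxiv.org/abs/2602.12304
Authors: Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 16 pages

Click to view abstract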

Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^r$ and a reference audio $A^r$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model's ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.

[AI-56] Adaptive traffic signal control optimization using a novel road partition and multi-channel state representation method

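[Quick read]: This paper addresses the weak optimization performance of conventional traffic signal control under dynamic traffic flow, in particular the difficulty fixed cycles and static state representations have in adapting to changing road-network conditions. The key to the solution is an adaptive control method that uses a Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) together with variable cell length and a multi-channel state representation. The state vector has three channels (vehicle count, average speed, and space occupancy), and a weighted, normalized reward function combines waiting time, speed, and fuel consumption. The paper reports that this representation outperforms fixed cell length and evaluates its transferability across ranges.

Link: https://arxiv.org/abs/2602.12296
Authors: Maojiang Deng, Shoufeng Lu, Jiazhao Shi, Wen Zhang
Affiliation: unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract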

Abstract: This study proposes a novel adaptive traffic signal control method leveraging a Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) to optimize signal timing by integrating variable cell length and multi-channel state representation. A road partition formula consisting of the sum of logarithmic and linear functions is proposed. The state variables are a vector composed of three channels: the number of vehicles, the average speed, and space occupancy. The set of available signal phases constitutes the action space; the selected phase is executed with a fixed green time. The reward function is formulated using the absolute values of key traffic state metrics: waiting time, speed, and fuel consumption. Each metric is normalized by a typical maximum value and assigned a weight that reflects its priority and optimization direction. The simulation results, using SUMO-TensorFlow-Python, demonstrate a cross-range transferability evaluation and show that the proposed variable cell length and multi-channel state representation method outperforms fixed cell length in optimization performance.
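
The reward described above (absolute metric values, each normalized by a typical maximum and given a signed weight) can be sketched as follows. The abstract gives neither the weights nor the normalizing constants, so every number and name below is a hypothetical placeholder chosen for illustration.

```python
def traffic_reward(metrics, max_values, weights):
    """Weighted sum of metrics, each scaled by a typical maximum value.

    A negative weight penalizes a metric (waiting time, fuel); a positive
    weight rewards it (speed). All numbers used below are hypothetical.
    """
    return sum(weights[name] * abs(metrics[name]) / max_values[name]
               for name in weights)


max_values = {"waiting_time": 300.0, "speed": 15.0, "fuel": 50.0}
weights = {"waiting_time": -0.5, "speed": 0.3, "fuel": -0.2}

congested = traffic_reward(
    {"waiting_time": 240.0, "speed": 3.0, "fuel": 40.0}, max_values, weights)
free_flow = traffic_reward(
    {"waiting_time": 30.0, "speed": 12.0, "fuel": 10.0}, max_values, weights)
```

Normalizing first keeps the three metrics on a comparable scale, so the weights express priority and direction only. With these placeholder values the free-flowing intersection scores higher than the congested one.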

[AI-57] Energy-Aware Reinforcement Learning for Robotic Manipulation of Articulated Components in Infrastructure Operation and Maintenance

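[Quick read]: This paper addresses robotic manipulation of articulated components (such as doors, drawers, and valves) in the operation and maintenance (OM) of intelligent infrastructure, where existing methods lack explicit energy modeling and multi-objective optimization, limiting their scalability and sustainability in long-term deployment. The key to the solution is an articulation-agnostic, energy-aware reinforcement learning framework. It combines part-guided 3D perception, weighted point sampling, and PointNet-based encoding to obtain a compact geometric representation across heterogeneous articulated objects. Manipulation is formulated as a Constrained Markov Decision Process (CMDP) with an energy constraint, and a Lagrangian-constrained Soft Actor-Critic algorithm explicitly regulates actuation energy, yielding end-to-end trained policies with lower energy use, high success rates, and fewer steps.

Link: https://arxiv.org/abs/2602.12288
Authors: Xiaowen Tao, Yinuo Wang, Haitao Ding, Yuanyang Qi, Ziyu Song
Affiliation: unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract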

Abstract:With the growth of intelligent civil infrastructure and smart cities, operation and maintenance (OM) increasingly requires safe, efficient, and energy-conscious robotic manipulation of articulated components, including access doors, service drawers, and pipeline valves. However, existing robotic approaches either focus primarily on grasping or target object-specific articulated manipulation, and they rarely incorporate explicit actuation energy into multi-objective optimisation, which limits their scalability and suitability for long-term deployment in real OM settings. Therefore, this paper proposes an articulation-agnostic and energy-aware reinforcement learning framework for robotic manipulation in intelligent infrastructure OM. The method combines part-guided 3D perception, weighted point sampling, and PointNet-based encoding to obtain a compact geometric representation that generalises across heterogeneous articulated objects. Manipulation is formulated as a Constrained Markov Decision Process (CMDP), in which actuation energy is explicitly modelled and regulated via a Lagrangian-based constrained Soft Actor-Critic scheme. The policy is trained end-to-end under this CMDP formulation, enabling effective articulated-object operation while satisfying a long-horizon energy budget. Experiments on representative OM tasks demonstrate 16%-30% reductions in energy consumption, 16%-32% fewer steps to success, and consistently high success rates, indicating a scalable and sustainable solution for infrastructure OM manipulation.
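
The Lagrangian mechanism behind a constrained actor-critic can be sketched without any RL library. The snippet below shows a generic dual-ascent update of the multiplier and the penalized reward it induces. It is a textbook-style illustration, not the paper's implementation; the budget, step size, and episode energies are hypothetical.

```python
def update_multiplier(lam, episode_energy, energy_budget, step_size):
    """Dual ascent: raise lam when the budget is exceeded, lower it otherwise."""
    return max(0.0, lam + step_size * (episode_energy - energy_budget))


def penalized_reward(task_reward, step_energy, lam):
    """Reward seen by the actor-critic: task reward minus priced energy."""
    return task_reward - lam * step_energy


lam = 0.0
for episode_energy in (12.0, 11.0, 9.0, 8.0):  # hypothetical episode totals
    lam = update_multiplier(lam, episode_energy,
                            energy_budget=10.0, step_size=0.1)
```

The multiplier acts as an adaptive price on energy: it grows while the policy overspends and falls back toward zero once episodes stay within budget, so no fixed energy penalty has to be tuned by hand.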

[AI-58] Peak Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

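[Quick read]: This paper addresses the detection of multi-turn prompt injection attacks, specifically how to aggregate per-turn risk scores into a reliable conversation-level risk score without invoking a large language model (LLM). The intuitive weighted-average approach has a fundamental flaw: its result does not depend on the number of turns, so a persistent 20-turn attack scores the same as a single suspicious turn, and persistence and accumulation go unmeasured. The proposed solution is peak + accumulation scoring, a formula that combines three elements: peak single-turn risk, persistence ratio, and category diversity. On 10,654 conversations the method achieves 90.8% recall at a 1.20% false positive rate, with an F1 of 85.9%. A sensitivity analysis over the persistence parameter reveals a phase transition near $\rho \approx 0.4$, where recall jumps 12 percentage points with a negligible increase in false positives.

Link: https://arxiv.org/abs/2602.11247
Authors: J Alex Corll
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract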

Abstract: Multi-turn prompt injection attacks distribute malicious intent across multiple conversation turns, exploiting the assumption that each turn is evaluated independently. While single-turn detection has been extensively studied, no published formula exists for aggregating per-turn pattern scores into a conversation-level risk score at the proxy layer – without invoking an LLM. We identify a fundamental flaw in the intuitive weighted-average approach: it converges to the per-turn score regardless of turn count, meaning a 20-turn persistent attack scores identically to a single suspicious turn. Drawing on analogies from change-point detection (CUSUM), Bayesian belief updating, and security risk-based alerting, we propose peak + accumulation scoring – a formula combining peak single-turn risk, persistence ratio, and category diversity. Evaluated on 10,654 multi-turn conversations – 588 attacks sourced from WildJailbreak adversarial prompts and 10,066 benign conversations from WildChat – the formula achieves 90.8% recall at 1.20% false positive rate with an F1 of 85.9%. A sensitivity analysis over the persistence parameter reveals a phase transition at $\rho \approx 0.4$, where recall jumps 12 percentage points with negligible FPR increase. We release the scoring algorithm, pattern library, and evaluation harness as open source.
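
The flaw in averaging is easy to reproduce. The Python sketch below first shows that a weighted average cannot tell one suspicious turn from twenty, then computes the three ingredients the abstract names. How the paper combines them into a single score is not stated in the abstract, so no combination is attempted here; the threshold, category labels, and scores are hypothetical.

```python
def weighted_average(scores, weights):
    """Conversation score as a weighted mean of per-turn scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def session_features(scores, categories, threshold=0.5):
    """Peak risk, persistence ratio, and category diversity for one conversation.

    A turn counts as flagged when its score reaches the (assumed) threshold.
    """
    flagged = [c for s, c in zip(scores, categories) if s >= threshold]
    return {
        "peak": max(scores),
        "persistence_ratio": len(flagged) / len(scores),
        "category_diversity": len(set(flagged)),
    }


single_turn = weighted_average([0.6], [1.0])
persistent = weighted_average([0.6] * 20, [1.0] * 20)

features = session_features(
    [0.1, 0.7, 0.2, 0.9],
    ["none", "override", "none", "exfiltration"],
)
```

Here `single_turn` and `persistent` are equal up to rounding, which is the convergence problem the abstract describes. The feature dictionary exposes what the average hides: how high the worst turn was, what share of turns were flagged, and how many distinct attack categories appeared.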

[AI-59] Ultrasound-Guided Real-Time Spinal Motion Visualization for Spinal Instability Assessment

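[Quick read]: This paper addresses the lack of an efficient, real-time, low-radiation way to visualize three-dimensional (3D) motion in the clinical assessment of spinal instability. The gold standard, dynamic X-ray imaging, provides only 2D motion information, while CT and cone beam CT (CBCT) capture 3D structure but cannot efficiently capture motion. The key to the solution is a robotic ultrasound framework: the neutral spine model obtained from preoperative CBCT is registered to ultrasound data acquired at maximal spinal bending, using a kinematic model for coarse registration followed by iterative closest point (ICP) for fine registration, with the kinematic parameters optimized from the registration results. Real-time ultrasound tracking then drives interpolation between the two states to produce continuous 3D spinal motion, giving radiation-reduced 3D visualization.

Link: https://arxiv.org/abs/2602.12917
Authors: Feng Li, Yuan Bi, Tianyu Song, Zhongliang Jiang, Nassir Navab
Affiliation: unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract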

Abstract: Purpose: Spinal instability is a widespread condition that causes pain, fatigue, and restricted mobility, profoundly affecting patients’ quality of life. In clinical practice, the gold standard for diagnosis is dynamic X-ray imaging. However, X-ray provides only 2D motion information, while 3D modalities such as computed tomography (CT) or cone beam computed tomography (CBCT) cannot efficiently capture motion. Therefore, there is a need for a system capable of visualizing real-time 3D spinal motion while minimizing radiation exposure. Methods: We propose ultrasound as an auxiliary modality for 3D spine visualization. Due to acoustic limitations, ultrasound captures only the superficial spinal surface. Therefore, the partially compounded ultrasound volume is registered to preoperative 3D imaging. In this study, CBCT provides the neutral spine configuration, while robotic ultrasound acquisition is performed at maximal spinal bending. A kinematic model is applied to the CBCT-derived spine model for coarse registration, followed by ICP for fine registration, with kinematic parameters optimized based on the registration results. Real-time ultrasound motion tracking is then used to estimate continuous 3D spinal motion by interpolating between the neutral and maximally bent states. Results: The pipeline was evaluated on a bendable 3D-printed lumbar spine phantom. The registration error was $1.941 \pm 0.199$ mm and the interpolated spinal motion error was $2.01 \pm 0.309$ mm (median). Conclusion: The proposed robotic ultrasound framework enables radiation-reduced, real-time 3D visualization of spinal motion, offering a promising 3D alternative to conventional dynamic X-ray imaging for assessing spinal instability.

[AI-60] A consequence of failed sequential learning: A computational account of developmental amnesia

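[Quick read]: This paper seeks a mechanistic explanation for developmental amnesia, in which children with hippocampal atrophy retain nearly normal semantic memory while showing severely impaired episodic memory. The key to the solution is a computational model that simulates the effect of impaired sequential/spatial learning in the hippocampus on the memory system. The model reproduces the characteristic profile: impaired sequential/spatial learning severely disrupts episodic recall but leaves recognition and semantic learning intact. It also suggests that semantic learning may rely on largely random, not sequential, activation of episodic memories during consolidation, which would explain why semantic memory is spared.

Link: https://arxiv.org/abs/2602.12547
Authors: Qi Zhang
Affiliation: unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 30 pages, 5 figures and 2 tables

Click to view abstract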

Abstract: Developmental amnesia, characterized by severely impaired episodic memory and almost normal semantic memory, has been discovered to occur in children with hippocampal atrophy. This unique combination of characteristics seems to challenge the understanding that early loss of episodic memory may impede cognitive development and result in severe mental retardation. Although a few underlying mechanisms have been suggested, no computational model has been reported that is able to mimic the unique combination of characteristics. In this study, a cognitive system is presented, and developmental amnesia is demonstrated computationally in terms of impaired episodic recall, spared recognition and spared semantic learning. Impaired sequential/spatial learning ability of the hippocampus is suggested to be the cause of such amnesia. Simulation shows that impaired sequential learning may only result in severe impairment of episodic recall, but affect neither recognition ability nor semantic learning. The spared semantic learning is in line with the view that semantic learning is largely associated with the consolidation of episodic memory, a process in which episodic memory may be mostly activated randomly, instead of sequentially. Furthermore, retrograde amnesia is also simulated, and the result and its mechanism are in agreement with most computational models of amnesia reported previously.

[AI-61] Correctness, Artificial Intelligence, and the Epistemic Value of Mathematical Proof

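[Quick read]: The central question of this paper is whether a mathematical proof must be formally correct to have epistemic value. The traditional view holds that only proofs formalizable in a formal proof system have epistemic value; this paper challenges that view, arguing that formal correctness is neither necessary nor sufficient. It then presents a view of the relationship between mathematics and logic that clarifies the role of formal correctness in mathematical practice, and discusses what these arguments imply for automated theorem provers and applications of AI to mathematics.

Link: https://arxiv.org/abs/2602.12463
Authors: James Owen Weatherall, Jesse Wolfson
Affiliation: unknown
Categories: History and Overview (math.HO); Artificial Intelligence (cs.AI)
Comments: 43 pages

Click to view abstract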

Abstract:We argue that it is neither necessary nor sufficient for a mathematical proof to have epistemic value that it be “correct”, in the sense of formalizable in a formal proof system. We then present a view on the relationship between mathematics and logic that clarifies the role of formal correctness in mathematics. Finally, we discuss the significance of these arguments for recent discussions about automated theorem provers and applications of AI to mathematics.

[AI-62] Free Lunch in Medical Image Foundation Model Pre-training via Randomized Synthesis and Disentanglement

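[Quick read]: This paper addresses the data scarcity, heterogeneity, and high annotation cost that constrain the pre-training of Medical Image Foundation Models (MIFMs). The authors propose RaSD (Randomized Synthesis and Disentanglement), a scalable framework that pre-trains entirely on synthetic data. By modeling anatomical structures and appearance variations with randomized Gaussian distributions, it exposes the model to multi-scale structural and appearance perturbations, forcing it to rely on invariant, task-relevant anatomical cues instead of dataset-specific textures, which yields robust and transferable representations. The key is that the synthetic data supplies perturbations diverse enough to improve the model's generalization.

Link: https://arxiv.org/abs/2602.12317
Authors: Yuhan Wei, Yuting He, Linshan Wu, Fuxiang Huang, Junlin Hou, Hao Chen
Affiliation: unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract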

Abstract: Medical image foundation models (MIFMs) have demonstrated remarkable potential for a wide range of clinical tasks, yet their development is constrained by the scarcity, heterogeneity, and high cost of large-scale annotated datasets. Here, we propose RaSD (Randomized Synthesis and Disentanglement), a scalable framework for pre-training MIFMs entirely on synthetic data. By modeling anatomical structures and appearance variations with randomized Gaussian distributions, RaSD exposes models to sufficient multi-scale structural and appearance perturbations, forcing them to rely on invariant and task-relevant anatomical cues rather than dataset-specific textures, thereby enabling robust and transferable representation learning. We pre-trained RaSD on 1.2 million 3D volumes and 9.6 million 2D images, and extensively evaluated the resulting models across 6 imaging modalities, 48 datasets, and 56 downstream tasks. Across all evaluated downstream tasks, RaSD consistently outperforms training-from-scratch models, achieves the best performance on 17 tasks, and remains comparable to models pre-trained on large real datasets in most others. These results demonstrate the capacity of synthetic data alone to drive robust representation learning. Our findings establish a paradigm shift in medical AI, demonstrating that synthetic data can serve as a “free lunch” for scalable, privacy-preserving, and clinically generalizable foundation models.

[AI-63] Visible and Hyperspectral Imaging for Quality Assessment of Milk: Property Characterisation and Identification

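[Quick read]: This paper addresses the slow, destructive, and costly nature of conventional milk quality testing, proposing a rapid, non-destructive alternative for assessing key biochemical properties of milk (polyphenol content, antioxidant capacity, and fatty acid composition). The key to the solution is to combine visible and hyperspectral imaging with an analytical framework of eleven machine learning algorithms that relates image features to chemical measurements. Visible imaging reached 100% accuracy both in separating fresh milk from milk stored for 12 days and in separating antibiotic-treated from untreated groups, while hyperspectral imaging with a Random Forest model exceeded 95% classification accuracy for several individual fatty acids, supporting the potential of imaging combined with machine learning for rapid, non-destructive milk quality assessment.

Link: https://arxiv.org/abs/2602.12313
Authors: Massimo Martinelli, Elena Tomassi, Nafiou Arouna, Morena Gabriele, Laryssa Perez Fabbri, Luisa Pozzo, Giuseppe Conte, Davide Moroni, Laura Pucci
Affiliation: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to Journal of Food Engineering

Click to view abstract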

Abstract: Rapid and non-destructive assessment of milk quality is crucial to ensuring both nutritional value and food safety. In this study, we investigated the potential of visible and hyperspectral imaging as cost-effective and quick-response alternatives to conventional chemical analyses for characterizing key properties of cow's milk. A total of 52 milk samples were analysed to determine their biochemical composition (polyphenols, antioxidant capacity, and fatty acids) using spectrophotometer methods and standard gas-liquid and high-performance liquid chromatography (GLC/HPLC). Concurrently, visible (RGB) images were captured using a standard smartphone, and hyperspectral data were acquired in the near-infrared range. A comprehensive analytical framework, including eleven different machine learning algorithms, was employed to correlate imaging features with biochemical measurements. Analysis of visible images accurately distinguished between fresh samples and those stored for 12 days (100 percent accuracy) and achieved perfect discrimination between antibiotic-treated and untreated groups (100 percent accuracy). Moreover, image-derived features enabled perfect prediction of the polyphenols content and the antioxidant capacity using an XGBoost model. Hyperspectral imaging further achieved classification accuracies exceeding 95 percent for several individual fatty acids and 94.8 percent for treatment groups using a Random Forest model. These findings demonstrate that both visible and hyperspectral imaging, when coupled with machine learning, are powerful, non-invasive tools for the rapid assessment of milk's chemical and nutritional profiles, highlighting the strong potential of imaging-based approaches for milk quality assessment.

Machine Learning

[LG-0] Learning functional components of PDEs from data using neural networks

Link: https://arxiv.org/abs/2602.13174
Authors: Torkel E. Loman, Yurij Salmaniw, Antonio Leon Villares, Jose A. Carrillo, Ruth E. Baker
Subjects: Machine Learning (cs.LG); Analysis of PDEs (math.AP)
Comments: 16 pages with 6 figures. Additional 24 pages and 19 figures supplementary information

Abstract:Partial differential equations often contain unknown functions that are difficult or impossible to measure directly, hampering our ability to derive predictions from the model. Workflows for recovering scalar PDE parameters from data are well studied: here we show how similar workflows can be used to recover functions from data. Specifically, we embed neural networks into the PDE and show how, as they are trained on data, they can approximate unknown functions with arbitrary accuracy. Using nonlocal aggregation-diffusion equations as a case study, we recover interaction kernels and external potentials from steady state data. Specifically, we investigate how a wide range of factors, such as the number of available solutions, their properties, sampling density, and measurement noise, affect our ability to successfully recover functions. Our approach is advantageous because it can utilise standard parameter-fitting workflows, and in that the trained PDE can be treated as a normal PDE for purposes such as generating system predictions.

[LG-1] Learning to Approximate Uniform Facility Location via Graph Neural Networks

Link: https://arxiv.org/abs/2602.13155
Authors: Chendi Qian, Christopher Morris, Stefanie Jegelka, Christian Sohler
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments:

Abstract:There has been a growing interest in using neural networks, especially message-passing neural networks (MPNNs), to solve hard combinatorial optimization problems heuristically. However, existing learning-based approaches for hard combinatorial optimization tasks often rely on supervised training data, reinforcement learning, or gradient estimators, leading to significant computational overhead, unstable training, or a lack of provable performance guarantees. In contrast, classical approximation algorithms offer such performance guarantees under worst-case inputs but are non-differentiable and unable to adaptively exploit structural regularities in natural input distributions. We address this dichotomy with the fundamental example of Uniform Facility Location (UniFL), a variant of the combinatorial facility location problem with applications in clustering, data summarization, logistics, and supply chain design. We develop a fully differentiable MPNN model that embeds approximation-algorithmic principles while avoiding the need for solver supervision or discrete relaxations. Our approach admits provable approximation and size generalization guarantees to much larger instances than seen during training. Empirically, we show that our approach outperforms standard non-learned approximation algorithms in terms of solution quality, closing the gap with computationally intensive integer linear programming approaches. Overall, this work provides a step toward bridging learning-based methods and approximation algorithms for discrete optimization.

[LG-2] FlashSchNet: Fast and Accurate Coarse-Grained Neural Network Molecular Dynamics

Link: https://arxiv.org/abs/2602.13140
Authors: Pingzhi Li, Hongxuan Li, Zirui Liu, Xingcheng Lin, Tianlong Chen
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: Code is at this https URL

Abstract:Graph neural network (GNN) potentials such as SchNet improve the accuracy and transferability of molecular dynamics (MD) simulation by learning many-body interactions, but remain slower than classical force fields due to fragmented kernels and memory-bound pipelines that underutilize GPUs. We show that a missing principle is making GNN-MD IO-aware, carefully accounting for reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. We present FlashSchNet, an efficient and accurate IO-aware SchNet-style GNN-MD framework built on four techniques: (1) flash radial basis, which fuses pairwise distance computation, Gaussian basis expansion, and cosine envelope into a single tiled pass, computing each distance once and reusing it across all basis functions; (2) flash message passing, which fuses cutoff, neighbor gather, filter multiplication, and reduction to avoid materializing edge tensors in HBM; (3) flash aggregation, which reformulates scatter-add via CSR segment reduce, reducing atomic writes by a factor of feature dimension and enabling contention-free accumulation in both forward and backward passes; (4) channel-wise 16-bit quantization that exploits the low per-channel dynamic range in SchNet MLP weights to further improve throughput with negligible accuracy loss. On a single NVIDIA RTX PRO 6000, FlashSchNet achieves 1000 ns/day aggregate simulation throughput over 64 parallel replicas on coarse-grained (CG) protein containing 269 beads (6.5x faster than CGSchNet baseline with 80% reduction of peak memory), surpassing classical force fields (e.g. MARTINI) while retaining SchNet-level accuracy and transferability.
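
The "flash aggregation" idea in the abstract above, replacing scatter-add with a CSR segment reduction, can be illustrated on the CPU with NumPy. This is only a rough sketch of the general technique, not the authors' GPU kernels; the function names `scatter_add` and `csr_segment_sum` and the toy data layout are assumptions made for illustration.

```python
import numpy as np

def scatter_add(messages, dst, num_nodes):
    """Baseline: accumulate each edge message into its destination node.

    On a GPU this pattern needs one atomic add per edge and feature.
    """
    out = np.zeros((num_nodes, messages.shape[1]), dtype=messages.dtype)
    np.add.at(out, dst, messages)  # unbuffered add, handles repeated indices
    return out

def csr_segment_sum(messages, dst, num_nodes):
    """Same result via a segment reduction over edges sorted by destination.

    Sorting edges by destination gives a CSR-style layout in which each
    node owns one contiguous slice of messages, so every output row is
    written exactly once.
    """
    dst = np.asarray(dst)
    order = np.argsort(dst, kind="stable")
    sorted_msgs = messages[order]
    counts = np.bincount(dst, minlength=num_nodes)
    indptr = np.concatenate(([0], np.cumsum(counts)))  # CSR row pointers
    out = np.zeros((num_nodes, messages.shape[1]), dtype=messages.dtype)
    nonempty = counts > 0
    # reduceat sums sorted_msgs[start_i:start_{i+1}]; passing only the
    # starts of non-empty segments keeps the slices aligned with nodes.
    starts = indptr[:-1][nonempty]
    out[nonempty] = np.add.reduceat(sorted_msgs, starts, axis=0)
    return out

if __name__ == "__main__":
    msgs = np.arange(12, dtype=float).reshape(6, 2)  # 6 edges, 2 features
    dst = np.array([2, 0, 2, 3, 0, 2])               # destination node per edge
    print(np.allclose(scatter_add(msgs, dst, 5), csr_segment_sum(msgs, dst, 5)))
```

Both functions return the same per-node sums; the segment version simply trades a one-off sort for contention-free writes, which is the property the abstract attributes to the CSR formulation.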

[LG-3] Order Matters in Retrosynthesis: Structure-aware Generation via Reaction-Center-Guided Discrete Flow Matching

Link: https://arxiv.org/abs/2602.13136
Authors: Chenguang Wang, Zihan Zhou, Lei Bai, Tianshu Yu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Template-free retrosynthesis methods treat the task as black-box sequence generation, limiting learning efficiency, while semi-template approaches rely on rigid reaction libraries that constrain generalization. We address this gap with a key insight: atom ordering in neural representations matters. Building on this insight, we propose a structure-aware template-free framework that encodes the two-stage nature of chemical reactions as a positional inductive bias. By placing reaction center atoms at the sequence head, our method transforms implicit chemical knowledge into explicit positional patterns that the model can readily capture. The proposed RetroDiT backbone, a graph transformer with rotary position embeddings, exploits this ordering to prioritize chemically critical regions. Combined with discrete flow matching, our approach decouples training from sampling and enables generation in 20–50 steps versus 500 for prior diffusion methods. Our method achieves state-of-the-art performance on both USPTO-50k (61.2% top-1) and the large-scale USPTO-Full (51.3% top-1) with predicted reaction centers. With oracle centers, performance reaches 71.1% and 63.4% respectively, surpassing foundation models trained on 10 billion reactions while using orders of magnitude less data. Ablation studies further reveal that structural priors outperform brute-force scaling: a 280K-parameter model with proper ordering matches a 65M-parameter model without it.

[LG-4] Eventizing Traditionally Opaque Binary Neural Networks as 1-safe Petri net Models

Link: https://arxiv.org/abs/2602.13128
Authors: Mohamed Tarraf, Alex Chan, Alex Yakovlev, Rishad Shafik
Subjects: Machine Learning (cs.LG)
Comments: Pre-print of latest work

Abstract:Binary Neural Networks (BNNs) offer a low-complexity and energy-efficient alternative to traditional full-precision neural networks by constraining their weights and activations to binary values. However, their discrete, highly non-linear behavior makes them difficult to explain, validate and formally verify. As a result, BNNs remain largely opaque, limiting their suitability in safety-critical domains, where causal transparency and behavioral guarantees are essential. In this work, we introduce a Petri net (PN)-based framework that captures the BNN’s internal operations as event-driven processes. By “eventizing” their operations, we expose their causal relationships and dependencies for a fine-grained analysis of concurrency, ordering, and state evolution. Here, we construct modular PN blueprints for core BNN components including activation, gradient computation and weight updates, and compose them into a complete system-level model. We then validate the composed PN against a reference software-based BNN, verify it against reachability and structural checks to establish 1-safeness, deadlock-freeness, mutual exclusion and correct-by-construction causal sequencing, before we assess its scalability and complexity at segment, component, and system levels using the automated measurement tools in Workcraft. Overall, this framework enables causal introspection of transparent and event-driven BNNs that are amenable to formal reasoning and verification.

[LG-5] R-Diverse: Mitigating Diversity Illusion in Self-Play LLM Training

Link: https://arxiv.org/abs/2602.13103
Authors: Gengsheng Li, Jinghan He, Shijie Wang, Dan Zhang, Ruiqi Liu, Renrui Zhang, Zijun Yao, Junfeng Fang, Haiyun Guo, Jinqiao Wang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Self-play bootstraps LLM reasoning through an iterative Challenger-Solver loop: the Challenger is trained to generate questions that target the Solver’s capabilities, and the Solver is optimized on the generated data to expand its reasoning skills. However, existing frameworks like R-Zero often exhibit non-sustained improvement, where early gains degrade as self-play continues. We identify a key failure mode, Diversity Illusion, where the Solver’s training signals appear diverse yet collapse into recurring underlying patterns. It manifests as (1) Local Diversity Illusion, where diversity is enforced only within-batch, inducing cross-iteration mode cycling; and (2) Surface Diversity Illusion, where questions vary superficially but require near-identical reasoning skills. To mitigate them, we propose R-Diverse with two aligned innovations: Memory-Augmented Penalty (MAP), which uses a persistent memory bank to discourage recycling across iterations, and Skill-Aware Measurement (SAM), which evaluates diversity by the reasoning skills exercised rather than surface variation of questions. Across 10 math and general reasoning benchmarks, R-Diverse sustains gains over more iterations and consistently outperforms prior self-play methods. Code is available at this https URL.

[LG-6] Unified Multi-Domain Graph Pre-training for Homogeneous and Heterogeneous Graphs via Domain-Specific Expert Encoding

Link: https://arxiv.org/abs/2602.13075
Authors: Chundong Liang, Yongqi Huang, Dongxiao He, Peiyuan Li, Yawen Li, Di Jin, Weixiong Zhang
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 7 figures

Abstract: Graph pre-training has achieved remarkable success in recent years, delivering transferable representations for downstream adaptation. However, most existing methods are designed for either homogeneous or heterogeneous graphs, thereby hindering unified graph modeling across diverse graph types. This separation contradicts real-world applications, where mixed homogeneous and heterogeneous graphs are ubiquitous, and distribution shifts between upstream pre-training and downstream deployment are common. In this paper, we empirically demonstrate that a balanced mixture of homogeneous and heterogeneous graph pre-training benefits downstream tasks and propose a unified multi-domain Graph Pre-training method across Homogeneous and Heterogeneous graphs ($\mathrm{GPH}^2$). To address the lack of a unified encoder for homogeneous and heterogeneous graphs, we propose a Unified Multi-View Graph Construction that simultaneously encodes both without explicit graph-type-specific designs. To cope with the increased cross-domain distribution discrepancies arising from mixed graphs, we introduce domain-specific expert encoding. Each expert is independently pre-trained on a single graph to capture domain-specific knowledge, thereby shielding the pre-training encoder from the adverse effects of cross-domain discrepancies. For downstream tasks, we further design a Task-oriented Expert Fusion Strategy that adaptively integrates multiple experts based on their discriminative strengths. Extensive experiments on mixed graphs demonstrate that $\mathrm{GPH}^2$ enables stable transfer across graph types and domains, significantly outperforming existing graph pre-training methods.

[LG-7] Backdoor Attacks on Contrastive Continual Learning for IoT Systems

Link: https://arxiv.org/abs/2602.13062
Authors: Alfous Tim, Kuniyilh Simi D
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:The Internet of Things (IoT) systems increasingly depend on continual learning to adapt to non-stationary environments. These environments can include factors such as sensor drift, changing user behavior, device aging, and adversarial dynamics. Contrastive continual learning (CCL) combines contrastive representation learning with incremental adaptation, enabling robust feature reuse across tasks and domains. However, the geometric nature of contrastive objectives, when paired with replay-based rehearsal and stability-preserving regularization, introduces new security vulnerabilities. Notably, backdoor attacks can exploit embedding alignment and replay reinforcement, enabling the implantation of persistent malicious behaviors that endure through updates and deployment cycles. This paper provides a comprehensive analysis of backdoor attacks on CCL within IoT systems. We formalize the objectives of embedding-level attacks, examine persistence mechanisms unique to IoT deployments, and develop a layered taxonomy tailored to IoT. Additionally, we compare vulnerabilities across various learning paradigms and evaluate defense strategies under IoT constraints, including limited memory, edge computing, and federated aggregation. Our findings indicate that while CCL is effective for enhancing adaptive IoT intelligence, it may also elevate long-lived representation-level threats if not adequately secured.

[LG-8] Quantization-Aware Collaborative Inference for Large Embodied AI Models

Link: https://arxiv.org/abs/2602.13052
Authors: Zhonghao Lyu, Ming Xiao, Mikael Skoglund, Merouane Debbah, H. Vincent Poor
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:Large artificial intelligence models (LAIMs) are increasingly regarded as a core intelligence engine for embodied AI applications. However, the massive parameter scale and computational demands of LAIMs pose significant challenges for resource-limited embodied agents. To address this issue, we investigate quantization-aware collaborative inference (co-inference) for embodied AI systems. First, we develop a tractable approximation for quantization-induced inference distortion. Based on this approximation, we derive lower and upper bounds on the quantization rate-inference distortion function, characterizing its dependence on LAIM statistics, including the quantization bit-width. Next, we formulate a joint quantization bit-width and computation frequency design problem under delay and energy constraints, aiming to minimize the distortion upper bound while ensuring tightness through the corresponding lower bound. Extensive evaluations validate the proposed distortion approximation, the derived rate-distortion bounds, and the effectiveness of the proposed joint design. Particularly, simulations and real-world testbed experiments demonstrate the effectiveness of the proposed joint design in balancing inference quality, latency, and energy consumption in edge embodied AI systems.

[LG-9] GPTZero: Robust Detection of LLM-Generated Texts

Link: https://arxiv.org/abs/2602.13042
Authors: George Alexandru Adam, Alexander Cui, Edwin Thomas, Emily Napier, Nazar Shmatko, Jacob Schnell, Jacob Junqi Tian, Alekhya Dronavalli, Edward Tian, Dongwon Lee
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: While historical considerations surrounding text authenticity revolved primarily around plagiarism, the advent of large language models (LLMs) has introduced a new challenge: distinguishing human-authored from AI-generated text. This shift raises significant concerns, including the undermining of skill evaluations, the mass-production of low-quality content, and the proliferation of misinformation. Addressing these issues, we introduce GPTZero, a state-of-the-art industrial AI detection solution, offering reliable discernment between human and LLM-generated text. Our key contributions include: introducing a hierarchical, multi-task architecture enabling a flexible taxonomy of human and AI texts, demonstrating state-of-the-art accuracy on a variety of domains with granular predictions, and achieving superior robustness to adversarial attacks and paraphrasing via multi-tiered automated red teaming. GPTZero offers accurate and explainable detection, and educates users on its responsible use, ensuring fair and transparent assessment of text.

[LG-10] TCRL: Temporal-Coupled Adversarial Training for Robust Constrained Reinforcement Learning in Worst-Case Scenarios

Link: https://arxiv.org/abs/2602.13040
Authors: Wentao Xu, Zhongming Yao, Weihao Li, Zhenghang Song, Yumeng Song, Tianyi Li, Yushuai Li
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: Constrained Reinforcement Learning (CRL) aims to optimize decision-making policies under constraint conditions, making it highly applicable to safety-critical domains such as autonomous driving, robotics, and power grid management. However, existing robust CRL approaches predominantly focus on single-step perturbations and temporally independent adversarial models, lacking explicit modeling of robustness against temporally coupled perturbations. To tackle these challenges, we propose TCRL, a novel temporal-coupled adversarial training framework for robust constrained reinforcement learning in worst-case scenarios. First, TCRL introduces a worst-case-perceived cost constraint function that estimates safety costs under temporally coupled perturbations without the need to explicitly model adversarial attackers. Second, TCRL establishes a dual-constraint defense mechanism on the reward to counter temporally coupled adversaries while maintaining reward unpredictability. Experimental results demonstrate that TCRL consistently outperforms existing methods in terms of robustness against temporally coupled perturbation attacks across a variety of CRL tasks.

[LG-11] Probabilistic Wind Power Forecasting with Tree-Based Machine Learning and Weather Ensembles

Link: https://arxiv.org/abs/2602.13010
Authors: Max Bruninx, Diederik van Binsbergen, Timothy Verstraeten, Ann Nowé, Jan Helsen
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: Accurate production forecasts are essential to continue facilitating the integration of renewable energy sources into the power grid. This paper illustrates how to obtain probabilistic day-ahead forecasts of wind power generation via gradient boosting trees using an ensemble of weather forecasts. To this end, we perform a comparative analysis across three state-of-the-art probabilistic prediction methods (conformalised quantile regression, natural gradient boosting and conditional diffusion models), all of which can be combined with tree-based machine learning. The methods are validated using four years of data for all wind farms present within the Belgian offshore zone. Additionally, the point forecasts are benchmarked against deterministic engineering methods, using either the power curve or an advanced approach incorporating a calibrated analytical wake model. The experimental results show that the machine learning methods improve the mean absolute error by up to 53% and 33% compared to the power curve and the calibrated wake model, respectively. Considering the three probabilistic prediction methods, the conditional diffusion model is found to yield the best overall probabilistic and point estimate of wind power generation. Moreover, the findings suggest that the use of an ensemble of weather forecasts can improve point forecast accuracy by up to 23%.
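
Of the three probabilistic methods named in the abstract above, conformalised quantile regression is the simplest to show in a few lines. The sketch below follows the standard split-conformal recipe: score a held-out calibration set against the predicted lower and upper quantiles, then widen (or shrink) new intervals by one calibrated offset. The abstract does not describe the paper's exact procedure, so the function names `cqr_offset` and `conformalise` and the toy numbers are assumptions; the quantile predictions would come from any quantile regressor, for example gradient boosting trees.

```python
import numpy as np

def cqr_offset(y_cal, lo_cal, hi_cal, alpha):
    """Calibrated offset for predicted [lo, hi] intervals (split conformal).

    y_cal  : observed targets on a held-out calibration set
    lo_cal : predicted lower quantiles on that set
    hi_cal : predicted upper quantiles on that set
    alpha  : target miscoverage, e.g. 0.1 for 90% intervals
    """
    # Positive score: y fell outside the interval by that amount.
    # Negative score: y was inside, with that much slack.
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = scores.size
    # Finite-sample corrected rank of the (1 - alpha) empirical quantile.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points for this alpha
    return np.sort(scores)[k - 1]

def conformalise(lo, hi, offset):
    """Apply the calibrated offset to new quantile predictions."""
    return lo - offset, hi + offset

if __name__ == "__main__":
    # Toy calibration data (hypothetical): constant predicted interval [0, 10].
    y_cal = np.array([-1.0, 11.0, 5.0, 12.0, -3.0, 5.0, 13.0, 5.0, 5.0])
    lo_cal = np.zeros_like(y_cal)
    hi_cal = np.full_like(y_cal, 10.0)
    offset = cqr_offset(y_cal, lo_cal, hi_cal, alpha=0.5)
    print(conformalise(np.array([0.0]), np.array([10.0]), offset))
```

The same offset is applied to both ends of every new interval, so the adjustment corrects overall coverage without changing how interval width varies with the inputs.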

[LG-12] Machine Learning-Based Classification of Jhana Advanced Concentrative Absorption Meditation (ACAM-J) using 7T fMRI

Link: https://arxiv.org/abs/2602.13008
Authors: Puneet Kumar, Winson F. Z. Yang, Alakhsimar Singh, Xiaobai Li, Matthew D. Sacchet
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract: Jhana advanced concentration absorption meditation (ACAM-J) is related to profound changes in consciousness and cognitive processing, making the study of its neural correlates vital for insights into consciousness and well-being. This study evaluates whether functional MRI-derived regional homogeneity (ReHo) can be used to classify ACAM-J using machine-learning approaches. We collected group-level fMRI data from 20 advanced meditators to train the classifiers, and intensive single-case data from an advanced practitioner performing ACAM-J and control tasks to evaluate generalization. ReHo maps were computed, and features were extracted from predefined brain regions of interest. We trained multiple machine learning classifiers using stratified cross-validation to evaluate whether ReHo patterns distinguish ACAM-J from non-meditative states. Ensemble models achieved 66.82% (p < 0.05) accuracy in distinguishing ACAM-J from control conditions. Feature-importance analysis indicated that prefrontal and anterior cingulate areas contributed most to model decisions, aligning with established involvement of these regions in attentional regulation and metacognitive processes. Moreover, moderate agreement reflected in Cohen’s kappa supports the feasibility of using machine learning to distinguish ACAM-J from non-meditative states. These findings support the feasibility of machine learning for classifying advanced meditation states and motivate future research on neuromodulation and mechanistic models of advanced meditation.

[LG-13] Uncertainty in Federated Granger Causality: From Origins to Systemic Consequences

Link: https://arxiv.org/abs/2602.13004
Authors: Ayush Mohanty, Nazal Mohamed, Nagi Gebraeel
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Manuscript under review

Abstract:Granger Causality (GC) provides a rigorous framework for learning causal structures from time-series data. Recent federated variants of GC have targeted distributed infrastructure applications (e.g., smart grids) with distributed clients that generate high-dimensional data bound by data-sovereignty constraints. However, Federated GC algorithms only yield deterministic point estimates of causality and neglect uncertainty. This paper establishes the first methodology for rigorously quantifying uncertainty and its propagation within federated GC frameworks. We systematically classify sources of uncertainty, explicitly differentiating aleatoric (data noise) from epistemic (model variability) effects. We derive closed-form recursions that model the evolution of uncertainty through client-server interactions and identify four novel cross-covariance components that couple data uncertainties with model parameter uncertainties across the federated architecture. We also define rigorous convergence conditions for these uncertainty recursions and obtain explicit steady-state variances for both server and client model parameters. Our convergence analysis demonstrates that steady-state variances depend exclusively on client data statistics, thus eliminating dependence on initial epistemic priors and enhancing robustness. Empirical evaluations on synthetic benchmarks and real-world industrial datasets demonstrate that explicitly characterizing uncertainty significantly improves the reliability and interpretability of federated causal inference.

[LG-14] Multi-Dimensional Visual Data Recovery: Scale-Aware Tensor Modeling and Accelerated Randomized Computation

Link: https://arxiv.org/abs/2602.12982
Authors: Wenjin Qin, Hailin Wang, Jiangjun Peng, Jianjun Wang, Tingwen Huang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: The recently proposed fully-connected tensor network (FCTN) decomposition has demonstrated significant advantages in correlation characterization and transpositional invariance, and has achieved notable results in multi-dimensional data processing and analysis. However, existing multi-dimensional data recovery methods leveraging FCTN decomposition still have room for further enhancement, particularly in computational efficiency and modeling capability. To address these issues, we first propose an FCTN-based generalized nonconvex regularization paradigm from the perspective of gradient mapping. Then, reliable and scalable multi-dimensional data recovery models are investigated, where the model formulation is shifted from unquantized observations to coarse-grained quantized observations. Based on the alternating direction method of multipliers (ADMM) framework, we derive efficient optimization algorithms with convergence guarantees to solve the formulated models. To alleviate the computational bottleneck encountered when processing large-scale multi-dimensional data, fast and efficient randomized compression algorithms are devised by virtue of sketching techniques in numerical linear algebra. These dimensionality-reduction techniques serve as the computational acceleration core of our proposed algorithm framework. Theoretical results on approximation error upper bounds and convergence analysis for the proposed method are derived. Extensive numerical experiments illustrate the effectiveness and superiority of the proposed algorithm over other state-of-the-art methods in terms of quantitative metrics, visual quality, and running time.

[LG-15] MAUNet-Light: A Concise MAUNet Architecture for Bias Correction and Downscaling of Precipitation Estimates

Link: https://arxiv.org/abs/2602.12980
Authors: Sumanta Chandra Mishra Sharma, Adway Mitra, Auroop Ratan Ganguly
Subjects: Machine Learning (cs.LG)
Comments:

Abstract: Satellite-derived data products and climate model simulations of geophysical variables like precipitation often exhibit systematic biases compared to in-situ measurements. Bias correction and spatial downscaling are fundamental components in developing operational weather forecast systems, as they seek to improve the consistency between coarse-resolution climate model simulations or satellite-based estimates and ground-based observations. In recent years, deep learning-based models have increasingly replaced traditional statistical methods to generate high-resolution, bias-free projections of climate variables. For example, the Max-Average U-Net (MAUNet) architecture has been demonstrated for its ability to downscale precipitation estimates. The versatility and adaptability of these neural models make them highly effective across a range of applications, though this often comes at the cost of high computational and memory requirements. The aim of this research is to develop lightweight neural network architectures for both bias correction and downscaling of precipitation, for which the teacher-student based learning paradigm is explored. This research demonstrates the adaptability of MAUNet to the task of bias correction, and further introduces a compact, lightweight neural network architecture termed MAUNet-Light. The proposed MAUNet-Light model is developed by transferring knowledge from the trained MAUNet, and it is designed to perform both downscaling and bias correction with reduced computational requirements without any significant loss in accuracy compared to the state of the art.

[LG-16] Jointly Optimizing Debiased CTR and Uplift for Coupons Marketing: A Unified Causal Framework

Link: https://arxiv.org/abs/2602.12972
Authors: Siyun Yang, Shixiao Yang, Jian Wang, Di Fan, Kehe Cai, Haoyan Fu, Jiaming Zhang, Wenjin Wu, Peng Jiang
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments:

Abstract: In online advertising, marketing interventions such as coupons introduce significant confounding bias into Click-Through Rate (CTR) prediction. Observed clicks reflect a mixture of users’ intrinsic preferences and the uplift induced by these interventions. This causes conventional models to miscalibrate base CTRs, which distorts downstream ranking and billing decisions. Furthermore, marketing interventions often operate as multi-valued treatments with varying magnitudes, introducing additional complexity to CTR prediction. To address these issues, we propose the Unified Multi-Valued Treatment Network (UniMVT). Specifically, UniMVT disentangles confounding factors from treatment-sensitive representations, enabling a full-space counterfactual inference module to jointly reconstruct the debiased base CTR and intensity-response curves. To handle the complexity of multi-valued treatments, UniMVT employs an auxiliary intensity estimation task to capture treatment propensities and devises a unit uplift objective that normalizes the intervention effect. This ensures comparable estimation across the continuous coupon-value spectrum. UniMVT simultaneously achieves debiased CTR prediction for accurate system calibration and precise uplift estimation for incentive allocation. Extensive experiments on synthetic and industrial datasets demonstrate UniMVT’s superiority in both predictive accuracy and calibration. Furthermore, real-world A/B tests confirm that UniMVT significantly improves business metrics through more effective coupon distribution.

[LG-17] Ca-MCF: Category-level Multi-label Causal Feature selection

Link: https://arxiv.org/abs/2602.12961
Authors: Wanfu Gao, Yanan Wang, Yonghao Li
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 5 figures. Includes appendices

Abstract:Multi-label causal feature selection has attracted extensive attention in recent years. However, current methods primarily operate at the label level, treating each label variable as a monolithic entity and overlooking the fine-grained causal mechanisms unique to individual categories. To address this, we propose a Category-level Multi-label Causal Feature selection method named Ca-MCF. Ca-MCF utilizes label category flattening to decompose label variables into specific category nodes, enabling precise modeling of causal structures within the label space. Furthermore, we introduce an explanatory competition-based category-aware recovery mechanism that leverages the proposed Specific Category-Specific Mutual Information (SCSMI) and Distinct Category-Specific Mutual Information (DCSMI) to salvage causal features obscured by label correlations. The method also incorporates structural symmetry checks and cross-dimensional redundancy removal to ensure the robustness and compactness of the identified Markov Blankets. Extensive experiments across seven real-world datasets demonstrate that Ca-MCF significantly outperforms state-of-the-art benchmarks, achieving superior predictive accuracy with reduced feature dimensionality.

[LG-18] Nonparametric Contextual Online Bilateral Trade

Link: https://arxiv.org/abs/2602.12904
Authors: Emanuele Coccia,Martino Bernasconi,Andrea Celli
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:We study the problem of contextual online bilateral trade. At each round, the learner faces a seller-buyer pair and must propose a trade price without observing their private valuations for the item being sold. The goal of the learner is to post prices to facilitate trades between the two parties. Before posting a price, the learner observes a $d$-dimensional context vector that influences the agents’ valuations. Prior work in the contextual setting has focused on linear models. In this work, we tackle a general nonparametric setting in which the buyer’s and seller’s valuations behave according to arbitrary Lipschitz functions of the context. We design an algorithm that leverages contextual information through a hierarchical tree construction and guarantees regret $\widetilde{O}(T^{(d-1)/d})$. Remarkably, our algorithm operates under two stringent features of the setting: (1) one-bit feedback, where the learner only observes whether a trade occurred or not, and (2) strong budget balance, where the learner cannot subsidize or profit from the market participants. We further provide a matching lower bound in the full-feedback setting, demonstrating the tightness of our regret bound.

[LG-19] Contextual Online Bilateral Trade

Link: https://arxiv.org/abs/2602.12903
Authors: Romain Cosson,Federico Fusco,Anupam Gupta,Stefano Leonardi,Renato Paes Leme,Matteo Russo
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:We study repeated bilateral trade when the valuations of the sellers and the buyers are contextual. More precisely, the agents’ valuations are given by the inner product of a context vector with two unknown $d$-dimensional vectors – one for the buyers and one for the sellers. At each time step $t$, the learner receives a context and posts two prices, one for the seller and one for the buyer, and the trade happens if both agents accept their price. We study two objectives for this problem, gain from trade and profit, proving no-regret with respect to a surprisingly strong benchmark: the best omniscient dynamic strategy. In the natural scenario where the learner observes separately whether the agents accept their price – the so-called two-bit feedback – we design algorithms that achieve $O(d \log d)$ regret for gain from trade, and $O(d \log\log T + d \log d)$ regret for profit maximization. Both results are tight, up to the $\log(d)$ factor, and implement per-step budget balance, meaning that the learner never incurs negative profit. In the less informative one-bit feedback model, the learner only observes whether a trade happens or not. For this scenario, we show that the tight two-bit regret regimes are still attainable, at the cost of allowing the learner to possibly incur a small negative profit of order $O(d \log d)$, which is notably independent of the time horizon. As a final set of results, we investigate the combination of one-bit feedback and per-step budget balance. There, we design an algorithm for gain from trade that suffers regret independent of the time horizon, but exponential in the dimension $d$. For profit maximization, we maintain this exponential dependence on the dimension, which gets multiplied by a $\log T$ factor.

[LG-20] Model-Aware Rate-Distortion Limits for Task-Oriented Source Coding

Link: https://arxiv.org/abs/2602.12866
Authors: Andriy Enttsel,Vincent Corlay
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: 8 pages, 4 figures

Abstract:Task-Oriented Source Coding (TOSC) has emerged as a paradigm for efficient visual data communication in machine-centric inference systems, where bitrate, latency, and task performance must be jointly optimized under resource constraints. While recent works have proposed rate-distortion bounds for coding for machines, these results often rely on strong assumptions on task identifiability and neglect the impact of deployed task models. In this work, we revisit the fundamental limits of single-TOSC through the lens of indirect rate-distortion theory. We highlight the conditions under which existing rate-distortion bounds are achievable and show their limitations in realistic settings. We then introduce task model-aware rate-distortion bounds that account for task model suboptimality and architectural constraints. Experiments on standard classification benchmarks confirm that current learned TOSC schemes operate far from these limits, highlighting transmitter-side complexity as a key bottleneck.

[LG-21] Reliable Hierarchical Operating System Fingerprinting via Conformal Prediction

Link: https://arxiv.org/abs/2602.12825
Authors: Rubén Pérez-Jove,Osvaldo Simeone,Alejandro Pazos,Jose Vázquez-Naya
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Submitted as a preprint (not peer reviewed). 16 pages, 10 figures. Code and datasets available at: this https URL

Abstract:Operating System (OS) fingerprinting is critical for network security, but conventional methods do not provide formal uncertainty quantification mechanisms. Conformal Prediction (CP) could be directly wrapped around existing methods to obtain prediction sets with guaranteed coverage. However, a direct application of CP would treat OS identification as a flat classification problem, ignoring the natural taxonomic structure of OSs and providing brittle point predictions. This work addresses these limitations by introducing and evaluating two distinct structured CP strategies: level-wise CP (L-CP), which calibrates each hierarchy level independently, and projection-based CP (P-CP), which ensures structural consistency by projecting leaf-level sets upwards. Our results demonstrate that, while both methods satisfy validity guarantees, they expose a fundamental trade-off between level-wise efficiency and structural consistency. L-CP yields tighter prediction sets suitable for human forensic analysis but suffers from taxonomic inconsistencies. Conversely, P-CP guarantees hierarchically consistent, nested sets ideal for automated policy enforcement, albeit at the cost of reduced efficiency at coarser levels.
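To make the set-valued predictions in this abstract concrete, here is a minimal sketch of split conformal prediction for classification, followed by an upward projection of the leaf-level set in the spirit of P-CP. The nonconformity score (one minus the predicted probability), the toy taxonomy, and all function names are assumptions chosen for illustration; the abstract does not specify the paper's actual scores or models.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return sorted(scores)[k - 1]

def prediction_set(probs, qhat):
    """Keep every label whose score 1 - p(label) is at most the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}

def project_up(leaf_set, parent_of):
    """Map a leaf-level set to the set of parent nodes that cover it."""
    return {parent_of[leaf] for leaf in leaf_set}

# Hypothetical two-level taxonomy: OS version -> OS family.
PARENT = {"ubuntu-22": "linux", "debian-12": "linux", "win-11": "windows"}

# Hypothetical calibration scores: 1 - probability assigned to the true label.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70]
qhat = conformal_quantile(cal_scores, alpha=0.25)

# Hypothetical classifier output for one test flow.
test_probs = {"ubuntu-22": 0.50, "debian-12": 0.42, "win-11": 0.08}
leaf_set = prediction_set(test_probs, qhat)
family_set = project_up(leaf_set, PARENT)
```

Because the family-level set is derived from the leaf-level set, the two are consistent by construction. A level-wise variant would instead calibrate a separate threshold per level, which can give smaller sets at coarse levels but may produce a family set that disagrees with the leaf set.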

[LG-22] Closing the Loop: A Control-Theoretic Framework for Provably Stable Time Series Forecasting with LLMs

Link: https://arxiv.org/abs/2602.12756
Authors: Xingyu Zhang,Hanyun Du,Zeen Song,Jianqi Zhang,Changwen Zheng,Wenwen Qiang
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) have recently shown exceptional potential in time series forecasting, leveraging their inherent sequential reasoning capabilities to model complex temporal dynamics. However, existing approaches typically employ a naive autoregressive generation strategy. We identify a critical theoretical flaw in this paradigm: during inference, the model operates in an open-loop manner, consuming its own generated outputs recursively. This leads to inevitable error accumulation (exposure bias), where minor early deviations cascade into significant trajectory drift over long horizons. In this paper, we reformulate autoregressive forecasting through the lens of control theory, proposing F-LLM (Feedback-driven LLM), a novel closed-loop framework. Unlike standard methods that passively propagate errors, F-LLM actively stabilizes the trajectory via a learnable residual estimator (Observer) and a feedback controller. Furthermore, we provide a theoretical guarantee that our closed-loop mechanism ensures uniformly bounded error, provided the base model satisfies a local Lipschitz constraint. Extensive experiments demonstrate that F-LLM significantly mitigates error propagation, achieving good performance on time series benchmarks.

[LG-23] Hierarchical Successor Representation for Robust Transfer

Link: https://arxiv.org/abs/2602.12753
Authors: Changmin Yu,Máté Lengyel
Categories: Machine Learning (cs.LG)

Abstract:The successor representation (SR) provides a powerful framework for decoupling predictive dynamics from rewards, enabling rapid generalisation across reward configurations. However, the classical SR is limited by its inherent policy dependence: policies change due to ongoing learning, environmental non-stationarities, and changes in task demands, making established predictive representations obsolete. Furthermore, in topologically complex environments, SRs suffer from spectral diffusion, leading to dense and overlapping features that scale poorly. Here we propose the Hierarchical Successor Representation (HSR) for overcoming these limitations. By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. Applying non-negative matrix factorisation (NMF) to the HSR yields a sparse, low-rank state representation that facilitates highly sample-efficient transfer to novel tasks in multi-compartmental environments. Further analysis reveals that HSR-NMF discovers interpretable topological structures, providing a policy-agnostic hierarchical map that effectively bridges model-free optimality and model-based flexibility. Beyond providing a useful basis for task-transfer, we show that HSR’s temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.
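For readers unfamiliar with the classical successor representation that this abstract builds on, the sketch below computes it for a fixed policy and shows how values follow from it for any reward vector. It covers only the standard SR, M = sum over t of gamma^t P^t, not the hierarchical variant proposed in the paper; the three-state ring and the function names are illustrative assumptions.

```python
def successor_representation(P, gamma, iters=200):
    """Approximate M = sum_t gamma^t P^t by iterating M <- I + gamma * P @ M."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        M = [
            [identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)
        ]
    return M

def values(M, r):
    """State values under the same policy: V = M @ r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

# Hypothetical policy-induced transition matrix: a deterministic ring 0 -> 1 -> 2 -> 0.
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
M = successor_representation(P, gamma=0.5)

# Changing the reward vector only requires another matrix-vector product.
V_goal_at_2 = values(M, [0.0, 0.0, 1.0])
V_goal_at_1 = values(M, [0.0, 1.0, 0.0])
```

The matrix M depends on the transition matrix P, which in turn depends on the policy. That is the policy dependence the abstract identifies as the limitation of the classical SR: when the policy changes, M has to be relearned.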

[LG-24] Adaptive Structured Pruning of Convolutional Neural Networks for Time Series Classification

Link: https://arxiv.org/abs/2602.12744
Authors: Javidan Abdullayev,Maxime Devanne,Cyril Meyer,Ali Ismail-Fawaz,Jonathan Weber,Germain Forestier
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 16 figures. Accepted at ICAART 2026

Abstract:Deep learning models for Time Series Classification (TSC) have achieved strong predictive performance but their high computational and memory requirements often limit deployment on resource-constrained devices. While structured pruning can address these issues by removing redundant filters, existing methods typically rely on manually tuned hyperparameters such as pruning ratios which limit scalability and generalization across datasets. In this work, we propose Dynamic Structured Pruning (DSP), a fully automatic, structured pruning framework for convolution-based TSC models. DSP introduces an instance-wise sparsity loss during training to induce channel-level sparsity, followed by a global activation analysis to identify and prune redundant filters without needing any predefined pruning ratio. This work tackles computational bottlenecks of deep TSC models for deployment on resource-constrained devices. We validate DSP on 128 UCR datasets using two different deep state-of-the-art architectures: LITETime and InceptionTime. Our approach achieves an average compression of 58% for LITETime and 75% for InceptionTime architectures while maintaining classification accuracy. Redundancy analyses confirm that DSP produces compact and informative representations, offering a practical path for scalable and efficient deep TSC deployment.

[LG-25] ADEPT: RL-Aligned Agentic Decoding of Emotion via Evidence Probing Tools – From Consensus Learning to Ambiguity-Driven Emotion Reasoning

Link: https://arxiv.org/abs/2602.12714
Authors: Esther Sun,Bo-Hao Su,Abinay Reddy Naini,Shinji Watanabe,Carlos Busso
Categories: Machine Learning (cs.LG)
Comments: Under Review

Abstract:Speech Large Language Models (SLLMs) enable high-level emotion reasoning but often produce ungrounded, text-biased judgments without verifiable acoustic evidence. In contrast, self-supervised speech encoders such as WavLM provide strong acoustic representations yet remain opaque discriminative models with limited interpretability. To bridge this gap, we introduce ADEPT (Agentic Decoding of Emotion via Evidence Probing Tools), a framework that reframes emotion recognition as a multi-turn inquiry process rather than a single-pass prediction. ADEPT transforms an SLLM into an agent that maintains an evolving candidate emotion set and adaptively invokes dedicated semantic and acoustic probing tools within a structured pipeline of candidate generation, evidence collection, and adjudication. Crucially, ADEPT enables a paradigm shift from consensus learning to ambiguity-driven emotion reasoning. Since human affect exhibits inherent complexity and frequent co-occurrence of emotions, we treat minority annotations as informative perceptual signals rather than discarding them as noise. Finally, we integrate Group Relative Policy Optimization (GRPO) with an Evidence Trust Gate to explicitly couple tool-usage behaviors with prediction quality and enforce evidence-grounded reasoning. Experiments show that ADEPT improves primary emotion accuracy in most settings while substantially improving minor emotion characterization, producing explanations grounded in auditable acoustic and semantic evidence.

[LG-26] Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning

Link: https://arxiv.org/abs/2602.12708
Authors: Jon Irureta,Gorka Azkune,Jon Imaz,Aizea Lojo,Javier Fernandez-Marques
Categories: Machine Learning (cs.LG)

Abstract:Vertical Federated Learning (VFL) has emerged as a critical paradigm for collaborative model training in privacy-sensitive domains such as finance and healthcare. However, most existing VFL frameworks rely on the idealized assumption of full sample alignment across participants, a premise that rarely holds in real-world scenarios. To bridge this gap, this work introduces Split-MoPE, a novel framework that integrates Split Learning with a specialized Mixture of Predefined Experts (MoPE) architecture. Unlike standard Mixture of Experts (MoE), where routing is learned dynamically, MoPE uses predefined experts to process specific data alignments, effectively maximizing data usage during both training and inference without requiring full sample overlap. By leveraging pretrained encoders for target data domains, Split-MoPE achieves state-of-the-art performance in a single communication round, significantly reducing the communication footprint compared to multi-round end-to-end training. Furthermore, unlike existing proposals that address sample misalignment, this novel architecture provides inherent robustness against malicious or noisy participants and offers per-sample interpretability by quantifying each collaborator’s contribution to each prediction. Extensive evaluations on vision (CIFAR-10/100) and tabular (Breast Cancer Wisconsin) datasets demonstrate that Split-MoPE consistently outperforms state-of-the-art systems such as LASER and Vertical SplitNN, particularly in challenging scenarios with high data missingness.

[LG-27] Physics-Informed Laplace Neural Operator for Solving Partial Differential Equations

Link: https://arxiv.org/abs/2602.12706
Authors: Heechang Kim,Qianying Cao,Hyomin Shin,Seungchul Lee,George Em Karniadakis,Minseok Choi
Categories: Machine Learning (cs.LG)
Comments: 38 pages, 19 figures

Abstract:Neural operators have emerged as fast surrogate solvers for parametric partial differential equations (PDEs). However, purely data-driven models often require extensive training data and can generalize poorly, especially in small-data regimes and under unseen (out-of-distribution) input functions that are not represented in the training data. To address these limitations, we propose the Physics-Informed Laplace Neural Operator (PILNO), which enhances the Laplace Neural Operator (LNO) by embedding governing physics into training through PDE, boundary condition, and initial condition residuals. To improve expressivity, we first introduce an Advanced LNO (ALNO) backbone that retains a pole-residue transient representation while replacing the steady-state branch with an FNO-style Fourier multiplier. To make physics-informed training both data-efficient and robust, PILNO further leverages (i) virtual inputs: an unlabeled ensemble of input functions spanning a broad spectral range that provides abundant physics-only supervision and explicitly targets out-of-distribution (OOD) regimes; and (ii) temporal-causality weighting: a time-decaying reweighting of the physics residual that prioritizes early-time dynamics and stabilizes optimization for time-dependent PDEs. Across four representative benchmarks – Burgers’ equation, Darcy flow, a reaction-diffusion system, and a forced KdV equation – PILNO consistently improves accuracy in small-data settings (e.g., N_train = 27), reduces run-to-run variability across random seeds, and achieves stronger OOD generalization than purely data-driven baselines.

[LG-28] QTabGAN: A Hybrid Quantum-Classical GAN for Tabular Data Synthesis

Link: https://arxiv.org/abs/2602.12704
Authors: Subhangi Kumari,Rakesh Achutha,Vignesh Sivaraman
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 21 pages

Abstract:Synthesizing realistic tabular data is challenging due to heterogeneous feature types and high dimensionality. We introduce QTabGAN, a hybrid quantum-classical generative adversarial framework for tabular data synthesis. QTabGAN is especially designed for settings where real data are scarce or restricted by privacy constraints. The model exploits the expressive power of quantum circuits to learn complex data distributions, which are then mapped to tabular features using classical neural networks. We evaluate QTabGAN on multiple classification and regression datasets and benchmark it against leading state-of-the-art generative models. Experiments show that QTabGAN achieves up to 54.07% improvement across various classification datasets and evaluation metrics, thus establishing a scalable quantum approach to tabular data synthesis and highlighting its potential for quantum-assisted generative modelling.

[LG-29] SWING: Unlocking Implicit Graph Representations for Graph Random Features

Link: https://arxiv.org/abs/2602.12703
Authors: Alessandro Manenti,Avinava Dubey,Arijit Sehanobish,Cesare Alippi,Krzysztof Choromanski
Categories: Machine Learning (cs.LG)

Abstract:We propose SWING: Space Walks for Implicit Network Graphs, a new class of algorithms for computations involving Graph Random Features on graphs given by implicit representations (i-graphs), where edge-weights are defined as bi-variate functions of feature vectors in the corresponding nodes. Those classes of graphs include several prominent examples, such as $\epsilon$-neighborhood graphs, used on a regular basis in machine learning. Rather than conducting walks on graphs’ nodes, those methods rely on walks in continuous spaces, in which those graphs are embedded. To accurately and efficiently approximate original combinatorial calculations, SWING applies a customized Gumbel-softmax sampling mechanism with linearized kernels, obtained via random features coupled with importance sampling techniques. This algorithm is of independent interest. SWING relies on the deep connection between implicitly defined graphs and Fourier analysis, presented in this paper. SWING is accelerator-friendly and does not require input graph materialization. We provide detailed analysis of SWING and complement it with thorough experiments on different classes of i-graphs.

[LG-30] Leverage-Weighted Conformal Prediction

Link: https://arxiv.org/abs/2602.12693
Authors: Shreyas Fadnavis
Categories: Machine Learning (cs.LG)

Abstract:Split conformal prediction provides distribution-free prediction intervals with finite-sample marginal coverage, but produces constant-width intervals that overcover in low-variance regions and undercover in high-variance regions. Existing adaptive methods require training auxiliary models. We propose Leverage-Weighted Conformal Prediction (LWCP), which weights nonconformity scores by a function of the statistical leverage – the diagonal of the hat matrix – deriving adaptivity from the geometry of the design matrix rather than from auxiliary model fitting. We prove that LWCP preserves finite-sample marginal validity for any weight function; achieves asymptotically optimal conditional coverage at essentially no width cost when heteroscedasticity factors through leverage; and recovers the form and width of classical prediction intervals under Gaussian assumptions while retaining distribution-free guarantees. We further establish that randomized leverage approximations preserve coverage exactly with controlled width perturbation, and that vanilla CP suffers a persistent, sample-size-independent conditional coverage gap that LWCP eliminates. The method requires no hyperparameters beyond the choice of weight function and adds negligible computational overhead to vanilla CP. Experiments on synthetic and real data confirm the theoretical predictions, demonstrating substantial reductions in conditional coverage disparity across settings.
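The idea of scaling nonconformity scores by leverage can be sketched for simple linear regression, where the leverage of a point x has the closed form 1/n + (x - x_bar)^2 / S_xx. The weight function sqrt(1 + h) used below is an assumption picked because it mirrors the classical Gaussian prediction-interval form; the abstract says only that the weight is "a function of the statistical leverage". The data and function names are likewise hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar, x_bar, sxx, n

def leverage(x, x_bar, sxx, n):
    """Leverage of x with respect to the training design (simple regression)."""
    return 1.0 / n + (x - x_bar) ** 2 / sxx

def weight(h):
    """Assumed weight function; the paper's choice may differ."""
    return math.sqrt(1.0 + h)

def lwcp_interval(x_new, train, calib, alpha):
    """Split conformal interval whose width scales with the leverage of x_new."""
    slope, intercept, x_bar, sxx, n = fit_line(*train)

    def predict(x):
        return intercept + slope * x

    # Leverage-weighted absolute residuals on the held-out calibration split.
    scores = sorted(
        abs(y - predict(x)) / weight(leverage(x, x_bar, sxx, n))
        for x, y in zip(*calib)
    )
    k = math.ceil((len(scores) + 1) * (1 - alpha))
    qhat = scores[k - 1] if k <= len(scores) else float("inf")
    half_width = qhat * weight(leverage(x_new, x_bar, sxx, n))
    return predict(x_new) - half_width, predict(x_new) + half_width

# Hypothetical training and calibration splits.
train = ([0.0, 1.0, 2.0, 3.0, 4.0], [0.1, 0.9, 2.1, 2.9, 4.0])
calib = ([0.5, 1.5, 2.5, 3.5], [0.7, 1.4, 2.6, 3.3])

narrow = lwcp_interval(2.0, train, calib, alpha=0.25)   # at the design centre
wide = lwcp_interval(10.0, train, calib, alpha=0.25)    # far from the design
```

With a constant weight function this reduces to ordinary split conformal prediction with fixed-width intervals. With the leverage-based weight, the interval widens for points far from the centre of the training design, with no auxiliary model required.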

[LG-31] Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

Link: https://arxiv.org/abs/2602.12684
Authors: Rui Cai,Jun Guo,Xinze He,Piaopiao Jin,Jie Li,Bingxuan Lin,Futeng Liu,Wei Liu,Fei Ma,Kun Ma,Feng Qiu,Heng Qu,Yifei Su,Qiao Sun,Dong Wang,Donghao Wang,Yunhong Wang,Rujie Wu,Diyun Xiang,Yu Yang,Hangjun Ye,Yuan Zhang,Quanyun Zhou
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast and smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution to address the inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively in simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 can roll out fast and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at this https URL

[LG-32] Flow Matching from Viewpoint of Proximal Operators

Link: https://arxiv.org/abs/2602.12683
Authors: Kenji Fukumizu,Wei Huang,Han Bao,Shuntuo Xu,Nisha Chandramoothy
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 38 pages, 6 figures

Abstract:We reformulate Optimal Transport Conditional Flow Matching (OT-CFM), a class of dynamical generative models, showing that it admits an exact proximal formulation via an extended Brenier potential, without assuming that the target distribution has a density. In particular, the mapping to recover the target point is exactly given by a proximal operator, which yields an explicit proximal expression of the vector field. We also discuss the convergence of minibatch OT-CFM to the population formulation as the batch size increases. Finally, using second epi-derivatives of convex potentials, we prove that, for manifold-supported targets, OT-CFM is terminally normally hyperbolic: after time rescaling, the dynamics contracts exponentially in directions normal to the data manifold while remaining neutral along tangential directions.

[LG-33] Fool Me If You Can: On the Robustness of Binary Code Similarity Detection Models against Semantics-preserving Transformations

Link: https://arxiv.org/abs/2602.12681
Authors: Jiyong Uhm,Minseok Kim,Michalis Polychronakis,Hyungjoon Koo
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 23 pages, 9 figures, 5 tables. The paper has been accepted by The ACM International Conference on the Foundations of Software Engineering (FSE 2026)

Abstract:Binary code analysis plays an essential role in cybersecurity, facilitating reverse engineering to reveal the inner workings of programs in the absence of source code. Traditional approaches, such as static and dynamic analysis, extract valuable insights from stripped binaries, but often demand substantial expertise and manual effort. Recent advances in deep learning have opened promising opportunities to enhance binary analysis by capturing latent features and disclosing underlying code semantics. Despite the growing number of binary analysis models based on machine learning, their robustness to adversarial code transformations at the binary level remains underexplored. We evaluate the robustness of deep learning models for the task of binary code similarity detection (BCSD) under semantics-preserving transformations. The unique nature of machine instructions presents distinct challenges compared to the typical input perturbations found in other domains. We introduce asmFooler, a system that evaluates the resilience of BCSD models using a diverse set of adversarial code transformations that preserve functional semantics. We construct a dataset of 9,565 binary variants from 620 baseline samples by applying eight semantics-preserving transformations across six representative BCSD models. Our major findings highlight several key insights: i) model robustness relies on the processing pipeline, including code pre-processing, architecture, and feature selection; ii) adversarial transformation effectiveness is bounded by a budget shaped by model-specific constraints like input size and instruction expressive capacity; iii) well-crafted transformations can be highly effective with minimal perturbations; and iv) such transformations efficiently disrupt model decisions (e.g., misleading to false positives or false negatives) by focusing on semantically significant instructions.

[LG-34] Uncovering spatial tissue domains and cell types in spatial omics through cross-scale profiling of cellular and genomic interactions

Link: https://arxiv.org/abs/2602.12651
Authors: Rui Yan,Xiaohan Xing,Xun Wang,Zixia Zhou,Md Tauhidul Islam,Lei Xing
Categories: Machine Learning (cs.LG)

Abstract:Cellular identity and function are linked to both their intrinsic genomic makeup and extrinsic spatial context within the tissue microenvironment. Spatial transcriptomics (ST) offers an unprecedented opportunity to study this, providing in situ gene expression profiles at single-cell resolution and illuminating the spatial and functional organization of cells within tissues. However, a significant hurdle remains: ST data is inherently noisy, large, and structurally complex. This complexity makes it intractable for existing computational methods to effectively capture the interplay between spatial interactions and intrinsic genomic relationships, thus limiting our ability to discern critical biological patterns. Here, we present CellScape, a deep learning framework designed to overcome these limitations for high-performance ST data analysis and pattern discovery. CellScape jointly models cellular interactions in tissue space and genomic relationships among cells, producing comprehensive representations that seamlessly integrate spatial signals with underlying gene regulatory mechanisms. This technique uncovers biologically informative patterns that improve spatial domain segmentation and supports comprehensive spatial cellular analyses across diverse transcriptomics datasets, offering an accurate and versatile framework for deep analysis and interpretation of ST data.

[LG-35] Dual-Granularity Contrastive Reward via Generated Episodic Guidance for Efficient Embodied RL

Link: https://arxiv.org/abs/2602.12636
Authors: Xin Liu,Yixuan Li,Yuhui Chen,Yuxing Qin,Haoran Li,Dongbin Zhao
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Designing suitable rewards poses a significant challenge in reinforcement learning (RL), especially for embodied manipulation. Trajectory success rewards are suitable for human judges or model fitting, but the sparsity severely limits RL sample efficiency. While recent methods have effectively improved RL via dense rewards, they rely heavily on high-quality human-annotated data or abundant expert supervision. To tackle these issues, this paper proposes Dual-granularity contrastive reward via generated Episodic Guidance (DEG), a novel framework to seek sample-efficient dense rewards without requiring human annotations or extensive supervision. Leveraging the prior knowledge of large video generation models, DEG only needs a small number of expert videos for domain adaptation to generate dedicated task guidance for each RL episode. Then, the proposed dual-granularity reward that balances coarse-grained exploration and fine-grained matching, will guide the agent to efficiently approximate the generated guidance video sequentially in the contrastive self-supervised latent space, and finally complete the target task. Extensive experiments on 18 diverse tasks across both simulation and real-world settings show that DEG can not only serve as an efficient exploration stimulus to help the agent quickly discover sparse success rewards, but also guide effective RL and stable policy convergence independently.

[LG-36] Efficient Personalized Federated PCA with Manifold Optimization for IoT Anomaly Detection

Link: https://arxiv.org/abs/2602.12622
Authors: Xianchao Xiu,Chenyi Huang,Wei Zhang,Wanquan Liu
Categories: Machine Learning (cs.LG)

Abstract:Internet of things (IoT) networks face increasing security threats due to their distributed nature and resource constraints. Although federated learning (FL) has gained prominence as a privacy-preserving framework for distributed IoT environments, current federated principal component analysis (PCA) methods lack the integration of personalization and robustness, which are critical for effective anomaly detection. To address these limitations, we propose an efficient personalized federated PCA (FedEP) method for anomaly detection in IoT networks. The proposed model achieves personalization through introducing local representations with the $\ell_1$-norm for element-wise sparsity, while maintaining robustness via enforcing local models with the $\ell_{2,1}$-norm for row-wise sparsity. To solve this non-convex problem, we develop a manifold optimization algorithm based on the alternating direction method of multipliers (ADMM) with rigorous theoretical convergence guarantees. Experimental results confirm that the proposed FedEP outperforms the state-of-the-art FedPG, achieving excellent F1-scores and accuracy in various IoT security scenarios. Our code will be available at this https URL.
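The two regularizers named in this abstract differ in the sparsity pattern they encourage, which is easiest to see from their proximal operators. The sketch below implements the standard definitions: entrywise soft-thresholding for the l1 norm and row-wise shrinkage for the l2,1 norm. Operators of this kind commonly appear as ADMM subproblem updates, but whether FedEP uses exactly these steps is an assumption; the abstract does not give the update rules, and the example matrix is made up.

```python
import math

def l1_norm(A):
    """Sum of absolute values of all entries."""
    return sum(abs(v) for row in A for v in row)

def l21_norm(A):
    """Sum of the Euclidean norms of the rows."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in A)

def soft_threshold(A, tau):
    """Proximal operator of tau * l1 norm: shrink each entry toward zero."""
    return [[math.copysign(max(abs(v) - tau, 0.0), v) for v in row] for row in A]

def row_shrink(A, tau):
    """Proximal operator of tau * l2,1 norm: shrink or zero out whole rows."""
    out = []
    for row in A:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(1.0 - tau / norm, 0.0) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out

# Hypothetical matrix with one large row and one small row.
A = [[3.0, -4.0], [0.5, 0.0]]

entrywise = soft_threshold(A, tau=1.0)  # zeros individual small entries
rowwise = row_shrink(A, tau=1.0)        # zeros the entire small row
```

Soft-thresholding acts on each entry independently, so it yields element-wise sparsity. Row shrinkage rescales each row by a common factor and removes any row whose norm falls below the threshold, which yields the row-wise sparsity the abstract associates with robustness.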

[LG-37] Coden: Efficient Temporal Graph Neural Networks for Continuous Prediction

Link: https://arxiv.org/abs/2602.12613
Authors: Zulun Zhu,Siqiang Luo
Categories: Machine Learning (cs.LG)

Abstract:Temporal Graph Neural Networks (TGNNs) are pivotal in processing dynamic graphs. However, existing TGNNs primarily target one-time predictions for a given temporal span, whereas many practical applications require continuous predictions, that is, predictions issued frequently over time. Directly adapting existing TGNNs to continuous-prediction scenarios introduces either significant computational overhead or prediction quality issues, especially for large graphs. This paper revisits the challenge of continuous predictions in TGNNs, and introduces Coden, a TGNN model designed for efficient and effective learning on dynamic graphs. Coden innovatively overcomes the key complexity bottleneck in existing TGNNs while preserving comparable predictive accuracy. Moreover, we further provide theoretical analyses that substantiate the effectiveness and efficiency of Coden, and clarify its duality relationship with both RNN-based and attention-based models. Our evaluations across five dynamic datasets show that Coden surpasses existing performance benchmarks in both efficiency and effectiveness, establishing it as a superior solution for continuous prediction in evolving graph environments.

[LG-38] RelBench v2: A Large-Scale Benchmark and Repository for Relational Data

Link: https://arxiv.org/abs/2602.12606
Authors: Justin Gu, Rishabh Ranjan, Charilaos Kanatsoulis, Haiming Tang, Martin Jurkovic, Valter Hudovernik, Mark Znidar, Pranshu Chaturvedi, Parth Shroff, Fengyu Li, Jure Leskovec
Categories: Machine Learning (cs.LG)
Comments: Website: this https URL

Abstract:Relational deep learning (RDL) has emerged as a powerful paradigm for learning directly on relational databases by modeling entities and their relationships across multiple interconnected tables. As this paradigm evolves toward larger models and relational foundation models, scalable and realistic benchmarks are essential for enabling systematic evaluation and progress. In this paper, we introduce RelBench v2, a major expansion of the RelBench benchmark for RDL. RelBench v2 adds four large-scale relational datasets spanning scholarly publications, enterprise resource planning, consumer platforms, and clinical records, increasing the benchmark to 11 datasets comprising over 22 million rows across 29 tables. We further introduce autocomplete tasks, a new class of predictive objectives that require models to infer missing attribute values directly within relational tables while respecting temporal constraints, expanding beyond traditional forecasting tasks constructed via SQL queries. In addition, RelBench v2 expands beyond its native datasets by integrating external benchmarks and evaluation frameworks: we translate event streams from the Temporal Graph Benchmark into relational schemas for unified relational-temporal evaluation, interface with ReDeLEx to provide uniform access to 70+ real-world databases suitable for pretraining, and incorporate 4DBInfer datasets and tasks to broaden multi-table prediction coverage. Experimental results demonstrate that RDL models consistently outperform single-table baselines across autocomplete, forecasting, and recommendation tasks, highlighting the importance of modeling relational structure explicitly.

[LG-39] Block-Sample MAC-Bayes Generalization Bounds ICLR2026

Link: https://arxiv.org/abs/2602.12605
Authors: Matthias Frey, Jingge Zhu, Michael C. Gastpar
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: Accepted for publication at The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract:We present a family of novel block-sample MAC-Bayes bounds (mean approximately correct). While PAC-Bayes bounds (probably approximately correct) typically give bounds for the generalization error that hold with high probability, MAC-Bayes bounds have a similar form but bound the expected generalization error instead. The family of bounds we propose can be understood as a generalization of an expectation version of known PAC-Bayes bounds. Compared to standard PAC-Bayes bounds, the new bounds contain divergence terms that only depend on subsets (or blocks) of the training data. The proposed MAC-Bayes bounds hold the promise of significantly improving upon the tightness of traditional PAC-Bayes and MAC-Bayes bounds. This is illustrated with a simple numerical example in which the original PAC-Bayes bound is vacuous regardless of the choice of prior, while the proposed family of bounds are finite for appropriate choices of the block size. We also explore the question whether high-probability versions of our MAC-Bayes bounds (i.e., PAC-Bayes bounds of a similar form) are possible. We answer this question in the negative with an example that shows that in general, it is not possible to establish a PAC-Bayes bound which (a) vanishes with a rate faster than $\mathcal{O}(1/\log n)$ whenever the proposed MAC-Bayes bound vanishes with rate $\mathcal{O}(n^{-1/2})$ and (b) exhibits a logarithmic dependence on the permitted error probability.

[LG-40] Vehicle behaviour estimation for abnormal event detection using distributed fiber optic sensing

Link: https://arxiv.org/abs/2602.12591
Authors: Hemant Prasad, Daisuke Ikefuji, Shin Tominaga, Hitoshi Sakurai, Manabu Otani
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The distributed fiber-optic sensing (DFOS) system is a cost-effective wide-area traffic monitoring technology that utilizes existing fiber infrastructure to effectively detect traffic congestion. However, detecting the single-lane abnormalities that lead to congestion is still a challenge. These single-lane abnormalities can be detected by monitoring the lane change behaviour of vehicles, performed to avoid congestion along the monitoring section of a road. This paper presents a method to detect single-lane abnormalities by tracking individual vehicle paths and detecting vehicle lane changes along a section of a road. We propose a method to estimate the vehicle position at all time instances and fit a path using clustering techniques. We detect vehicle lane changes by monitoring any change in the spectral centroid of vehicle vibrations by tracking a reference vehicle along a highway. The evaluation of our proposed method with real traffic data showed 80% accuracy for lane change detection events that represent the presence of abnormalities.
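
For readers unfamiliar with the spectral centroid mentioned in this abstract, the following is a minimal sketch of the usual definition: the magnitude-weighted mean frequency of a signal. The abstract does not specify the windowing, frequency band, or weighting the authors use, so the function below, the sampling rate `fs`, and the two synthetic sine waves are assumptions chosen only to make the idea concrete.

```python
import cmath
import math

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of a real signal, via a naive DFT."""
    n = len(signal)
    num = 0.0
    den = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mag = abs(coeff)
        num += (k * sample_rate / n) * mag
        den += mag
    return num / den if den > 0 else 0.0

fs = 64  # hypothetical sampling rate in Hz
low = [math.sin(2 * math.pi * 4 * t / fs) for t in range(fs)]    # 4 Hz vibration
high = [math.sin(2 * math.pi * 12 * t / fs) for t in range(fs)]  # 12 Hz vibration
print(spectral_centroid(low, fs), spectral_centroid(high, fs))
```

A shift in this value between consecutive windows is the kind of change a detector could threshold on; the naive DFT is used here only to avoid external dependencies and would be replaced by an FFT for real data.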

[LG-41] Multi-Head Attention as a Source of Catastrophic Forgetting in MoE Transformers

Link: https://arxiv.org/abs/2602.12587
Authors: Anrui Chen, Ruijun Huang, Xin Zhang, Fang Dong, Hengjie Cao, Zhendong Huang, Yifeng Yang, Mengyi Chen, Jixian Zhou, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang, Li Shang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Mixture-of-Experts (MoE) architectures are often considered a natural fit for continual learning because sparse routing should localize updates and reduce interference, yet MoE Transformers still forget substantially even with sparse, well-balanced expert utilization. We attribute this gap to a pre-routing bottleneck: multi-head attention concatenates head-specific signals into a single post-attention router input, forcing routing to act on co-occurring feature compositions rather than separable head channels. We show that this router input simultaneously encodes multiple separately decodable semantic and structural factors with uneven head support, and that different feature compositions induce weakly aligned parameter-gradient directions; as a result, routing maps many distinct compositions to the same route. We quantify this collision effect via a route-wise effective composition number N_eff and find that higher N_eff is associated with larger old-task loss increases after continual training. Motivated by these findings, we propose MH-MoE, which performs head-wise routing over sub-representations to increase routing granularity and reduce composition collisions. On TRACE with Qwen3-0.6B/8B, MH-MoE effectively mitigates forgetting, reducing BWT on Qwen3-0.6B from 11.2% (LoRAMoE) to 4.5%.

[LG-42] Fractional Order Federated Learning for Battery Electric Vehicle Energy Consumption Modeling

Link: https://arxiv.org/abs/2602.12567
Authors: Mohammad Partohaghighi, Roummel Marcia, Bruce J. West, YangQuan Chen
Categories: Machine Learning (cs.LG)
Comments: This manuscript is under review in IEEE Transactions on Transportation Electrification

Abstract:Federated learning on connected electric vehicles (BEVs) faces severe instability due to intermittent connectivity, time-varying client participation, and pronounced client-to-client variation induced by diverse operating conditions. Conventional FedAvg and many advanced methods can suffer from excessive drift and degraded convergence under these realistic constraints. This work introduces Fractional-Order Roughness-Informed Federated Averaging (FO-RI-FedAvg), a lightweight and modular extension of FedAvg that improves stability through two complementary client-side mechanisms: (i) adaptive roughness-informed proximal regularization, which dynamically tunes the pull toward the global model based on local loss-landscape roughness, and (ii) non-integer-order local optimization, which incorporates short-term memory to smooth conflicting update directions. The approach preserves standard FedAvg server aggregation, adds only element-wise operations with amortizable overhead, and allows independent toggling of each component. Experiments on two real-world BEV energy prediction datasets, VED and its extended version eVED, show that FO-RI-FedAvg achieves improved accuracy and more stable convergence compared to strong federated baselines, particularly under reduced client participation.
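
The proximal pull toward the global model described in this abstract can be sketched in a few lines. This is a generic FedProx-style local step with a fixed coefficient `mu` followed by standard size-weighted FedAvg aggregation. The roughness-informed adaptation of `mu` and the fractional-order optimizer are not specified in the abstract and are not reproduced here; all names and numbers are hypothetical.

```python
def local_prox_step(w, w_global, grad, lr, mu):
    """One local gradient step on loss(w) + (mu / 2) * ||w - w_global||^2."""
    return [wi - lr * (gi + mu * (wi - wgi))
            for wi, wgi, gi in zip(w, w_global, grad)]

def fedavg(client_weights, client_sizes):
    """Server aggregation: average client models weighted by local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(n * w[d] for w, n in zip(client_weights, client_sizes)) / total
            for d in range(dim)]

# With a zero loss gradient, the proximal term alone pulls w toward w_global.
print(local_prox_step([1.0], [0.0], [0.0], lr=0.1, mu=1.0))
# Two hypothetical clients holding 1 and 3 samples.
print(fedavg([[1.0], [3.0]], [1, 3]))
```

A larger `mu` keeps clients closer to the global model and limits drift, at the cost of slower local adaptation, which is the trade-off an adaptive scheme would tune.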

[LG-43] AMPS: Adaptive Modality Preference Steering via Functional Entropy

Link: https://arxiv.org/abs/2602.12533
Authors: Zihan Huang, Xintong Li, Rohan Surana, Tong Yu, Rui Wang, Julian McAuley, Jingbo Shang, Junda Wu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) often exhibit significant modality preference, which is a tendency to favor one modality over another. Depending on the input, they may over-rely on linguistic priors relative to visual evidence, or conversely over-attend to visually salient but facts in textual contexts. Prior work has applied a uniform steering intensity to adjust the modality preference of MLLMs. However, strong steering can impair standard inference and increase error rates, whereas weak steering is often ineffective. In addition, because steering sensitivity varies substantially across multimodal instances, a single global strength is difficult to calibrate. To address this limitation with minimal disruption to inference, we introduce an instance-aware diagnostic metric that quantifies each modality’s information contribution and reveals sample-specific susceptibility to steering. Building on these insights, we propose a scaling strategy that reduces steering for sensitive samples and a learnable module that infers scaling patterns, enabling instance-aware control of modality preference. Experimental results show that our instance-aware steering outperforms conventional steering in modulating modality preference, achieving effective adjustment while keeping generation error rates low.

[LG-44] Analytical Results for Two Exponential Family Distributions in Hierarchical Dirichlet Processes

Link: https://arxiv.org/abs/2602.12527
Authors: Naiqi Li
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The Hierarchical Dirichlet Process (HDP) provides a flexible Bayesian nonparametric framework for modeling grouped data with a shared yet unbounded collection of mixture components. While existing applications of the HDP predominantly focus on the Dirichlet-multinomial conjugate structure, the framework itself is considerably more general and, in principle, accommodates a broad class of conjugate prior-likelihood pairs. In particular, exponential family distributions offer a unified and analytically tractable modeling paradigm that encompasses many commonly used distributions. In this paper, we investigate analytic results for two important members of the exponential family within the HDP framework: the Poisson distribution and the normal distribution. We derive explicit closed-form expressions for the corresponding Gamma-Poisson and Normal-Gamma-Normal conjugate pairs under the hierarchical Dirichlet process construction. Detailed derivations and proofs are provided to clarify the underlying mathematical structure and to demonstrate how conjugacy can be systematically exploited in hierarchical nonparametric models. Our work extends the applicability of the HDP beyond the Dirichlet-multinomial setting and furnishes practical analytic results for researchers employing hierarchical Bayesian nonparametrics.
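
As background for the Gamma-Poisson pair named in this abstract, here is a minimal sketch of the standard conjugate update for a single Poisson rate with a Gamma(shape, rate) prior, together with the resulting negative-binomial posterior predictive. It covers only the single-component building block, not the paper's HDP derivation; the prior parameters and counts are made up.

```python
import math

def gamma_poisson_posterior(alpha, beta, counts):
    """Posterior Gamma(shape, rate) for a Poisson rate after observing counts."""
    return alpha + sum(counts), beta + len(counts)

def posterior_predictive_pmf(k, alpha, beta):
    """Negative-binomial predictive P(x = k) under a Gamma(alpha, beta) rate."""
    log_p = (math.lgamma(alpha + k) - math.lgamma(alpha) - math.lgamma(k + 1)
             + alpha * math.log(beta / (beta + 1.0))
             - k * math.log(beta + 1.0))
    return math.exp(log_p)

# Hypothetical prior Gamma(2, 1) and three observed counts.
a_post, b_post = gamma_poisson_posterior(2.0, 1.0, [3, 5, 4])
print(a_post, b_post, a_post / b_post)  # posterior shape, rate, and mean
print(posterior_predictive_pmf(3, a_post, b_post))
```

Because the predictive is available in closed form, a sampler can score how well a new observation fits an existing component without numerical integration, which is what makes conjugate pairs attractive in this setting.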

[LG-45] On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

Link: https://arxiv.org/abs/2602.12506
Authors: Rosie Zhao, Anshul Shah, Xiaoyu Zhu, Xinke Deng, Zhongyu Jiang, Yang Yang, Joerg Liebelt, Arnab Mondal
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning (RL) fine-tuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations–misleading captions or incorrect chain-of-thought (CoT) traces–cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. Entropy-based metrics further show that these perturbations reshape model uncertainty and probability mass on the correct option, exposing model-specific trends in miscalibration. To better understand these vulnerabilities, we further analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off: fine-tuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.

[LG-46] A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models

Link: https://arxiv.org/abs/2602.12499
Authors: Mugunthan Shandirasegaran, Hongkang Li, Songyang Zhang, Meng Wang, Shuai Zhang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The recent empirical success of Mamba and other selective state space models (SSMs) has renewed interest in non-attention architectures for sequence modeling, yet their theoretical foundations remain underexplored. We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block: a single-layer, single-head selective SSM with input-dependent gating, followed by a two-layer MLP trained via gradient descent (GD). Our study adopts a structured data model with tokens that include both class-relevant and class-irrelevant patterns under token-level noise and examines two canonical regimes: majority-voting and locality-structured data sequences. We prove that the model achieves guaranteed generalization by establishing non-asymptotic sample complexity and convergence rate bounds, which improve as the effective signal increases and the noise decreases. Furthermore, we show that the gating vector aligns with class-relevant features while ignoring irrelevant ones, thereby formalizing a feature-selection role similar to attention but realized through selective recurrence. Numerical experiments on synthetic data justify our theoretical results. Overall, our results provide principled insight into when and why Mamba-style selective SSMs learn efficiently, offering a theoretical counterpoint to Transformer-centric explanations.

[LG-47] Composable Model-Free RL for Navigation with Input-Affine Systems

Link: https://arxiv.org/abs/2602.12492
Authors: Xinhuan Sang, Abdelrahman Abdelgawad, Roberto Tron
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 17 pages, 8 figures. Submitted to WAFR 2026 (under review)

Abstract:As autonomous robots move into complex, dynamic real-world environments, they must learn to navigate safely in real time, yet anticipating all possible behaviors is infeasible. We propose a composable, model-free reinforcement learning method that learns a value function and an optimal policy for each individual environment element (e.g., goal or obstacle) and composes them online to achieve goal reaching and collision avoidance. Assuming unknown nonlinear dynamics that evolve in continuous time and are input-affine, we derive a continuous-time Hamilton-Jacobi-Bellman (HJB) equation for the value function and show that the corresponding advantage function is quadratic in the action and optimal policy. Based on this structure, we introduce a model-free actor-critic algorithm that learns policies and value functions for static or moving obstacles using gradient descent. We then compose multiple reach/avoid models via a quadratically constrained quadratic program (QCQP), yielding formal obstacle-avoidance guarantees in terms of value-function level sets, providing a model-free alternative to CLF/CBF-based controllers. Simulations demonstrate improved performance over a PPO baseline applied to a discrete-time approximation.

[LG-48] Gradient-Enhanced Partitioned Gaussian Processes for Real-Time Quadrotor Dynamics Modeling

Link: https://arxiv.org/abs/2602.12487
Authors: Xinhuan Sang, Adam Rozman, Sheryl Grace, Roberto Tron
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures. Submitted to IEEE Transactions on Robotics (under review)

Abstract:We present a quadrotor dynamics Gaussian Process (GP) with gradient information that achieves real-time inference via state-space partitioning and approximation, and that includes aerodynamic effects using data from mid-fidelity potential flow simulations. While traditional GP-based approaches provide reliable Bayesian predictions with uncertainty quantification, they are computationally expensive and thus unsuitable for real-time simulations. To address this challenge, we integrate gradient information to improve accuracy and introduce a novel partitioning and approximation strategy to reduce online computational cost. In particular, for the latter, we associate a local GP with each non-overlapping region; by splitting the training data into local near and far subsets, and by using Schur complements, we show that a large part of the matrix inversions required for inference can be performed offline, enabling real-time inference at frequencies above 30 Hz on standard desktop hardware. To generate a training dataset that captures aerodynamic effects, such as rotor-rotor interactions and apparent wind direction, we use the CHARM code, which is a mid-fidelity aerodynamic solver. It is applied to the SUI Endurance quadrotor to predict force and torque, along with noise at three specified locations. The derivative information is obtained via finite differences. Experimental results demonstrate that the proposed partitioned GP with gradient conditioning achieves higher accuracy than standard partitioned GPs without gradient information, while greatly reducing computational time. This framework provides an efficient foundation for real-time aerodynamic prediction and control algorithms in complex and unsteady environments.

[LG-49] Geometric separation and constructive universal approximation with two hidden layers

Link: https://arxiv.org/abs/2602.12482
Authors: Chanyoung Sung
Categories: Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA)
Comments:

Abstract:We give a geometric construction of neural networks that separate disjoint compact subsets of $\mathbb{R}^n$, and use it to obtain a constructive universal approximation theorem. Specifically, we show that networks with two hidden layers and either a sigmoidal activation (i.e., strictly monotone bounded continuous) or the ReLU activation can approximate any real-valued continuous function on an arbitrary compact set $K \subset \mathbb{R}^n$ to any prescribed accuracy in the uniform norm. For finite $K$, the construction simplifies and yields a sharp depth-2 (single hidden layer) approximation result.

[LG-50] Tight Bounds for Logistic Regression with Large Stepsize Gradient Descent in Low Dimension

Link: https://arxiv.org/abs/2602.12471
Authors: Michael Crawshaw, Mingrui Liu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We consider the optimization problem of minimizing the logistic loss with gradient descent to train a linear model for binary classification with separable data. With a budget of $T$ iterations, it was recently shown that an accelerated $1/T^2$ rate is possible by choosing a large step size $\eta = \Theta(\gamma^2 T)$ (where $\gamma$ is the dataset’s margin) despite the resulting non-monotonicity of the loss. In this paper, we provide a tighter analysis of gradient descent for this problem when the data is two-dimensional: we show that GD with a sufficiently large learning rate $\eta$ finds a point with loss smaller than $\mathcal{O}(1/(\eta T))$, as long as $T \geq \Omega(n/\gamma + 1/\gamma^2)$, where $n$ is the dataset size. Our improved rate comes from a tighter bound on the time $\tau$ that it takes for GD to transition from unstable (non-monotonic loss) to stable (monotonic loss), via a fine-grained analysis of the oscillatory dynamics of GD in the subspace orthogonal to the max-margin classifier. We also provide a lower bound of $\tau$ matching our upper bound up to logarithmic factors, showing that our analysis is tight.

[LG-51] Regularized Meta-Learning for Improved Generalization

Link: https://arxiv.org/abs/2602.12469
Authors: Noor Islam S. Mohammad, Md Muntaqim Meherab
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Deep ensemble methods often improve predictive performance, yet they suffer from three practical limitations: redundancy among base models that inflates computational cost and degrades conditioning, unstable weighting under multicollinearity, and overfitting in meta-learning pipelines. We propose a regularized meta-learning framework that addresses these challenges through a four-stage pipeline combining redundancy-aware projection, statistical meta-feature augmentation, and cross-validated regularized meta-models (Ridge, Lasso, and ElasticNet). Our multi-metric de-duplication strategy removes near-collinear predictors using correlation and MSE thresholds ($\tau_{\text{corr}} = 0.95$), reducing the effective condition number of the meta-design matrix while preserving predictive diversity. Engineered ensemble statistics and interaction terms recover higher-order structure unavailable to raw prediction columns. A final inverse-RMSE blending stage mitigates regularizer-selection variance. On the Playground Series S6E1 benchmark (100K samples, 72 base models), the proposed framework achieves an out-of-fold RMSE of 8.582, improving over simple averaging (8.894) and conventional Ridge stacking (8.627), while matching greedy hill climbing (8.603) with substantially lower runtime (4 times faster). Conditioning analysis shows a 53.7% reduction in effective matrix condition number after redundancy projection. Comprehensive ablations demonstrate consistent contributions from de-duplication, statistical meta-features, and meta-ensemble blending. These results position regularized meta-learning as a stable and deployment-efficient stacking strategy for high-dimensional ensemble systems.
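
Two of the stages named in this abstract, correlation-based de-duplication and inverse-RMSE blending, are simple enough to sketch. The code below is an illustration under assumptions, not the authors' pipeline: it uses only a correlation threshold (the abstract also mentions an MSE threshold whose value is not given), a greedy keep-first rule, and tiny made-up prediction vectors.

```python
import math

def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

def pearson(a, b):
    # Assumes neither vector is constant (otherwise the denominator is zero).
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def deduplicate(preds, threshold=0.95):
    """Greedily keep a model only if it is not near-collinear with any kept model."""
    kept = []
    for i, p in enumerate(preds):
        if all(abs(pearson(p, preds[j])) < threshold for j in kept):
            kept.append(i)
    return kept

def inverse_rmse_blend(preds, target):
    """Blend predictions with weights proportional to 1 / RMSE (RMSE must be > 0)."""
    raw = [1.0 / rmse(p, target) for p in preds]
    total = sum(raw)
    weights = [w / total for w in raw]
    blended = [sum(w * p[i] for w, p in zip(weights, preds))
               for i in range(len(target))]
    return weights, blended

target = [1.0, 2.0, 3.0, 4.0]
kept = deduplicate([[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 3.1, 4.1], [4.0, 1.0, 3.0, 2.0]])
weights, blended = inverse_rmse_blend([[1.5, 2.5, 3.5, 4.5], [0.0, 1.0, 2.0, 3.0]], target)
print(kept, weights, blended)
```

In the blending example the two inputs have RMSE 0.5 and 1.0, so the more accurate one receives twice the weight; their opposite biases happen to cancel, which is why the blend recovers the target in this toy case.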

[LG-52] Continuous Diffusion Models Can Obey Formal Syntax

Link: https://arxiv.org/abs/2602.12468
Authors: Jinwoo Kim, Taylor Berg-Kirkpatrick, Loris D’Antoni
Categories: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL)
Comments:

Abstract:Diffusion language models offer a promising alternative to autoregressive models due to their global, non-causal generation process, but their continuous latent dynamics make discrete constraints – e.g., the output should be a JSON file that matches a given schema – difficult to impose. We introduce a training-free guidance method for steering continuous diffusion language models to satisfy formal syntactic constraints expressed using regular expressions. Our approach constructs an analytic score estimating the probability that a latent state decodes to a valid string accepted by a given regular expression, and uses its gradient to guide sampling, without training auxiliary classifiers. The denoising process targets the base model conditioned on syntactic validity. We implement our method in Diffinity on top of the PLAID diffusion model and evaluate it on 180 regular-expression constraints over JSON and natural-language benchmarks. Diffinity achieves 68-96% constraint satisfaction while incurring only a small perplexity cost relative to unconstrained sampling, outperforming autoregressive constrained decoding in both constraint satisfaction and output quality.
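
The constraint-satisfaction figure quoted in this abstract is, in the usual sense, the fraction of generated strings accepted by the target regular expression. The sketch below shows only that evaluation side, using Python's `re.fullmatch`; it does not implement the guidance method, and the pattern and sample outputs are invented for illustration.

```python
import re

def constraint_satisfaction(samples, pattern):
    """Fraction of samples that the regular expression accepts in full."""
    compiled = re.compile(pattern)
    hits = sum(1 for s in samples if compiled.fullmatch(s) is not None)
    return hits / len(samples)

# Hypothetical constraint: a tiny JSON object with an integer id and a boolean flag.
pattern = r'\{"id": \d+, "ok": (true|false)\}'
samples = [
    '{"id": 7, "ok": true}',
    '{"id": 12, "ok": false}',
    '{"id": x, "ok": true}',    # non-numeric id
    '{"id": 3, "ok": maybe}',   # invalid boolean
]
rate = constraint_satisfaction(samples, pattern)
print(rate)
```

`fullmatch` is used rather than `search` because a constraint of this kind applies to the whole output string, not to a substring of it.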

[LG-53] Computationally sufficient statistics for Ising models

Link: https://arxiv.org/abs/2602.12449
Authors: Abhijith Jayakumar, Shreya Shukla, Marc Vuffray, Andrey Y. Lokhov, Sidhant Misra
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (stat.ML)
Comments:

Abstract:Learning Gibbs distributions using only sufficient statistics has long been recognized as a computationally hard problem. On the other hand, computationally efficient algorithms for learning Gibbs distributions rely on access to full sample configurations generated from the model. For many systems of interest that arise in physical contexts, expecting a full sample to be observed is not practical, and hence it is important to look for computationally efficient methods that solve the learning problem with access to only a limited set of statistics. We examine the trade-offs between the power of computation and observation within this scenario, employing the Ising model as a paradigmatic example. We demonstrate that it is feasible to reconstruct the model parameters for a model with $\ell_1$ width $\gamma$ by observing statistics up to an order of $O(\gamma)$. This approach allows us to infer the model’s structure and also learn its couplings and magnetic fields. We also discuss a setting where prior information about structure of the model is available and show that the learning problem can be solved efficiently with even more limited observational power.

[LG-54] Stabilizing Native Low-Rank LLM Pretraining

Link: https://arxiv.org/abs/2602.12429
Authors: Paul Janson, Edouard Oyallon, Eugene Belilovsky
Categories: Machine Learning (cs.LG)
Comments: Preprint

Abstract:Foundation models have achieved remarkable success, yet their growing parameter counts pose significant computational and memory challenges. Low-rank factorization offers a promising route to reduce training and inference costs, but the community lacks a stable recipe for training models from scratch using exclusively low-rank weights while matching the performance of the dense model. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights for all non-embedding matrices without auxiliary “full-rank” guidance required by prior methods. While native low-rank training often suffers from instability and loss spikes, we identify uncontrolled growth in the spectral norm (largest singular value) of the weight matrix update as the dominant factor. To address this, we introduce Spectron: Spectral renormalization with orthogonalization, which dynamically bounds the resultant weight updates based on the current spectral norms of the factors. Our method enables stable, end-to-end factorized training with negligible overhead. Finally, we establish compute-optimal scaling laws for natively low-rank transformers, demonstrating predictable power-law behavior and improved inference efficiency relative to dense models.
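
Two quantities from this abstract can be made concrete with a small sketch: the parameter saving from factorizing an m-by-n weight as a product of rank-r factors, and the spectral norm (largest singular value) of that product. This is not Spectron. The power-iteration routine, the tiny matrices, and the layer sizes are assumptions for illustration, and the check at the end is just the standard inequality that the spectral norm of a product is at most the product of the spectral norms.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def spectral_norm(M, iters=500):
    """Largest singular value via power iteration on M^T M.

    Starts from the all-ones vector, so it assumes that vector is not
    orthogonal to the top right singular vector.
    """
    Mt = [list(col) for col in zip(*M)]
    G = matmul(Mt, M)
    v = [1.0] * len(G)
    for _ in range(iters):
        w = [sum(g * x for g, x in zip(row, v)) for row in G]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    Mv = [sum(m * x for m, x in zip(row, v)) for row in M]
    return math.sqrt(sum(x * x for x in Mv))

def param_counts(m, n, r):
    """Parameters of a dense m x n matrix versus rank-r factors (m x r and r x n)."""
    return m * n, r * (m + n)

A = [[1.0], [2.0], [2.0]]  # 3 x 1 factor, spectral norm 3
B = [[3.0, 4.0]]           # 1 x 2 factor, spectral norm 5
W = matmul(A, B)           # rank-1 product, spectral norm 15
print(spectral_norm(A), spectral_norm(B), spectral_norm(W))
print(param_counts(4096, 4096, 256))
```

The inequality explains why the norms of the individual factors are a natural handle for bounding the size of the combined weight, although how the paper uses them to bound updates is not detailed in the abstract.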

[LG-55] Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning

Link: https://arxiv.org/abs/2602.12405
Authors: Carl Qi, Xiaojie Wang, Silong Yong, Stephen Sheng, Huitan Mao, Sriram Srinivasan, Manikantan Nambi, Amy Zhang, Yesh Dattatreya
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Reasoning about failures is crucial for building reliable and trustworthy robotic systems. Prior approaches either treat failure reasoning as a closed-set classification problem or assume access to ample human annotations. Failures in the real world are typically subtle, combinatorial, and difficult to enumerate, whereas rich reasoning labels are expensive to acquire. We address this problem by introducing ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. We formulate detection and reasoning as a multi-task self-refinement process, where the model iteratively predicts detection outcomes and natural language reasoning conditioned on past outputs. During training, ARMOR learns from heterogeneous supervision - large-scale sparse binary labels and small-scale rich reasoning annotations - optimized via a combination of offline and online imitation learning. At inference time, ARMOR generates multiple refinement trajectories and selects the most confident prediction via a self-certainty metric. Experiments across diverse environments show that ARMOR achieves state-of-the-art performance by improving over the previous approaches by up to 30% on failure detection rate and up to 100% in reasoning measured through LLM fuzzy match score, demonstrating robustness to heterogeneous supervision and open-ended reasoning beyond predefined failure modes. We provide additional visualizations on our website: this https URL

[LG-56] Synthetic Interaction Data for Scalable Personalization in Large Language Models

Link: https://arxiv.org/abs/2602.12394
Authors: Yuchen Ma, Yue Huang, Wenjie Wang, Xiaonan Luo, Xiangliang Zhang, Stefan Feuerriegel
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Personalized prompting offers large opportunities for deploying large language models (LLMs) to diverse users, yet existing prompt optimization methods primarily focus on task-level optimization while largely overlooking user-specific preferences and latent constraints of individual users. This gap is primarily due to (i) the absence of high-quality, privacy-sensitive data that capture personalized user-LLM interactions at scale, and (ii) the lack of robust reward signals for individual preferences. To overcome existing data limitations, we introduce a high-fidelity synthetic data generation framework called PersonaGym. Unlike prior work that treats personalization as static persona-preference pairs, PersonaGym models a dynamic preference process via an agentic LLM system to simulate realistic preference behaviors and semantic-aware noise in order to generate personalized multi-turn interaction trajectories. Using PersonaGym, we release PersonaAtlas, a large-scale, high-quality, and diverse synthetic dataset of high-fidelity multi-turn personalized interaction trajectories that closely mirror real-world preference expression and noise patterns. We further propose Personalized Prompt Optimization (PPOpt), a scalable and model-agnostic framework that optimizes user prompts based on interaction histories without modifying the deployed LLM. PPOpt adopts a reason-then-optimize paradigm that infers an explicit user profile and conditions prompt rewriting on the user profile to avoid reward hacking. Our training procedure for PPOpt integrates a cold-start supervised prior with outcome-driven multi-objective reinforcement learning. We present extensive experiments to demonstrate consistent improvements over state-of-the-art baselines in terms of task performance, personalization quality, and robustness to noisy as well as to sparse preference signals.

[LG-57] High-dimensional Level Set Estimation with Trust Regions and Double Acquisition Functions

Link: https://arxiv.org/abs/2602.12391
Authors: Giang Ngo, Dat Phan Trong, Dang Nguyen, Sunil Gupta
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Level set estimation (LSE) classifies whether an unknown function’s value exceeds a specified threshold for given inputs, a fundamental problem in many real-world applications. In active learning settings with limited initial data, we aim to iteratively acquire informative points to construct an accurate classifier for this task. In high-dimensional spaces, this becomes challenging where the search volume grows exponentially with increasing dimensionality. We propose TRLSE, an algorithm for high-dimensional LSE, which identifies and refines regions near the threshold boundary with dual acquisition functions operating at both global and local levels. We provide a theoretical analysis of TRLSE’s accuracy and show its superior sample efficiency against existing methods through extensive evaluations on multiple synthetic and real-world LSE problems.

[LG-58] Deep Doubly Debiased Longitudinal Effect Estimation with ICE G-Computation

Link: https://arxiv.org/abs/2602.12379
Authors: Wenxin Chen, Weishen Pan, Kyra Gan, Fei Wang
Categories: Machine Learning (cs.LG)

Abstract:Estimating longitudinal treatment effects is essential for sequential decision-making but is challenging due to treatment-confounder feedback. While Iterative Conditional Expectation (ICE) G-computation offers a principled approach, its recursive structure suffers from error propagation, corrupting the learned outcome regression models. We propose D3-Net, a framework that mitigates error propagation in ICE training and then applies a robust final correction. First, to interrupt error propagation during learning, we train the ICE sequence using Sequential Doubly Robust (SDR) pseudo-outcomes, which provide bias-corrected targets for each regression. Second, we employ a multi-task Transformer with a covariate simulator head for auxiliary supervision, regularizing representations against corruption by noisy pseudo-outcomes, and a target network to stabilize training dynamics. For the final estimate, we discard the SDR correction and instead use the uncorrected nuisance models to perform Longitudinal Targeted Minimum Loss-Based Estimation (LTMLE) on the original outcomes. This second-stage, targeted debiasing ensures robustness and optimal finite-sample properties. Comprehensive experiments demonstrate that our model, D3-Net, robustly reduces bias and variance across different horizons, counterfactuals, and time-varying confoundings, compared to existing state-of-the-art ICE-based estimators.

[LG-59] A Machine Learning Approach to the Nirenberg Problem

Link: https://arxiv.org/abs/2602.12368
Authors: Gianfranco Cortés, Maria Esteban-Casadevall, Yueqing Feng, Jonas Henkel, Edward Hirst, Tancredi Schettini Gherardini, Alexander G. Stapleton
Categories: Machine Learning (cs.LG); High Energy Physics - Theory (hep-th); Analysis of PDEs (math.AP); Differential Geometry (math.DG)
Comments: 38 pages, 14 pages, 7 tables

Abstract:This work introduces the Nirenberg Neural Network: a numerical approach to the Nirenberg problem of prescribing Gaussian curvature on $S^2$ for metrics that are pointwise conformal to the round metric. Our mesh-free physics-informed neural network (PINN) approach directly parametrises the conformal factor globally and is trained with a geometry-aware loss enforcing the curvature equation. Additional consistency checks were performed via the Gauss-Bonnet theorem, and spherical-harmonic expansions were fit to the learnt models to provide interpretability. For prescribed curvatures with known realisability, the neural network achieves very low losses ($10^{-7}$ to $10^{-10}$), while unrealisable curvatures yield significantly higher losses. This distinction enables the assessment of unknown cases, separating likely realisable functions from non-realisable ones. The current capabilities of the Nirenberg Neural Network demonstrate that neural solvers can serve as exploratory tools in geometric analysis, offering a quantitative computational perspective on longstanding existence questions.
Report number: MPIM-Bonn-2026

[LG-60] Variational Green’s Functions for Volumetric PDEs

Link: https://arxiv.org/abs/2602.12349
Authors: Joao Teixeira, Eitan Grinspun, Otman Benchekroun
Categories: Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Green’s functions characterize the fundamental solutions of partial differential equations; they are essential for tasks ranging from shape analysis to physical simulation, yet they remain computationally prohibitive to evaluate on arbitrary geometric discretizations. We present Variational Green’s Function (VGF), a method that learns a smooth, differentiable representation of the Green’s function for linear self-adjoint PDE operators, including the Poisson, the screened Poisson, and the biharmonic equations. To resolve the sharp singularities characteristic of the Green’s functions, our method decomposes the Green’s function into an analytic free-space component, and a learned corrector component. Our method leverages a variational foundation to impose Neumann boundary conditions naturally, and imposes Dirichlet boundary conditions via a projective layer on the output of the neural field. The resulting Green’s functions are fast to evaluate, differentiable with respect to source application, and can be conditioned on other signals parameterizing our geometry.

[LG-61] Wireless TokenCom: RL-Based Tokenizer Agreement for Multi-User Wireless Token Communications

Link: https://arxiv.org/abs/2602.12338
Authors: Farshad Zeinali, Mahdi Boloursaz Mashhadi, Dusit Niyato, Rahim Tafazolli
Categories: Machine Learning (cs.LG)
Comments: Submitted to IEEE TVT for possible publication

Abstract:Token Communications (TokenCom) has recently emerged as an effective new paradigm, where tokens are the unified units of multimodal communications and computations, enabling efficient digital semantic- and goal-oriented communications in future wireless networks. To establish a shared semantic latent space, the transmitters/receivers in TokenCom need to agree on an identical tokenizer model and codebook. To this end, an initial Tokenizer Agreement (TA) process is carried out in each communication episode, where the transmitter/receiver cooperate to choose from a set of pre-trained tokenizer models/codebooks available to them both for efficient TokenCom. In this correspondence, we investigate TA in a multi-user downlink wireless TokenCom scenario, where the base station equipped with multiple antennas transmits video token streams to multiple users. We formulate the corresponding mixed-integer non-convex problem, and propose a hybrid reinforcement learning (RL) framework that integrates a deep Q-network (DQN) for joint tokenizer agreement and sub-channel assignment, with a deep deterministic policy gradient (DDPG) for beamforming. Simulation results show that the proposed framework outperforms baseline methods in terms of semantic quality and resource efficiency, while reducing the freezing events in video transmission by 68% compared to the conventional H.265-based scheme.

[LG-62] The Appeal and Reality of Recycling LoRAs with Adaptive Merging

Link: https://arxiv.org/abs/2602.12323
Authors: Haokun Liu, Gyung Hyun Je, Marco Ciccone, Zhenlin Xu, Prasanth YSS, Colin Raffel
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 24 pages, 14 figures, 5 tables. Preprint

Abstract:The widespread availability of fine-tuned LoRA modules for open pre-trained models has led to an interest in methods that can adaptively merge LoRAs to improve performance. These methods typically include some way of selecting LoRAs from a pool and tune merging coefficients based on a task-specific dataset. While adaptive merging methods have demonstrated improvements in some settings, no past work has attempted to recycle LoRAs found “in the wild” on model repositories like the Hugging Face Hub. To address this gap, we consider recycling from a pool of nearly 1,000 user-contributed LoRAs trained from the Llama 3.1 8B-Instruct language model. Our empirical study includes a range of adaptive and non-adaptive merging methods in addition to a new method designed via a wide search over the methodological design space. We demonstrate that adaptive merging methods can improve performance over the base model but provide limited benefit over training a new LoRA on the same data used to set merging coefficients. We additionally find not only that the specific choice of LoRAs to merge has little importance, but that using LoRAs with randomly initialized parameter values yields similar performance. This raises the possibility that adaptive merging from recycled LoRAs primarily works via some kind of regularization effect, rather than by enabling positive cross-task transfer. To better understand why past work has proven successful, we confirm that positive transfer is indeed possible when there are highly relevant LoRAs in the pool. We release the model checkpoints and code online.

[LG-63] Abstractive Red-Teaming of Language Model Character

Link: https://arxiv.org/abs/2602.12318
Authors: Nate Rahn, Allison Qi, Avery Griffin, Jonathan Michala, Henry Sleight, Erik Jones
Categories: Machine Learning (cs.LG)

Abstract:We want language model assistants to conform to a character specification, which asserts how the model should act across diverse user interactions. While models typically follow these character specifications, they can occasionally violate them in large-scale deployments. In this work, we aim to identify types of queries that are likely to produce such character violations at deployment, using much less than deployment-level compute. To do this, we introduce abstractive red-teaming, where we search for natural-language query categories, e.g. “The query is in Chinese. The query asks about family roles,” that routinely elicit violations. These categories abstract over the many possible variants of a query which could appear in the wild. We introduce two algorithms for efficient category search against a character-trait-specific reward model: one based on reinforcement learning on a category generator LLM, and another which leverages a strong LLM to iteratively synthesize categories from high-scoring queries. Across a 12-principle character specification and 7 target models, we find that our algorithms consistently outperform baselines, and generate qualitatively interesting categories; for example, queries which ask Llama-3.1-8B-Instruct to predict the future lead to responses saying that AI will dominate humanity, and queries that ask GPT-4.1-Mini for essential prison survival items lead to enthusiastic recommendation of illegal weapons. Overall, we believe our results represent an important step towards realistic pre-deployment auditing of language model character.

[LG-64] String-Level Ground Fault Localization for TN-Earthed Three-Phase Photovoltaic Systems

Link: https://arxiv.org/abs/2602.12289
Authors: Yuanliang Li, Xun Gong, Reza Iravani, Bo Cao, Heng Liu, Ziming Chen
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:The DC-side ground fault (GF) poses significant risks to three-phase TN-earthed photovoltaic (PV) systems, as the resulting high fault current can directly damage both PV inverters and PV modules. Once a fault occurs, locating the faulty string through manual string-by-string inspection is highly time-consuming and inefficient. This work presents a comprehensive analysis of GF characteristics through fault-current analysis and a simulation-based case study covering multiple fault locations. Building on these insights, we propose an edge-AI-based GF localization approach tailored for three-phase TN-earthed PV systems. A PLECS-based simulation model that incorporates PV hysteresis effects is developed to generate diverse GF scenarios, from which correlation-based features are extracted throughout the inverter’s four-stage shutdown sequence. Using the simulated dataset, a lightweight Variational Information Bottleneck (VIB)-based localization model is designed and trained, achieving over 93% localization accuracy at typical sampling rates with low computational cost, demonstrating strong potential for deployment on resource-constrained PV inverters.

[LG-65] Selection of CMIP6 Models for Regional Precipitation Projection and Climate Change Assessment in the Jhelum and Chenab River Basins

Link: https://arxiv.org/abs/2602.13181
Authors: Saad Ahmed Jamal, Ammara Nusrat, Muhammad Azmat, Muhammad Osama Nusrat
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 28 pages

Abstract:Effective water resource management depends on accurate projections of flows in water channels. For projected climate data, different General Circulation Models (GCMs) produce contrasting results. This study presents a selection of GCMs from the latest generation, CMIP6, for hydroclimate change impact studies. An envelope-based method was used for the selection, which includes components based on machine learning techniques, allowing the selection of GCMs without the need for in-situ reference data. To our knowledge, this is the first time such a comparison has been performed for the CMIP6 Shared Socioeconomic Pathway (SSP) scenario data. In addition, the effect of climate change under SSP scenarios was studied, along with the calculation of extreme indices. Finally, GCMs were compared to quantify spatiotemporal differences between CMIP5 and CMIP6 data. The results identify NorESM2-LM and FGOALS-g3 as the selected models for the Jhelum and Chenab River basins. Highly vulnerable regions under the effect of climate change were highlighted through spatial maps, which included parts of Punjab, Jammu, and Kashmir. Upon comparison of CMIP5 and CMIP6, no discernible difference was found between the precipitation projections of the RCP and SSP scenarios. In the future, more detailed statistical comparisons could further reinforce the proposition.

[LG-66] Improved Regret Guarantees for Online Mirror Descent using a Portfolio of Mirror Maps

Link: https://arxiv.org/abs/2602.13177
Authors: Swati Gupta, Jai Moondra, Mohit Singh
Categories: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:Online mirror descent (OMD) and its variants give a flexible framework for online convex optimization (OCO) where the performance depends crucially on the choice of the mirror map. While the geometries underlying online projected gradient descent (OPGD) and online exponentiated gradient (OEG), both special cases of OMD, are well understood, it remains a challenging open question how to construct an optimal mirror map for any given constrained set and a general family of loss functions, e.g., sparse losses. Motivated by parameterizing a near-optimal set of mirror maps, we consider a simpler question: is it even possible to obtain polynomial gains in regret by using mirror maps for geometries that interpolate between $L_1$ and $L_2$, which may not be possible by restricting to only OEG ($L_1$) or OPGD ($L_2$)? Our main result answers this question positively. We show that mirror maps based on block norms adapt better to the sparsity of loss functions, compared to previous $L_p$ (for $p \in [1, 2]$) interpolations. In particular, we construct a family of online convex optimization instances in $\mathbb{R}^d$, where block norm-based mirror maps achieve a provable polynomial (in $d$) improvement in regret over OEG and OPGD for sparse loss functions. We then turn to the setting in which the sparsity level of the loss functions is unknown. In this case, the choice of geometry itself becomes an online decision problem. We first show that naively switching between OEG and OPGD can incur linear regret, highlighting the intrinsic difficulty of geometry selection. To overcome this issue, we propose a meta-algorithm based on multiplicative weights that dynamically selects among a family of uniform block norms. We show that this approach effectively tunes OMD to the sparsity of the losses, yielding adaptive regret guarantees. Overall, our results demonstrate that online mirror-map selection can significantly enhance the ability of OMD to exploit sparsity in online convex optimization.
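The abstract mentions a meta-algorithm based on multiplicative weights that selects among several candidate geometries. The snippet below is only a generic sketch of the multiplicative-weights (Hedge) update over a fixed set of candidates, not the authors' algorithm: the function name `hedge_probabilities`, the learning rate `lr`, and the toy loss table are all assumptions made for illustration, and each "candidate" stands in for one mirror-map choice whose per-round loss is observed.

```python
import math

def hedge_probabilities(losses, lr):
    """Return the sampling distribution used at each round.

    losses[t][k] is the loss (assumed to lie in [0, 1]) that candidate k
    suffers at round t. Weights are kept in log space for stability.
    """
    num_candidates = len(losses[0])
    log_w = [0.0] * num_candidates          # uniform start
    history = []
    for round_losses in losses:
        shift = max(log_w)
        w = [math.exp(v - shift) for v in log_w]
        total = sum(w)
        history.append([v / total for v in w])
        # Multiplicative update: w_k <- w_k * exp(-lr * loss_k)
        log_w = [v - lr * l for v, l in zip(log_w, round_losses)]
    return history

# Toy usage: candidate 0 is consistently better than candidate 1.
toy_losses = [[0.1, 0.9]] * 50
probs = hedge_probabilities(toy_losses, lr=0.5)
print(probs[0], probs[-1])
```

In this toy run the distribution starts uniform and concentrates on the candidate with the smaller cumulative loss, which is the behaviour a geometry-selection layer would rely on.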
MSC classes: 68W27 (Primary) 90C25, 68Q25 (Secondary)

[LG-67] AdaGrad-Diff: A New Version of the Adaptive Gradient Algorithm

Link: https://arxiv.org/abs/2602.13112
Authors: Matia Bojovic, Saverio Salzo, Massimiliano Pontil
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 24 pages

Abstract:Vanilla gradient methods are often highly sensitive to the choice of stepsize, which typically requires manual tuning. Adaptive methods alleviate this issue and have therefore become widely used. Among them, AdaGrad has been particularly influential. In this paper, we propose an AdaGrad-style adaptive method in which the adaptation is driven by the cumulative squared norms of successive gradient differences rather than gradient norms themselves. The key idea is that when gradients vary little across iterations, the stepsize is not unnecessarily reduced, while significant gradient fluctuations, reflecting curvature or instability, lead to automatic stepsize damping. Numerical experiments demonstrate that the proposed method is more robust than AdaGrad in several practically relevant settings.
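To make the idea in the abstract concrete, here is a minimal sketch of a stepsize driven by accumulated squared norms of successive gradient differences. It is an illustration under stated assumptions, not the paper's exact update rule: the scalar (norm-based) stepsize, the convention that the first difference is taken against a zero gradient, and the names `adagrad_diff_sketch`, `eta` and `eps` are all hypothetical choices made here.

```python
import math

def adagrad_diff_sketch(grad, x0, eta=1.0, eps=1e-8, steps=200):
    """Gradient descent with stepsize eta / sqrt(eps + sum_k ||g_k - g_{k-1}||^2)."""
    x = list(x0)
    g_prev = [0.0] * len(x)   # assumption: g_{-1} = 0, so the first term is ||g_0||^2
    acc = 0.0                 # running sum of squared gradient differences
    for _ in range(steps):
        g = grad(x)
        acc += sum((gi - pi) ** 2 for gi, pi in zip(g, g_prev))
        step = eta / math.sqrt(eps + acc)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        g_prev = g
    return x

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = adagrad_diff_sketch(lambda x: list(x), [3.0, -4.0])
print(x_final)
```

On this quadratic the gradient differences shrink as the iterates converge, so the accumulator stays bounded and the stepsize does not decay to zero, which matches the intuition stated in the abstract.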

[LG-68] Random Forests as Statistical Procedures: Design Variance and Dependence

Link: https://arxiv.org/abs/2602.13104
Authors: Nathaniel S. O’Connell
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 26 pages, 2 figures. Supplementary material included

Abstract:Random forests are widely used prediction procedures, yet are typically described algorithmically rather than as statistical designs acting on a fixed dataset. We develop a finite-sample, design-based formulation of random forests in which each tree is an explicit randomized conditional regression function. This perspective yields an exact variance identity for the forest predictor that separates finite-aggregation variability from a structural dependence term that persists even under infinite aggregation. We further decompose both single-tree dispersion and inter-tree covariance using the laws of total variance and covariance, isolating two fundamental design mechanisms: reuse of training observations and alignment of data-adaptive partitions. These mechanisms induce a strict covariance floor, demonstrating that predictive variability cannot be eliminated by increasing the number of trees alone. The resulting framework clarifies how resampling, feature-level randomization, and split selection govern resolution, tree variability, and dependence, and establishes random forests as explicit finite-sample statistical designs whose behavior is determined by their underlying randomized construction.

[LG-69] Barron-Wiener-Laguerre models

Link: https://arxiv.org/abs/2602.13098
Authors: Rahul Manavalan, Filip Tronarp
Categories: Methodology (stat.ME); Machine Learning (cs.LG)

Abstract:We propose a probabilistic extension of Wiener-Laguerre models for causal operator learning. Classical Wiener-Laguerre models parameterize stable linear dynamics using orthonormal Laguerre bases and apply a static nonlinear map to the resulting features. While structurally efficient and interpretable, they provide only deterministic point estimates. We reinterpret the nonlinear component through the lens of Barron function approximation, viewing two-layer networks, random Fourier features, and extreme learning machines as discretizations of integral representations over parameter measures. This perspective naturally admits Bayesian inference on the nonlinear map and yields posterior predictive uncertainty. By combining Laguerre-parameterized causal dynamics with probabilistic Barron-type nonlinear approximators, we obtain a structured yet expressive class of causal operators equipped with uncertainty quantification. The resulting framework bridges classical system identification and modern measure-based function approximation, providing a principled approach to time-series modeling and nonlinear systems identification.

[LG-70] TFTF: Training-Free Targeted Flow for Conditional Sampling

Link: https://arxiv.org/abs/2602.12932
Authors: Qianqian Qu, Jun S. Liu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We propose a training-free conditional sampling method for flow matching models based on importance sampling. Because a naïve application of importance sampling suffers from weight degeneracy in high-dimensional settings, we modify and incorporate a resampling technique in sequential Monte Carlo (SMC) during intermediate stages of the generation process. To encourage generated samples to diverge along distinct trajectories, we derive a stochastic flow with adjustable noise strength to replace the deterministic flow at the intermediate stage. Our framework requires no additional training, while providing theoretical guarantees of asymptotic accuracy. Experimentally, our method significantly outperforms existing approaches on conditional sampling tasks for MNIST and CIFAR-10. We further demonstrate the applicability of our approach in higher-dimensional, multimodal settings through text-to-image generation experiments on CelebA-HQ.
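The weight degeneracy mentioned in the abstract is commonly diagnosed with the effective sample size (ESS) and countered by resampling. The sketch below shows the textbook version only (ESS-triggered multinomial resampling); the paper describes a modified resampling scheme whose details are not given in the abstract, so the threshold rule and the names `effective_sample_size` and `resample_if_degenerate` are assumptions for illustration.

```python
import random

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for weights normalized to sum to one."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)

def resample_if_degenerate(particles, weights, threshold=0.5, rng=None):
    """Multinomial resampling when ESS < threshold * N (an assumed trigger rule)."""
    rng = rng or random.Random(0)
    n = len(particles)
    if effective_sample_size(weights) >= threshold * n:
        return list(particles), list(weights)
    chosen = rng.choices(particles, weights=weights, k=n)
    return chosen, [1.0] * n      # weights are reset to uniform after resampling

particles = ["a", "b", "c", "d"]
kept, kept_w = resample_if_degenerate(particles, [1.0, 1.0, 1.0, 1.0])
collapsed, new_w = resample_if_degenerate(particles, [1.0, 0.0, 0.0, 0.0])
print(kept, collapsed)
```

With uniform weights nothing happens; when one particle carries all the weight, every resampled particle is a copy of it, which is exactly the loss of diversity that a stochastic flow with adjustable noise is meant to counteract.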

[LG-71] Annealing in variational inference mitigates mode collapse: A theoretical study on Gaussian mixtures

Link: https://arxiv.org/abs/2602.12923
Authors: Luigi Fogliani, Bruno Loureiro, Marylou Gabrié
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Mode collapse, the failure to capture one or more modes when targeting a multimodal distribution, is a central challenge in modern variational inference. In this work, we provide a mathematical analysis of annealing-based strategies for mitigating mode collapse in a tractable setting: learning a Gaussian mixture, where mode collapse is known to arise. Leveraging a low-dimensional summary statistics description, we precisely characterize the interplay between the initial temperature and the annealing rate, and derive a sharp formula for the probability of mode collapse. Our analysis shows that an appropriately chosen annealing scheme can robustly prevent mode collapse. Finally, we present numerical evidence that these theoretical tradeoffs qualitatively extend to neural-network-based models, RealNVP normalizing flows, providing guidance for designing annealing strategies mitigating mode collapse in practical variational inference pipelines.

[LG-72] Blessings of Multiple Good Arms in Multi-Objective Linear Bandits

Link: https://arxiv.org/abs/2602.12901
Authors: Heesang Ann, Min-hwan Oh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 58 pages

Abstract:The multi-objective bandit setting has traditionally been regarded as more complex than the single-objective case, as multiple objectives must be optimized simultaneously. In contrast to this prevailing view, we demonstrate that when multiple good arms exist for multiple objectives, they can induce a surprising benefit: implicit exploration. Under this condition, we show that simple algorithms that greedily select actions in most rounds can nonetheless achieve strong performance, both theoretically and empirically. To our knowledge, this is the first study to introduce implicit exploration in both multi-objective and parametric bandit settings without any distributional assumptions on the contexts. We further introduce a framework for effective Pareto fairness, which provides a principled approach to rigorously analyzing fairness of multi-objective bandit algorithms.

[LG-73] A Regularization-Sharpness Tradeoff for Linear Interpolators

Link: https://arxiv.org/abs/2602.12680
Authors: Qingyi Hu, Liam Hodgkinson
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 29 pages, 4 figures

Abstract:The rule of thumb regarding the relationship between the bias-variance tradeoff and model size plays a key role in classical machine learning, but is now well-known to break down in the overparameterized setting as per the double descent curve. In particular, minimum-norm interpolating estimators can perform well, suggesting the need for a new tradeoff in these settings. Accordingly, we propose a regularization-sharpness tradeoff for overparameterized linear regression with an $\ell^p$ penalty. Inspired by the interpolating information criterion, our framework decomposes the selection penalty into a regularization term (quantifying the alignment of the regularizer and the interpolator) and a geometric sharpness term on the interpolating manifold (quantifying the effect of local perturbations), yielding a tradeoff analogous to bias-variance. Building on prior analyses that established this information criterion for ridge regularizers, this work first provides a general expression of the interpolating information criterion for $\ell^p$ regularizers where $p \ge 2$. Subsequently, we extend this to the LASSO interpolator with $\ell^1$ regularizer, which induces stronger sparsity. Empirical results on real-world datasets with random Fourier features and polynomials validate our theory, demonstrating how the tradeoff terms can distinguish performant linear interpolators from weaker ones.

[LG-74] Linear Regression with Unknown Truncation Beyond Gaussian Features

Link: https://arxiv.org/abs/2602.12534
Authors: Alexandros Kouridakis, Anay Mehrotra, Alkis Kalavasis, Constantine Caramanis
Categories: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:In truncated linear regression, samples $(x, y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$ and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly}(1/\varepsilon)}$ run time for achieving $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with unknown survival set that runs in $\mathrm{poly}(d/\varepsilon)$ time, by only requiring that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of works on positive-only PAC learning and may be of independent interest.

[LG-75] Task- and Metric-Specific Signal Quality Indices for Medical Time Series

Link: https://arxiv.org/abs/2602.12478
Authors: Jad Haidamous, Christoph Hoog Antink
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, submitted to EUSIPCO 2026

Abstract:Medical time series such as electrocardiograms (ECGs) and photoplethysmograms (PPGs) are frequently affected by measurement artifacts due to challenging acquisition environments, such as in ambulances and during routine daily activities. Since automated algorithms for analyzing such signals increasingly inform clinically relevant decisions, identifying signal segments on which these algorithms may produce unreliable outputs is of critical importance. Signal quality indices (SQIs) are commonly used for this purpose. However, most existing SQIs are task agnostic and do not account for the specific algorithm and performance metric used downstream. In this work, we formalize signal quality as a task- and metric-dependent concept and propose a perturbation-based SQI (pSQI) that aims to detect an algorithm’s performance degradation on an input signal with respect to a metric. The pSQI is defined as the worst-case value of the performance metric under an additive, colored Gaussian noise perturbation with a lower-bounded signal-to-noise ratio. We introduce formal requirements for task- and metric-specific SQIs, including monotonicity of the metric in expectation and maximal separation under thresholding. Experiments on R-peak detection and atrial fibrillation classification benchmarks demonstrate that the proposed pSQI consistently outperforms existing feature- and deep learning-based SQIs in identifying unreliable inputs without requiring training.

[LG-76] Probabilistic Design of Parametrized Quantum Circuits through Local Gate Modifications

Link: https://arxiv.org/abs/2602.12465
Authors: Grier M. Jones, Aviraj Newatia, Alexander Lao, Aditya K. Rao, Viki Kumar Prasad, Hans-Arno Jacobsen
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Within quantum machine learning, parametrized quantum circuits provide flexible quantum models, but their performance is often highly task-dependent, making manual circuit design challenging. Alternatively, quantum architecture search algorithms have been proposed to automate the discovery of task-specific parametrized quantum circuits using systematic frameworks. In this work, we propose an evolution-inspired heuristic quantum architecture search algorithm, which we refer to as the local quantum architecture search. The goal of the local quantum architecture search algorithm is to optimize parametrized quantum circuit architectures through a local, probabilistic search over a fixed set of gate-level actions applied to existing circuits. We evaluate the local quantum architecture search algorithm on two synthetic function-fitting regression tasks and two quantum chemistry regression datasets, including the BSE49 dataset of bond separation energies for first- and second-row elements and a dataset of water conformers generated using the data-driven coupled-cluster approach. Using state-vector simulation, our results highlight the applicability of the local quantum architecture search algorithm for identifying competitive circuit architectures with desirable performance metrics. Lastly, we analyze the properties of the discovered circuits and demonstrate the deployment of the best-performing model on state-of-the-art quantum hardware.

[LG-77] Neural and numerical methods for $\mathrm{G}_2$-structures on contact Calabi-Yau 7-manifolds

Link: https://arxiv.org/abs/2602.12438
Authors: Elli Heyes, Edward Hirst, Henrique N. Sá Earp, Tomás S. R. Silva
Categories: Differential Geometry (math.DG); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
Comments: 8+5 pages, 9 figures

Abstract:A numerical framework for approximating $\mathrm{G}_2$-structure 3-forms on contact Calabi-Yau manifolds is presented. The approach proceeds in three stages: first, existing neural network models are employed to compute an approximate Ricci-flat metric on a Calabi-Yau threefold. Second, using this metric and the explicit construction of a $\mathrm{G}_2$-structure on the associated 7-dimensional Calabi-Yau link in the 9-sphere, numerical approximations of the 3-form are generated on a large set of sampled points. Finally, a dedicated neural architecture is trained to learn the 3-form and its induced Riemannian metric directly from data, validating the learned structure and its torsion via a numerical implementation of the exterior derivative, which may be of independent interest.

[LG-78] Accelerating Feedback-based Algorithms for Quantum Optimization Using Gradient Descent

Link: https://arxiv.org/abs/2602.12387
Authors: Masih Mozakka,Mohsen Heidari
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures

Abstract:Feedback-based methods have gained significant attention as an alternative training paradigm for the Quantum Approximate Optimization Algorithm (QAOA) in solving combinatorial optimization problems such as MAX-CUT. In particular, Quantum Lyapunov Control (QLC) employs feedback-driven control laws that guarantee monotonic non-decreasing objective values, can substantially reduce the training overhead of QAOA, and mitigate barren plateaus. However, these methods might require long control sequences, leading to sub-optimal convergence rates. In this work, we propose a hybrid method that incorporates per-layer gradient estimation to accelerate the convergence of QLC while preserving its low training overhead and stability guarantees. By leveraging layer-wise gradient information, the proposed approach selects near-optimal control parameters, resulting in significantly faster convergence and improved robustness. We validate the effectiveness of the method through extensive numerical experiments across a range of problem instances and optimization settings.
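
The abstract does not state its control law. As background, the feedback law commonly used in the feedback-based quantum optimization literature can be written as below. This is a sketch under assumptions: the symbols $H_p$ (problem Hamiltonian), $H_d$ (driver Hamiltonian), $\beta$ (control), and $A$ are not taken from the abstract, and the paper's per-layer gradient step is not shown.

```latex
\begin{aligned}
i\,\frac{d}{dt}\lvert\psi(t)\rangle &= \bigl(H_p + \beta(t)\,H_d\bigr)\lvert\psi(t)\rangle,\\
\frac{d}{dt}\langle H_p\rangle &= \beta(t)\,A(t), \qquad
A(t) = \langle\psi(t)\rvert\, i\,[H_d, H_p]\,\lvert\psi(t)\rangle,\\
\beta(t) &= -A(t) \;\Longrightarrow\; \frac{d}{dt}\langle H_p\rangle = -A(t)^2 \le 0 .
\end{aligned}
```

With this choice $\langle H_p\rangle$ never increases. Flipping the sign, $\beta(t) = +A(t)$, gives a non-decreasing objective instead, which matches the maximization phrasing in the abstract.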

[LG-79] A Gradient Boosted Mixed-Model Machine Learning Framework for Vessel Speed in the U.S. Arctic

Link: https://arxiv.org/abs/2602.12292
Authors: Mauli Pant,Linda Fernandez,Indranil Sahoo
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Abstract:Understanding how environmental and operational conditions influence vessel speed is crucial for characterizing navigational conditions in the Arctic. We analyzed Automatic Identification System (AIS) data from 2010-2019 to examine vessel speed over ground (SOG). Over half of the AIS records showed zero SOG, and treating zero and positive SOG as a single continuous process can obscure important patterns. We therefore applied a two-stage machine learning framework, first modeling the probability of SOG greater than zero and then modeling SOG conditional on being positive. AIS observations were integrated with sea ice concentration, course over ground, wind, bathymetric depth, distance to coast, vessel group, and navigational status. Gradient boosted decision trees with random effects captured nonlinear environmental responses while accounting for repeated observations. The positive SOG classifier achieved strong discrimination (AUC = 0.85), while the conditional speed model explained approximately 77 percent of out-of-fold variance. SHAP values quantified covariate effects by decomposing model predictions into additive contributions from individual variables. Distance to coast and bathymetric depth were dominant determinants of both the likelihood and magnitude of vessel speed, while changes in course, vessel group, and navigational status introduced secondary variation. Wind and sea ice effects were modest. Together, these results empirically characterize Arctic vessel operating regimes relevant to speed management and corridor-level assessment.
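
The two-stage structure described above (first the probability that SOG is positive, then SOG conditional on being positive) can be sketched as follows. This is only an illustration of the decomposition $E[\mathrm{SOG}] = P(\mathrm{SOG} > 0) \cdot E[\mathrm{SOG} \mid \mathrm{SOG} > 0]$. Per-group averages stand in for the paper's gradient boosted trees with random effects, and the records and group names are made up.

```python
from collections import defaultdict


def fit_two_stage(records):
    """Fit a toy two-stage model from (group, sog) pairs.

    Stage 1: per-group share of records with SOG > 0.
    Stage 2: per-group mean SOG over the positive records only.
    """
    counts = defaultdict(int)
    positive = defaultdict(list)
    for group, sog in records:
        counts[group] += 1
        if sog > 0:
            positive[group].append(sog)
    model = {}
    for group, n in counts.items():
        pos = positive[group]
        p_moving = len(pos) / n
        mean_if_moving = sum(pos) / len(pos) if pos else 0.0
        model[group] = (p_moving, mean_if_moving)
    return model


def expected_sog(model, group):
    """Combine both stages: P(SOG > 0) * E[SOG | SOG > 0]."""
    p_moving, mean_if_moving = model[group]
    return p_moving * mean_if_moving


# Hypothetical records: (vessel group, speed over ground in knots).
records = [("cargo", 0.0), ("cargo", 0.0), ("cargo", 10.0), ("cargo", 12.0),
           ("tug", 0.0), ("tug", 6.0)]
model = fit_two_stage(records)
print(model["cargo"], expected_sog(model, "cargo"))
```

Keeping the stages separate means the many zero-speed records inform only the first stage and do not pull down the speed estimate for vessels that are under way.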

Attachments

Click to download today's full paper list