This post lists the latest papers fetched from Arxiv.org on 2026-02-23, updated automatically and organized into six broad areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from Arxiv.org daily, with a scheduled automatic update around 12:30 each day.

Tip: if today's update has not appeared, either Arxiv released no new papers that day or the script failed. Fixes are applied the same day whenever possible.

Table of Contents

Overview (2026-02-23)

412 papers were updated today, including:

  • Natural Language Processing: 49 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 101 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 60 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 159 papers (Machine Learning, cs.LG)
  • Multiagent Systems: 5 papers (Multiagent Systems, cs.MA)
  • Information Retrieval: 16 papers (Information Retrieval, cs.IR)
  • Human-Computer Interaction: 22 papers (Human-Computer Interaction, cs.HC)

Multiagent Systems

[MA-0] Mean-Field Reinforcement Learning without Synchrony

Quick Read: This paper targets the scalability of multi-agent reinforcement learning (MARL) under asynchronous decision-making. Traditional mean-field reinforcement learning (MF-RL) requires every agent to act at every step so that the mean action can be computed as a summary statistic of the global state; when some agents are idle, that statistic is undefined and the method breaks down. The authors therefore propose a new framework, the Temporal Mean Field (TMF), built around the population distribution μ ∈ Δ(𝒪). Its key idea is to replace the mean action with the fraction of agents at each observation as the aggregation statistic: this distribution has dimension independent of the total number of agents N and, under exchangeability, fully determines each agent's reward and transition functions. On this foundation, the paper constructs from scratch a unified theory covering everything from synchronous to purely sequential decision-making, proving existence and uniqueness of the TMF equilibrium, establishing an O(1/√N) finite-population approximation bound regardless of how many agents act per step, and designing a policy-gradient algorithm, TMF-PG, that provably converges to the unique equilibrium. Experiments on resource selection and dynamic queueing games confirm robust performance, with approximation error decaying exactly at the predicted rate.

Link: https://arxiv.org/abs/2602.18026
Authors: Shan Yang
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 5 figures, 1 algorithm

Abstract: Mean-field reinforcement learning (MF-RL) scales multi-agent RL to large populations by reducing each agent's dependence on others to a single summary statistic – the mean action. However, this reduction requires every agent to act at every time step; when some agents are idle, the mean action is simply undefined. Addressing asynchrony therefore requires a different summary statistic – one that remains defined regardless of which agents act. The population distribution μ ∈ Δ(𝒪) – the fraction of agents at each observation – satisfies this requirement: its dimension is independent of N, and under exchangeability it fully determines each agent's reward and transition. Existing MF-RL theory, however, is built on the mean action and does not extend to μ. We therefore construct the Temporal Mean Field (TMF) framework around the population distribution μ from scratch, covering the full spectrum from fully synchronous to purely sequential decision-making within a single theory. We prove existence and uniqueness of TMF equilibria, establish an O(1/√N) finite-population approximation bound that holds regardless of how many agents act per step, and prove convergence of a policy gradient algorithm (TMF-PG) to the unique equilibrium. Experiments on a resource selection game and a dynamic queueing game confirm that TMF-PG achieves near-identical performance whether one agent or all N act per step, with approximation error decaying at the predicted O(1/√N) rate.
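The TMF summary statistic is simple to compute in practice. As a minimal sketch (toy observation indices, not the paper's code), the population distribution is just a normalized histogram over the observation space, and its dimension stays |𝒪| no matter how many agents exist or act:

```python
import numpy as np

def population_distribution(observations, num_obs):
    """Fraction of agents at each observation -- the TMF summary statistic.

    Unlike the mean action, this is defined even when only a subset of
    agents acts at a given step: idle agents still occupy an observation.
    """
    counts = np.bincount(observations, minlength=num_obs)
    return counts / len(observations)

# Toy example (hypothetical): 6 agents over an observation space of size 3.
obs = np.array([0, 0, 1, 2, 2, 2])
mu = population_distribution(obs, num_obs=3)
# mu has dimension 3 regardless of the number of agents N.
```

Note how doubling the number of agents leaves the shape of `mu` unchanged, which is exactly the property that makes the statistic usable under asynchrony.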

[MA-1] El Agente Gráfico: Structured Execution Graphs for Scientific Agents

Quick Read: This paper tackles the ad hoc, fragile integration of large language models (LLMs) with heterogeneous computational tools in scientific workflow automation, and the poor decision traceability and auditability caused by existing agents' reliance on unstructured text for context management and execution coordination. The key to the solution is El Agente Gráfico, a single-agent framework that embeds LLM-driven decision-making in a type-safe execution environment, combined with a dynamic knowledge graph for external persistence. Through a structured abstraction of scientific concepts and an object-graph mapper, computational state is represented as typed Python objects, so context is managed via typed symbolic identifiers rather than raw text, ensuring consistency, supporting provenance tracking, and enabling efficient tool orchestration.

Link: https://arxiv.org/abs/2602.17902
Authors: Jiaru Bai, Abdulrahman Aldossary, Thomas Swanick, Marcel Müller, Yeonghun Kang, Zijian Zhang, Jin Won Lee, Tsz Wai Ko, Mohammad Ghazi Vakili, Varinia Bernales, Alán Aspuru-Guzik
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE); Chemical Physics (physics.chem-ph)
Comments:

Abstract:Large language models (LLMs) are increasingly used to automate scientific workflows, yet their integration with heterogeneous computational tools remains ad hoc and fragile. Current agentic approaches often rely on unstructured text to manage context and coordinate execution, generating often overwhelming volumes of information that may obscure decision provenance and hinder auditability. In this work, we present El Agente Gráfico, a single-agent framework that embeds LLM-driven decision-making within a type-safe execution environment and dynamic knowledge graphs for external persistence. Central to our approach is a structured abstraction of scientific concepts and an object-graph mapper that represents computational state as typed Python objects, stored either in memory or persisted in an external knowledge graph. This design enables context management through typed symbolic identifiers rather than raw text, thereby ensuring consistency, supporting provenance tracking, and enabling efficient tool orchestration. We evaluate the system by developing an automated benchmarking framework across a suite of university-level quantum chemistry tasks previously evaluated on a multi-agent system, demonstrating that a single agent, when coupled to a reliable execution engine, can robustly perform complex, multi-step, and parallel computations. We further extend this paradigm to two other large classes of applications: conformer ensemble generation and metal-organic framework design, where knowledge graphs serve as both memory and reasoning substrates. Together, these results illustrate how abstraction and type safety can provide a scalable foundation for agentic scientific automation beyond prompt-centric designs.

[MA-2] MultiVer: Zero-Shot Multi-Agent Vulnerability Detection

Quick Read: This paper addresses the limited generalization of software vulnerability-detection models, in particular how to achieve high recall, and thus fewer false negatives, without fine-tuning. The key is MultiVer, a zero-shot multi-agent system composed of four specialist agents (security, correctness, performance, style) whose verdicts are combined by union voting. Experiments show this architecture reaches 82.7% recall on the PyVul benchmark, surpassing fine-tuned GPT-3.5 (81.3%), and a 91.7% detection rate on SecurityEval, matching specialized systems. This validates multi-agent collaborative analysis for security-sensitive settings, especially where false negatives are costlier than false positives.

Link: https://arxiv.org/abs/2602.17875
Authors: Shreshth Rajan
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract: We present MultiVer, a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble (security, correctness, performance, style) with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points – the first zero-shot system to surpass fine-tuned performance on this benchmark. On SecurityEval, the same architecture achieves 91.7% detection rate, matching specialized systems. The recall improvement comes at a precision cost: 48.8% precision versus 63.9% for fine-tuned baselines, yielding 61.4% F1. Ablation experiments isolate component contributions: the multi-agent ensemble adds 17 percentage points recall over single-agent security analysis. These results demonstrate that for security applications where false negatives are costlier than false positives, zero-shot multi-agent ensembles can match and exceed fine-tuned models on the metric that matters most.
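Union voting over specialist verdicts can be sketched in a few lines. The agent verdicts below are hypothetical, not the paper's outputs; the point is only the mechanism, which flags a sample whenever any of the four agents flags it:

```python
def union_vote(verdicts):
    """Flag a sample as vulnerable if ANY specialist agent flags it.

    Union voting maximizes recall at a precision cost, matching the
    paper's reported trade-off (82.7% recall vs. 48.8% precision).
    """
    return any(verdicts)

# Hypothetical verdicts from the four agents
# (security, correctness, performance, style) on three code samples.
samples = [
    {"verdicts": [True, False, False, False], "label": True},
    {"verdicts": [False, False, False, False], "label": True},
    {"verdicts": [False, True, False, False], "label": False},
]
preds = [union_vote(s["verdicts"]) for s in samples]
```

The third sample illustrates the trade-off: a single dissenting agent is enough to produce a false positive, which is why recall rises while precision falls.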

[MA-3] Reasoning-Native Agentic Communication for 6G

Quick Read: This paper addresses belief divergence among autonomous agents in future 6G networks: even agents that interpret the same information consistently may behave inconsistently because their internal reasoning processes evolve differently. The key to the solution is a new paradigm, reasoning-native agentic communication, in which communication is explicitly designed to regulate the alignment of belief states across agents rather than merely transmit data or semantics. Concretely, a coordination plane grounded in a shared knowledge structure and bounded belief modeling is added to the conventional communication stack, and transmissions are triggered by predicted misalignment of belief states, preventing coordination drift and maintaining consistent, coherent behavior across heterogeneous systems.

Link: https://arxiv.org/abs/2602.17738
Authors: Hyowoon Seo, Joonho Seon, Jin Young Kim, Mehdi Bennis, Wan Choi, Dong In Kim
Affiliations: Sungkyunkwan University; Kwangwoon University; University of Oulu; Seoul National University
Categories: Multiagent Systems (cs.MA); Information Theory (cs.IT)
Comments: 8 pages, 4 figures

Abstract: Future 6G networks will interconnect not only devices, but autonomous machines that continuously sense, reason, and act. In such environments, communication can no longer be understood solely as delivering bits or even preserving semantic meaning. Even when two agents interpret the same information correctly, they may still behave inconsistently if their internal reasoning processes evolve differently. We refer to this emerging challenge as belief divergence. This article introduces reasoning-native agentic communication, a new paradigm in which communication is explicitly designed to address belief divergence rather than merely transmitting representations. Instead of triggering transmissions based only on channel conditions or data relevance, the proposed framework activates communication according to predicted misalignment in agents' internal belief states. We present a reasoning-native architecture that augments the conventional communication stack with a coordination plane grounded in a shared knowledge structure and bounded belief modeling. Through enabling mechanisms and representative multi-agent scenarios, we illustrate how such an approach can prevent coordination drift and maintain coherent behavior across heterogeneous systems. By reframing communication as a regulator of distributed reasoning, reasoning-native agentic communication enables 6G networks to act as an active harmonizer of autonomous intelligence.

[MA-4] Nested Training for Mutual Adaptation in Human-AI Teaming

Quick Read: This paper addresses mutual adaptation in human-robot teaming: humans dynamically adjust their strategies in response to a robot's policy, which static training partners cannot capture, leaving robots that generalize poorly to new human partners. The key to the solution is to model human-robot collaboration as an Interactive Partially Observable Markov Decision Process (I-POMDP) that explicitly includes human adaptation in the state, together with a nested training regime in which agents at each level are trained against adaptive agents from the level below. This exposes the ego agent to realistic adaptive behavior during training while avoiding the implicit coordination strategies that emerge when multiple agents learn simultaneously and that only work with the specific co-training partners. The method markedly improves task performance and adaptability when the robot is paired with unseen adaptive partners.

Link: https://arxiv.org/abs/2602.17737
Authors: Upasana Biswas, Durgesh Kalwar, Subbarao Kambhampati, Sarath Sreedharan
Affiliations: Arizona State University; Indian Institute of Technology, Bombay
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Mutual adaptation is a central challenge in human–AI teaming, as humans naturally adjust their strategies in response to a robot’s policy. Existing approaches aim to improve diversity in training partners to approximate human behavior, but these partners are static and fail to capture adaptive behavior of humans. Exposing robots to adaptive behaviors is critical, yet when both agents learn simultaneously in a multi-agent setting, they often converge to opaque implicit coordination strategies that only work with the agents they were co-trained with. Such agents fail to generalize when paired with new partners. In order to capture the adaptive behavior of humans, we model the human-robot teaming scenario as an Interactive Partially Observable Markov Decision Process (I-POMDP), explicitly modeling human adaptation as part of the state. We propose a nested training regime to approximately learn the solution to a finite-level I-POMDP. In this framework, agents at each level are trained against adaptive agents from the level below. This ensures that the ego agent is exposed to adaptive behavior during training while avoiding the emergence of implicit coordination strategies, since the training partners are not themselves learning. We train our method in a multi-episode, required cooperation setup in the Overcooked domain, comparing it against several baseline agents designed for human-robot teaming. We evaluate the performance of our agent when paired with adaptive partners that were not seen during training. Our results demonstrate that our agent not only achieves higher task performance with these adaptive partners but also exhibits significantly greater adaptability during team interactions.

Natural Language Processing

[NLP-0] SPQ: An Ensemble Technique for Large Language Model Compression LREC2026

Quick Read: This paper addresses the high memory footprint and low computational efficiency of large language models (LLMs) in deployment, especially in resource-constrained environments. The key is SPQ (SVD-Pruning-Quantization), a layered compression method built on three complementary techniques: (i) activation-based pruning to remove redundant neurons in MLP layers, (ii) variance-retained singular value decomposition (SVD) to reduce attention projections into compact low-rank factors, and (iii) post-training linear quantization to uniformly compress all linear layers to 8 bits. This layered, complementary design lets SPQ clearly outperform single-technique compression at matched compression ratios; on LLaMA-2-7B it achieves up to 75% memory reduction while maintaining or improving language-modeling performance (e.g., WikiText-2 perplexity from 5.47 to 4.91) and downstream accuracy, uses less memory than strong baselines such as GPTQ, and improves inference throughput by up to 1.9x.

Link: https://arxiv.org/abs/2602.18420
Authors: Jiamin Yao, Eren Gultepe
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to LREC 2026 Main Conference

Abstract: This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines like GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. The effectiveness of SPQ's robust compression through layer-aware and complementary compression techniques may provide practical deployment of LLMs in memory-constrained environments. Code is available at: this https URL
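Two of SPQ's three components, variance-retained SVD and 8-bit linear quantization, can be sketched directly on a toy weight matrix. The rank-selection rule (cumulative spectral energy) and symmetric per-tensor scaling here are plausible assumptions, not the paper's exact recipe:

```python
import numpy as np

def variance_retained_svd(W, retain=0.9):
    """Factor W into low-rank (A, B) keeping `retain` of the spectral energy."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(energy, retain) + 1)  # smallest rank reaching `retain`
    return U[:, :r] * s[:r], Vt[:r]               # W ~ A @ B

def quantize_int8(W):
    """Symmetric per-tensor 8-bit linear quantization."""
    scale = np.abs(W).max() / 127.0
    q = np.round(W / scale).astype(np.int8)
    return q, scale

W = np.random.default_rng(0).normal(size=(64, 64))
A, B = variance_retained_svd(W, retain=0.95)  # low-rank factors
q, scale = quantize_int8(A @ B)               # 8-bit storage
W_hat = q.astype(np.float32) * scale          # dequantized reconstruction
```

Storing `A`, `B` (or their quantized forms) in place of `W` is what yields the memory reduction; the reconstruction error is bounded by the discarded spectral energy plus a small quantization term.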

[NLP-1] Subgroups of U(d) Induce Natural RNN and Transformer Architectures

Quick Read: This paper addresses the flexibility and generality of state-space design in sequence models: how to construct RNN and Transformer architectures from a unified template in which the choice of subgroup can vary freely. The key is a framework built on closed subgroups of U(d): starting from a minimal axiomatic setup, recurrent and Transformer templates are derived from a shared skeleton in which the subgroup choice acts as a drop-in replacement for the state space, tangent projection, and update map. Beyond theoretical coherence, the approach is validated empirically on the orthogonal group O(d) (Tiny Shakespeare and Penn Treebank) under parameter-matched settings, and a linear-mixing extension in tangent space further improves finite-budget performance.

Link: https://arxiv.org/abs/2602.18417
Authors: Joshua Nunley
Affiliations: Indiana University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 12 pages, 3 figures, 8 tables

Abstract:This paper presents a direct framework for sequence models with hidden states on closed subgroups of U(d). We use a minimal axiomatic setup and derive recurrent and transformer templates from a shared skeleton in which subgroup choice acts as a drop-in replacement for state space, tangent projection, and update map. We then specialize to O(d) and evaluate orthogonal-state RNN and transformer models on Tiny Shakespeare and Penn Treebank under parameter-matched settings. We also report a general linear-mixing extension in tangent space, which applies across subgroup choices and improves finite-budget performance in the current O(d) experiments.
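A hidden-state update constrained to O(d) needs exactly the two ingredients the template names: a tangent projection (the skew-symmetric part of an arbitrary update) and a map back onto the group. The sketch below uses a Cayley retraction in place of the exponential map as the update map; the step size and the choice of Cayley over exp are illustrative assumptions, not the paper's construction:

```python
import numpy as np

def skew_project(M):
    """Tangent projection for O(d): keep the skew-symmetric part of M."""
    return 0.5 * (M - M.T)

def orthogonal_step(Q, raw_update, lr=0.1):
    """Update an orthogonal hidden state Q along a projected direction.

    The Cayley map (I - A)^{-1}(I + A) sends any skew-symmetric A to an
    orthogonal matrix exactly, so Q stays on O(d) after every step.
    """
    A = lr * skew_project(raw_update)
    I = np.eye(Q.shape[0])
    return Q @ np.linalg.solve(I - A, I + A)

# Hypothetical recurrent step: state on O(4), arbitrary raw update direction.
rng = np.random.default_rng(1)
Q = np.eye(4)
Q = orthogonal_step(Q, rng.normal(size=(4, 4)))
```

Because the retraction is exact, orthogonality never drifts over long sequences, which is the practical appeal of group-constrained hidden states.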

[NLP-2] Validating Political Position Predictions of Arguments

Quick Read: This paper addresses knowledge representation for subjective, continuous real-world attributes such as political positions, which conflict with pairwise comparison, the conventional gold standard for human evaluation. The core challenge is ensuring reliable evaluation while remaining scalable. The key is a dual-scale validation framework combining pointwise and pairwise human annotation: pointwise evaluation shows moderate human-model agreement (Krippendorff's α = 0.578), reflecting intrinsic subjectivity, while pairwise validation reveals much stronger ranking agreement (α = 0.86), indicating that reliable ordinal structure can be extracted from language-model predictions even in subjective contexts. This offers a practical path to reliable subjective continuous knowledge bases and advances knowledge representation for complex domains such as politics.

Link: https://arxiv.org/abs/2602.18351
Authors: Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures, 6 tables. Under review

Abstract: Real-world knowledge representation often requires capturing subjective, continuous attributes – such as political positions – that conflict with pairwise validation, the widely accepted gold standard for human evaluation. We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation. Using 22 language models, we construct a large-scale knowledge base of political position predictions for 23,228 arguments drawn from 30 debates that appeared on the UK political television programme Question Time. Pointwise evaluation shows moderate human-model agreement (Krippendorff's α = 0.578), reflecting intrinsic subjectivity, while pairwise validation reveals substantially stronger alignment between human- and model-derived rankings (α = 0.86 for the best model). This work contributes: (i) a practical validation methodology for subjective continuous knowledge that balances scalability with reliability; (ii) a validated structured argumentation knowledge base enabling graph-based reasoning and retrieval-augmented generation in political domains; and (iii) evidence that ordinal structure can be extracted from pointwise language model predictions on inherently subjective real-world discourse, advancing knowledge representation capabilities for domains where traditional symbolic or categorical approaches are insufficient.

[NLP-3] Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System

Quick Read: This paper addresses judgment prediction and explanation for appellate cases in the Indian judicial system, where courts face severe backlogs and AI could improve both prediction accuracy and interpretability. The key is Vichara, a novel framework that decomposes English-language appellate case documents into "decision points", each capturing the legal issue, deciding authority, outcome, reasoning, and temporal context, thereby isolating the core legal determinations and their context. Vichara's explanations follow a structured format inspired by IRAC (Issue-Rule-Application-Conclusion) and adapted to Indian legal reasoning, substantially improving interpretability and letting legal professionals assess the soundness of predictions efficiently.

Link: https://arxiv.org/abs/2602.18346
Authors: Pavithra PM Nair, Preethu Rose Anish
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In jurisdictions like India, where courts face an extensive backlog of cases, artificial intelligence offers transformative potential for legal judgment prediction. A critical subset of this backlog comprises appellate cases, which are formal decisions issued by higher courts reviewing the rulings of lower courts. To this end, we present Vichara, a novel framework tailored to the Indian judicial system that predicts and explains appellate judgments. Vichara processes English-language appellate case proceeding documents and decomposes them into decision points. Decision points are discrete legal determinations that encapsulate the legal issue, deciding authority, outcome, reasoning, and temporal context. The structured representation isolates the core determinations and their context, enabling accurate predictions and interpretable explanations. Vichara’s explanations follow a structured format inspired by the IRAC (Issue-Rule-Application-Conclusion) framework and adapted for Indian legal reasoning. This enhances interpretability, allowing legal professionals to assess the soundness of predictions efficiently. We evaluate Vichara on two datasets, PredEx and the expert-annotated subset of the Indian Legal Documents Corpus (ILDC_expert), using four large language models: GPT-4o mini, Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B. Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B. Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini’s superior interpretability.

[NLP-4] On the “Induction Bias” in Sequence Models

Quick Read: This paper examines an inherent limitation of Transformer-based language models in state tracking, focusing on data efficiency in the in-distribution setting where training and test distributions match. The key contribution is a large-scale experimental comparison of the data efficiency of Transformers and recurrent neural networks (RNNs) across multiple supervision regimes. It shows that the training data Transformers require grows much faster with state-space size and sequence length, with negligible or even detrimental weight sharing across sequence lengths, whereas RNNs achieve amortized learning through effective weight sharing, so data at one sequence length improves performance at others. The findings indicate that state tracking remains a fundamental challenge for Transformers even under matched distributions.

Link: https://arxiv.org/abs/2602.18333
Authors: M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.

[NLP-5] Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning

Quick Read: This paper addresses automatic selection of high-quality contexts for first-language vocabulary instruction for high school students, i.e., identifying genuinely informative contextual examples among many candidates. The key is a deep-learning framework guided by human supervision: embeddings are extracted with an instruction-aware, fine-tuned model (Qwen3) and fed to a nonlinear regression head, and adding handcrafted context features on top further improves performance substantially. Experiments show the method attains a good-to-bad context ratio of up to 440:1 while discarding only 70% of the good contexts, demonstrating that a modern embedding model under human supervision can supply a near-perfect stock of instructional contexts at low cost.

Link: https://arxiv.org/abs/2602.18326
Authors: Tao Wu, Adam Kapelner
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 3 figures, 4 tables

Abstract: We describe a modern deep learning system that automatically identifies informative contextual examples ("contexts") for first-language vocabulary instruction for high school students. Our paper compares three modeling approaches: (i) an unsupervised similarity-based strategy using MPNet's uniformly contextualized embeddings, (ii) a supervised framework built on instruction-aware, fine-tuned Qwen3 embeddings with a nonlinear regression head, and (iii) model (ii) plus handcrafted context features. We introduce a novel metric called the Retention Competency Curve to visualize trade-offs between the discarded proportion of good contexts and the "good-to-bad" contexts ratio, providing a compact, unified lens on model performance. Model (iii) delivers the most dramatic gains, achieving a good-to-bad ratio of 440 while throwing out only 70% of the good contexts. In summary, we demonstrate that a modern embedding model on a neural network architecture, when guided by human supervision, yields a low-cost, large supply of near-perfect contexts for teaching vocabulary for a variety of target words.
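The Retention Competency Curve can be sketched as a threshold sweep over model scores; the score/label conventions and toy data below are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def retention_competency_curve(scores, labels, thresholds):
    """For each score threshold, report (fraction of good contexts discarded,
    good-to-bad ratio among retained contexts).

    Sketch of the paper's metric under assumed conventions: higher score
    means the model likes the context, label True means a genuinely good one.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels, dtype=bool)
    curve = []
    for t in thresholds:
        keep = scores >= t
        good_kept = (keep & labels).sum()
        bad_kept = (keep & ~labels).sum()
        discarded_good = 1.0 - good_kept / labels.sum()
        ratio = good_kept / max(bad_kept, 1)  # avoid division by zero
        curve.append((float(discarded_good), float(ratio)))
    return curve

# Hypothetical scores/labels for five candidate contexts, one threshold.
curve = retention_competency_curve([0.9, 0.8, 0.7, 0.2, 0.1],
                                   [1, 1, 1, 0, 0], [0.5])
```

Sweeping many thresholds traces the full curve, making the trade-off between discarding good contexts and purifying the retained pool directly visible.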

[NLP-6] PsihoRo: Depression and Anxiety Romanian Text Corpus LREC2026

Quick Read: This paper addresses the absence of an open mental-health corpus for Romanian natural language processing (NLP), particularly one for depression and anxiety. The key is a collection methodology combining a questionnaire of 6 open-ended questions with the standardized PHQ-9 (Patient Health Questionnaire-9) and GAD-7 (Generalized Anxiety Disorder-7) self-report screening instruments, yielding texts from 205 respondents and producing PsihoRo, the first Romanian mental-health corpus. This approach sidesteps the unfounded assumptions inherent in social-media data collection and provides an extensible foundation for text-based emotion analysis, topic modeling, and mental-state detection.

Link: https://arxiv.org/abs/2602.18324
Authors: Alexandra Ciobotaru, Ana-Maria Bucur, Liviu P. Dinu
Affiliations: University of Bucharest; DRUID AI; Universita della Svizzera Italiana; Human Language Technologies Research Center
Categories: Computation and Language (cs.CL)
Comments: This article was accepted at LREC 2026

Abstract: Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health. These texts allow researchers to study psychological constructs, detect mental health issues, and analyze emotional language. However, mental health data can be difficult to collect correctly from social media, due to suppositions made by the collectors. A more pragmatic strategy involves gathering data through open-ended questions and then assessing this information with self-report screening surveys. This method was employed successfully for English, a language with many psychological NLP resources. The same cannot be said for Romanian, which currently has no open-source mental health corpus. To address this gap, we have created the first corpus for depression and anxiety in Romanian, by utilizing a form with 6 open-ended questions along with the standardized PHQ-9 and GAD-7 screening questionnaires. Consisting of the texts of 205 respondents, PsihoRo may seem small, but it is a first step towards understanding and analyzing texts regarding the mental health of the Romanian population. We employ statistical analysis, text analysis using Romanian LIWC, emotion detection, and topic modeling to show the most important features of this newly introduced resource for the NLP community.

[NLP-7] VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean

Quick Read: This paper addresses the poor performance of current large language models (LLMs) in formal verification, specifically their failure to adapt to the definition-rich codebases and cross-file dependencies of real software-verification projects; existing benchmarks draw mainly on Mathlib-style mathematical proofs. The key is VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments that preserves realistic repository structure and cross-file dependencies. Evaluation in this repository-centric setting reveals why LLM performance degrades, including sensitivity to multi-hop dependency closures and the importance of selecting the right contextual information, charting a clear direction for future proof-automation tools tailored to software verification.

Link: https://arxiv.org/abs/2602.18307
Authors: Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig
Affiliations: unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:

Abstract:Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof’s dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at this https URL.

[NLP-8] On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction

Quick Read: This paper targets the inefficiency of autoregressive text generation in large language models (LLMs), which require n forward passes to emit a sequence of length n. It examines the possibility that a frozen LLM can reconstruct hundreds of tokens in a single forward pass from just two learned proto-tokens, challenging the autoregressive paradigm. The key contributions are an analysis of what information the two proto-tokens (the m-token and e-token) encode and how stable they are under constraints, plus two regularization strategies, an anchor-based loss and a relational distillation objective, for imposing semantic structure on the e-token space without sacrificing reconstruction quality, supporting the feasibility of non-autoregressive seq2seq systems that use proto-tokens as an intermediate representation.

Link: https://arxiv.org/abs/2602.18301
Authors: Ivan Bondarenko, Egor Palkin, Fedor Tikunov
Affiliations: Novosibirsk State University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Autoregressive large language models (LLMs) generate text token-by-token, requiring n forward passes to produce a sequence of length n. Recent work, Exploring the Latent Capacity of LLMs for One-Step Text Reconstruction (Mezentsev and Oseledets), shows that frozen LLMs can reconstruct hundreds of tokens from only two learned proto-tokens in a single forward pass, suggesting a path beyond the autoregressive paradigm. In this paper, we study what information these proto-tokens encode and how they behave under reconstruction and controlled constraints. We perform a series of experiments aimed at disentangling semantic and syntactic content in the two proto-tokens, analyzing stability properties of the e-token, and visualizing attention patterns to the e-token during reconstruction. Finally, we test two regularization schemes for “imposing” semantic structure on the e-token using teacher embeddings, including an anchor-based loss and a relational distillation objective. Our results indicate that the m-token tends to capture semantic information more strongly than the e-token under standard optimization; anchor-based constraints trade off sharply with reconstruction accuracy; and relational distillation can transfer batch-level semantic relations into the proto-token space without sacrificing reconstruction quality, supporting the feasibility of future non-autoregressive seq2seq systems that predict proto-tokens as an intermediate representation.

[NLP-9] Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

Quick Read: This paper studies the monitorability of chain-of-thought (CoT) monitors that detect attributes of interest in large language model (LLM) outputs, such as test-hacking during code generation. The analysis shows that non-zero mutual information between CoT and output is a necessary but not sufficient condition for monitorability, and that practical performance is limited by two error sources: the information gap, the monitor's limited ability to extract usable information from the CoT, and elicitation error, the monitor's deviation from the optimal monitoring function. The key is to systematically improve monitorability with targeted training objectives: an oracle-based method that directly rewards the monitored model for CoTs that maximize monitor accuracy, and a more practical label-free method that maximizes the conditional mutual information between outputs and CoTs. Experiments show both methods significantly improve monitor accuracy and prevent CoT degeneration even under adversarial training, mitigating reward hacking when the task reward is imperfectly specified.

Link: https://arxiv.org/abs/2602.18297
Authors: Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments: First two authors contributed equally

Abstract:Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation. In this paper, we use information-theoretic analysis to show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability. We identify two sources of approximation error that may undermine the performance of CoT monitors in practice: information gap, which measures the extent to which the monitor can extract the information available in CoT, and elicitation error, which measures the extent to which the monitor approximates the optimal monitoring function. We further demonstrate that CoT monitorability can be systematically improved through targeted training objectives. To this end, we propose two complementary approaches: (a) an oracle-based method that directly rewards the monitored model for producing CoTs that maximize monitor accuracy, and (b) a more practical, label-free approach that maximizes conditional mutual information between outputs and CoTs. Across multiple different environments, we show both methods significantly improve monitor accuracy while preventing CoT degeneration even when training against a monitor, thereby mitigating reward hacking when the task reward is imperfectly specified.
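The necessary condition the paper starts from, non-zero mutual information between CoT and output, is easy to check on a discrete toy joint distribution. The two 2x2 tables below are illustrative, not from the paper:

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in nats from a joint probability table p(x, y)."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over CoT "symbols"
    py = joint.sum(axis=0, keepdims=True)   # marginal over outputs
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

# Independent CoT and output -> zero MI: no monitor can succeed.
indep = np.array([[0.25, 0.25],
                  [0.25, 0.25]])
# CoT perfectly predictive of the output -> MI = log 2.
dep = np.array([[0.5, 0.0],
                [0.0, 0.5]])
```

Even in the second case, monitorability is not guaranteed: the information gap and elicitation error the paper identifies can still keep a concrete monitor from exploiting the available information.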

[NLP-10] Simplifying Outcomes of Language Model Component Analyses with ELIA EACL2026

Quick Read: This paper addresses the accessibility gap in mechanistic interpretability of large language models (LLMs): current analysis tools serve specialists and are hard for non-experts to understand and use. The key is ELIA (Explainable Language Interpretability Analysis), an interactive web application that integrates three key techniques, Attribution Analysis, Function Vector Analysis, and Circuit Tracing, and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations these methods produce. A mixed-methods user study shows that the interactive interface and AI-generated explanations substantially lower the barrier to comprehension, with no significant correlation between prior LLM experience and comprehension scores, especially benefiting non-experts, validating AI-assisted explanation paired with user-centered design as an effective path to accessible model analysis.

链接: https://arxiv.org/abs/2602.18262
作者: Aaron Louis Eidt,Nils Feldhus
机构: Technische Universität Berlin (柏林工业大学); Fraunhofer Heinrich Hertz Institute (弗劳恩霍夫海因里希·赫兹研究所); BIFOLD – Berlin Institute for the Foundations of Learning and Data (柏林学习与数据基础研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EACL 2026 System Demonstrations. GitHub: this https URL

Abstract:While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques – Attribution Analysis, Function Vector Analysis, and Circuit Tracing – and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations helped bridge the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user’s prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.

[NLP-11] Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning

Quick Read: This paper addresses an inefficiency in large language model (LLM) reasoning caused by highly localized uncertainty: a small number of low-confidence tokens disproportionately trigger errors and inflate output length. The key to the solution is Confidence-Driven Contrastive Decoding (CCD), which identifies low-confidence positions during decoding, constructs a contrastive reference distribution by replacing high-confidence tokens with minimal placeholders, and subtracts this reference from the prediction distribution at those positions for targeted intervention. The method is training-free, significantly improves mathematical reasoning accuracy while substantially reducing output length, and introduces only minimal KV-cache overhead.

Link: https://arxiv.org/abs/2602.18232
Authors: Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin, Bang Yang, Yuexian Zou
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence-driven contrastive decoding approach that improves reasoning reliability through targeted token-level intervention. Our method, Confidence-Driven Contrastive Decoding, detects low-confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low-confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead. As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy. Our code will be made available at: this https URL.
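
The subtraction step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the confidence threshold `tau`, the contrast weight `alpha`, and the way the reference logits are obtained are all assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ccd_step(logits, ref_logits, tau=0.9, alpha=1.0):
    """One decoding step in the spirit of CCD: intervene only at
    low-confidence positions by subtracting a contrastive reference."""
    p = softmax(logits)
    confidence = max(p)
    if confidence >= tau:                # high confidence: decode greedily
        return p.index(confidence)
    q = softmax(ref_logits)              # contrastive reference distribution
    scores = [math.log(pi + 1e-12) - alpha * math.log(qi + 1e-12)
              for pi, qi in zip(p, q)]
    return scores.index(max(scores))
```

On a low-confidence step, a token the reference also assigns high probability to is penalized, so decoding prefers tokens whose evidence comes from the full context rather than from the degenerate reference.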

[NLP-12] Information-Theoretic Storage Cost in Sentence Comprehension

Quick Read: This paper addresses how to quantify the working-memory storage cost of real-time syntactic processing in sentence comprehension. Traditional approaches rely on symbolic grammar models that assign discrete, uniform cost estimates and struggle to capture the continuous, context-dependent nature of actual language processing. The key to the solution is an information-theoretic formalization that measures storage cost as the amount of information previous words carry about future context under uncertainty; this measure is continuous, theory-neutral, and can be estimated directly from pretrained neural language models, better reflecting cognitive load during real language processing.

Link: https://arxiv.org/abs/2602.18217
Authors: Kohei Kajikawa, Shinnosuke Isono, Ethan Gotlieb Wilcox
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have been formalized, largely, using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors.
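
One plausible formalization of the quantity described above, reconstructed from the abstract's wording (the paper's exact notation may differ): the storage cost after word $t$ is the information the prefix carries about the upcoming material,

```latex
\mathrm{SC}(t) \;=\; I(W_{1:t};\, W_{t+1:T})
\;=\; H(W_{t+1:T}) \;-\; H(W_{t+1:T} \mid W_{1:t}),
```

which a neural language model can estimate by comparing its uncertainty over continuations with and without conditioning on the prefix.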

[NLP-13] Improving Sampling for Masked Diffusion Models via Information Gain

Quick Read: This paper addresses the excessive cumulative uncertainty incurred by greedy heuristics in the decoding of Masked Diffusion Models (MDMs). Existing samplers prioritize the position with the highest local certainty at each step but ignore how the current decision affects subsequent steps, failing to exploit the non-causal nature of MDMs, in which one decoding decision reshapes the token probabilities and uncertainty at all remaining masked positions. The key to the solution is the Info-Gain Sampler, a principled framework that guides decoding by balancing immediate uncertainty against information gain over future masked tokens, thereby more effectively minimizing overall cumulative uncertainty. Experiments across diverse tasks (reasoning, coding, creative writing, and image generation) show consistent improvements over existing samplers; on reasoning tasks in particular it reduces cumulative uncertainty from 78.4 to 48.6 and improves average accuracy by 3.6%.

Link: https://arxiv.org/abs/2602.18176
Authors: Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: this https URL

Abstract:Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to decode at each step. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative uncertainty. In particular, these methods do not fully exploit the non-causal nature of MDMs, which enables evaluating how a decoding decision reshapes token probabilities/uncertainty across all remaining masked positions. To bridge this gap, we propose the Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs. For instance, it achieves a 3.6% improvement in average accuracy on reasoning tasks and a 63.1% win-rate in creative writing. Notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin. The code will be available at this https URL.
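
The trade-off between local certainty and downstream information gain can be sketched as follows. This is an assumed form: the actual Info-Gain Sampler's scoring and the weight `lam` may differ, and `refresh` stands in for re-running the non-causal MDM with one token committed.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def info_gain_scores(dists, refresh, lam=1.0):
    """Score each masked position by its immediate certainty plus the
    expected entropy reduction its decoding induces elsewhere."""
    n = len(dists)
    scores = []
    for i in range(n):
        h_now = sum(entropy(dists[j]) for j in range(n) if j != i)
        exp_h_after = 0.0
        for v, pv in enumerate(dists[i]):
            if pv > 0:
                new = refresh(i, v)   # model re-run with token v fixed at i
                exp_h_after += pv * sum(entropy(new[j]) for j in range(n) if j != i)
        gain = h_now - exp_h_after    # expected information gain
        scores.append(-entropy(dists[i]) + lam * gain)
    return scores

# Toy demo: committing position 0 fully determines position 1, but not vice versa.
dists = [[0.5, 0.5], [0.5, 0.5]]
def refresh(i, v):
    return [dists[0], [1.0, 0.0]] if i == 0 else dists
demo = info_gain_scores(dists, refresh)
```

Note that a purely greedy, certainty-based sampler would see a tie here (both positions have equal entropy); the information-gain term breaks it in favor of the position whose decoding resolves the most downstream uncertainty.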

[NLP-14] Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models

Quick Read: This paper targets the degradation of online information quality caused by clickbait headlines, which undermine user trust. The key to the solution is a hybrid detection approach combining transformer-based text embeddings with linguistically motivated informativeness features: an XGBoost classifier over embeddings augmented with 15 explicit linguistic features (e.g., second-person pronouns, superlatives, numerals, and attention-oriented punctuation) achieves an F1-score of 91%, while the explicit feature set improves the interpretability and transparency of the model's predictions.

Link: https://arxiv.org/abs/2602.18171
Authors: Wojciech Michaluk, Tymoteusz Urban, Mateusz Kubita, Soveatin Kuntur, Anna Wroblewska
Affiliation: Warsaw University of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Clickbait headlines degrade the quality of online information and undermine user trust. We present a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features. Using natural language processing techniques, we evaluate classical vectorizers, word embedding baselines, and large language model embeddings paired with tree-based classifiers. Our best-performing model, XGBoost over embeddings augmented with 15 explicit features, achieves an F1-score of 91%, outperforming TF-IDF, Word2Vec, GloVe, LLM prompt based classification, and feature-only baselines. The proposed feature set enhances interpretability by highlighting salient linguistic cues such as second-person pronouns, superlatives, numerals, and attention-oriented punctuation, enabling transparent and well-calibrated clickbait predictions. We release code and trained models to support reproducible research.
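
The explicit linguistic cues named above are cheap to compute. A hypothetical extractor for four of the fifteen features (the paper's actual feature definitions may differ):

```python
import re

SUPERLATIVES = {"best", "worst", "most", "greatest", "ultimate", "craziest"}

def clickbait_features(headline: str) -> dict:
    """Count a few salient linguistic cues in a headline."""
    tokens = re.findall(r"[a-z']+", headline.lower())
    return {
        "second_person": sum(t in {"you", "your", "yours"} for t in tokens),
        "superlatives": sum(t in SUPERLATIVES for t in tokens),
        "numerals": len(re.findall(r"\d+", headline)),
        "attn_punct": headline.count("!") + headline.count("?"),
    }
```

In the hybrid setup, counts like these would be concatenated with the headline's embedding vector before being fed to the XGBoost classifier.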

[NLP-15] FENCE: A Financial and Multimodal Jailbreak Detection Dataset LREC2026

Quick Read: This paper addresses the scarcity of resources for detecting jailbreak attacks against Vision Language Models (VLMs) in the financial domain. Because VLMs process both text and images, their attack surface is broader, and the risks are especially acute in financial settings. The key to the solution is FENCE, a bilingual (Korean-English) multimodal dataset that pairs finance-relevant queries with image-grounded threats, emphasizing domain realism and providing a reliable benchmark for training and evaluating jailbreak detectors. Experiments show that FENCE is effective for both commercial and open-source VLMs: a baseline detector trained on it reaches 99% in-distribution accuracy and generalizes strongly to external benchmarks, demonstrating its value for advancing safe, trustworthy AI systems in finance.

Link: https://arxiv.org/abs/2602.18154
Authors: Mirae Kim, Seonghun Jeong, Youngjun Kwak
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: LREC 2026 accepted paper

Abstract:Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset’s robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

[NLP-16] The Statistical Signature of LLMs

Quick Read: This paper investigates how generative language models reshape the structural statistical organization of language, and in particular how the structural differences introduced by probabilistic sampling can be identified from surface text alone. The key to the solution is lossless compression as a model-agnostic measure of statistical regularity: differences in compressibility directly distinguish human-written from model-generated text. Across controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social environments (Moltbook vs. Reddit), LLM-generated text exhibits higher structural regularity and compressibility, consistent with output concentrated in highly recurrent statistical patterns. The separation is scale-dependent, attenuating in small-scale fragmented interaction settings, which reveals a fundamental limit to surface-level distinguishability. The method requires neither model internals nor semantic evaluation, providing a simple, robust, structure-based quantitative framework.

Link: https://arxiv.org/abs/2602.18152
Authors: Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi
Affiliation: Università di Roma La Sapienza
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
Comments:

Abstract:Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
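
The core measurement is just a compression ratio. A minimal sketch using DEFLATE via `zlib` (the paper's choice of compressor is not specified here; any lossless compressor yields a comparable regularity signal):

```python
import random
import string
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size over raw size; lower means more statistical regularity."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

# Highly recurrent text vs. a seeded quasi-random string of the same length.
repetitive = "the model predicts the next token " * 40
rng = random.Random(0)
varied = "".join(rng.choice(string.printable) for _ in range(len(repetitive)))
```

Under the paper's finding, LLM-generated corpora would sit closer to the `repetitive` end of this scale than matched human-written text.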

[NLP-17] Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention

Quick Read: This paper addresses hallucination detection in context-based generation with large language models (LLMs) to improve output reliability. Existing approaches rely on intrinsic signals available during generation, such as attention, but typically use only coarse summaries that fail to capture fine-grained instabilities in attention distributions. The key innovation is a signal-processing perspective: attention distributions are modeled as discrete signals and their high-frequency components extracted, revealing that hallucinated tokens are associated with high-frequency attention energy, reflecting fragmented and unstable grounding. Building on this insight, the authors develop a lightweight hallucination detector based on high-frequency attention features that outperforms verification-based, internal-representation-based, and attention-based methods on the RAGTruth and HalluRAG benchmarks.

Link: https://arxiv.org/abs/2602.18145
Authors: Siya Qi, Yudong Chen, Runcong Zhao, Qinglin Zhu, Zhanghao Hu, Wei Liu, Yulan He, Zheng Yuan, Lin Gui
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 25 pages, 10 figures

Abstract:Hallucination detection is critical for ensuring the reliability of large language models (LLMs) in context-based generation. Prior work has explored intrinsic signals available during generation, among which attention offers a direct view of grounding behavior. However, existing approaches typically rely on coarse summaries that fail to capture fine-grained instabilities in attention. Inspired by signal processing, we introduce a frequency-aware perspective on attention by analyzing its variation during generation. We model attention distributions as discrete signals and extract high-frequency components that reflect rapid local changes in attention. Our analysis reveals that hallucinated tokens are associated with high-frequency attention energy, reflecting fragmented and unstable grounding behavior. Based on this insight, we develop a lightweight hallucination detector using high-frequency attention features. Experiments on the RAGTruth and HalluRAG benchmarks show that our approach achieves performance gains over verification-based, internal-representation-based, and attention-based methods across models and tasks.
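
The frequency decomposition can be illustrated with a plain DFT over a 1-D attention signal. This is a sketch of the general idea only; the relative cutoff of 0.25 and the normalization are assumptions, not the paper's settings.

```python
import cmath

def high_freq_energy(signal, cutoff=0.25):
    """Spectral energy of a 1-D attention signal above a relative
    frequency cutoff, with the DC component removed first."""
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]
    energy = 0.0
    for k in range(1, n // 2 + 1):
        if k / n >= cutoff:
            coef = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                       for t in range(n))
            energy += abs(coef) ** 2
    return energy / n

smooth = [t / 16 for t in range(16)]    # slowly drifting attention
jittery = [t % 2 for t in range(16)]    # rapidly alternating attention
```

Per the paper's finding, tokens whose attention behaves like `jittery` (fragmented, unstable grounding) are the ones more likely to be hallucinated.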

[NLP-18] Agentic Adversarial QA for Improving Domain-Specific LLMs

Quick Read: This paper tackles two challenges in adapting large language models (LLMs) to specialized domains: existing synthetic data generation methods (such as paraphrasing or knowledge extraction) do little to improve interpretive reasoning in specialized domains, and the resulting corpora are often redundant and excessively large, yielding poor sample efficiency. The key to the solution is an adversarial question-generation framework that, through an iterative feedback loop, compares the outputs of the model being adapted against a robust expert model grounded in reference documents, identifies comprehension gaps, and generates a compact set of semantically challenging questions, thereby improving accuracy and sample efficiency on specialized-domain tasks.

Link: https://arxiv.org/abs/2602.18137
Authors: Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 1 figure

Abstract:Large Language Models (LLMs), despite extensive pretraining on broad internet corpora, often struggle to adapt effectively to specialized domains. There is growing interest in fine-tuning these models for such domains; however, progress is constrained by the scarcity and limited coverage of high-quality, task-relevant data. To address this, synthetic data generation methods such as paraphrasing or knowledge extraction are commonly applied. Although these approaches excel at factual recall and conceptual knowledge, they suffer from two critical shortcomings: (i) they provide minimal support for interpretive reasoning capabilities in these specialized domains, and (ii) they often produce synthetic corpora that are excessively large and redundant, resulting in poor sample efficiency. To overcome these gaps, we propose an adversarial question-generation framework that produces a compact set of semantically challenging questions. These questions are constructed by comparing the outputs of the model to be adapted and a robust expert model grounded in reference documents, using an iterative, feedback-driven process designed to reveal and address comprehension gaps. Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
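
Stripped to its core, the selection principle is a disagreement filter between the model being adapted and the expert. A toy sketch; the agreement function, the budget, and the iterative refinement wrapped around this loop are all assumptions:

```python
def select_adversarial(questions, student, expert, agree, budget=5):
    """Keep the questions on which the adapted model diverges most
    from the reference-grounded expert."""
    scored = [(q, agree(student(q), expert(q))) for q in questions]
    scored.sort(key=lambda pair: pair[1])      # lowest agreement first
    return [q for q, _ in scored[:budget]]

# Toy demo with stand-in models: the expert always answers 1.
questions = ["a", "bb", "ccc", "dddd"]
picked = select_adversarial(
    questions,
    student=lambda q: len(q) % 2,
    expert=lambda q: 1,
    agree=lambda a, b: 1.0 if a == b else 0.0,
    budget=2,
)
```

The retained questions concentrate the training signal where comprehension gaps exist, which is what makes the resulting synthetic set compact.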

[NLP-19] Perceived Political Bias in LLMs Reduces Persuasive Abilities

Quick Read: This paper asks whether the persuasiveness of generative AI in correcting public misconceptions and countering misinformation depends on perceptions of its political neutrality. In a preregistered U.S. survey experiment (N=2144), participants held a three-round conversation with ChatGPT about a personally held economic policy misconception. Compared to a neutral control, a short message indicating that the LLM was biased against the respondent's party attenuated persuasion by 28%. Transcript analysis further shows that such warnings alter the interaction, making respondents more adversarial and less receptive. The persuasive impact of conversational AI is thus politically contingent, constrained by perceptions of partisan alignment.

Link: https://arxiv.org/abs/2602.18092
Authors: Matthew DiGiuseppe, Joshua Robison
Affiliation: Leiden University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 39 pages, 10 figures

Abstract:Conversational AI has been proposed as a scalable way to correct public misconceptions and spread misinformation. Yet its effectiveness may depend on perceptions of its political neutrality. As LLMs enter partisan conflict, elites increasingly portray them as ideologically aligned. We test whether these credibility attacks reduce LLM-based persuasion. In a preregistered U.S. survey experiment (N=2144), participants completed a three-round conversation with ChatGPT about a personally held economic policy misconception. Compared to a neutral control, a short message indicating that the LLM was biased against the respondent’s party attenuated persuasion by 28%. Transcript analysis indicates that the warnings alter the interaction: respondents push back more and engage less receptively. These findings suggest that the persuasive impact of conversational AI is politically contingent, constrained by perceptions of partisan alignment.

[NLP-20] Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards

Quick Read: This paper addresses reward hacking in reinforcement learning from human feedback (RLHF) and from verifiable rewards (RLVR), where the policy exploits inaccuracies of the reward model to learn unintended behavior. The key to the solution is a new training framing: use gradient regularization (GR) to bias policy updates toward regions where the reward model is more accurate. The authors first derive a theoretical connection between reward-model accuracy and the flatness of the optimum at convergence, and empirically show that gradient norm and reward accuracy are correlated; they further reveal that Reference Resets of the conventional KL penalty implicitly use GR to find flatter, more accurate regions. The paper then proposes explicit GR with an efficient finite-difference estimate, which outperforms the KL penalty across diverse RL experiments with language models: higher GPT-judged win rates in RLHF, less over-focus on format under rule-based math rewards, and prevention of judge-hacking in LLM-as-a-Judge tasks.

Link: https://arxiv.org/abs/2602.18037
Authors: Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 25 pages, 15 figures

Abstract:Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: Train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training to flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We further improve on this by proposing to use explicit GR with an efficient finite-difference estimate. Empirically, GR performs better than a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.
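
To make "gradient regularization with a finite-difference estimate" concrete, here is a naive coordinate-wise version. The paper's estimator is more efficient (it does not require one probe per parameter), so treat this purely as an illustration of the quantity being penalized.

```python
def grad_norm_fd(loss_fn, params, eps=1e-4):
    """Central-difference estimate of the gradient norm ||grad L(theta)||,
    the quantity a GR penalty drives toward flat regions."""
    sq = 0.0
    for i in range(len(params)):
        probe = list(params)
        probe[i] = params[i] + eps
        up = loss_fn(probe)
        probe[i] = params[i] - eps
        down = loss_fn(probe)
        g = (up - down) / (2 * eps)   # central difference along coordinate i
        sq += g * g
    return sq ** 0.5
```

Training would then minimize something like `loss_fn(params) + lam * grad_norm_fd(loss_fn, params)`, biasing updates toward flatter optima where, by the paper's argument, the reward model stays accurate.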

[NLP-21] Towards More Standardized AI Evaluation: From Models to Agents

Quick Read: This paper addresses the failure of current machine-learning evaluation practices when applied to generative AI and tool-using agents: static benchmarks, aggregate scores, and one-off success criteria can no longer meaningfully assess the trustworthy behavior of complex, dynamic, non-deterministic systems. The key to the solution is reframing evaluation itself: from performance theater to a measurement discipline that serves as a core mechanism for conditioning trust, iteration, and governance, ensuring that systems behave as intended as they evolve, scale, and face changing conditions.

Link: https://arxiv.org/abs/2602.18029
Authors: Ali El Filali, Inès Bedar
Affiliation: G42
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 3 figures

Abstract:Evaluation is no longer a final checkpoint in the machine learning lifecycle. As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. The question is no longer “How good is the model?” but “Can we trust the system to behave as intended, under change, at scale?”. Yet most evaluation practices remain anchored in assumptions inherited from the model-centric era: static benchmarks, aggregate scores, and one-off success criteria. This paper argues that such approaches are increasingly obscure rather than illuminating system behavior. We examine how evaluation pipelines themselves introduce silent failure modes, why high benchmark scores routinely mislead teams, and how agentic systems fundamentally alter the meaning of performance measurement. Rather than proposing new metrics or harder benchmarks, we aim to clarify the role of evaluation in the AI era, and especially for agents: not as performance theater, but as a measurement discipline that conditions trust, iteration, and governance in non-deterministic systems.

[NLP-22] NIMMGen: Learning Neural-Integrated Mechanistic Digital Twins with LLMs

Quick Read: This paper addresses the unreliability of current LLM-generated mechanistic models in real-world settings, particularly under partial observations and diversified task objectives, where both effectiveness and code-level correctness are lacking. The key to the solution is the Neural-Integrated Mechanistic Modeling (NIMM) evaluation framework, which systematically tests LLM-generated models under realistic conditions, together with NIMMgen, an agentic framework that improves code correctness and practical validity through iterative refinement, enabling more reliable mechanistic modeling and counterfactual intervention simulation.

Link: https://arxiv.org/abs/2602.18008
Authors: Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang, Prasanna Balachandran, Sheng Li, Anil Vullikanti
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 6 figures

Abstract:Mechanistic models encode scientific knowledge about dynamical systems and are widely used in downstream scientific and policy applications. Recent work has explored LLM-based agentic frameworks to automatically construct mechanistic models from data; however, existing problem settings substantially oversimplify real-world conditions, leaving it unclear whether LLM-generated mechanistic models are reliable in practice. To address this gap, we introduce the Neural-Integrated Mechanistic Modeling (NIMM) evaluation framework, which evaluates LLM-generated mechanistic models under realistic settings with partial observations and diversified task objectives. Our evaluation reveals fundamental challenges in current baselines, ranging from model effectiveness to code-level correctness. Motivated by these findings, we design NIMMgen, an agentic framework for neural-integrated mechanistic modeling that enhances code correctness and practical validity through iterative refinement. Experiments across three datasets from diversified scientific domains demonstrate its strong performance. We also show that the learned mechanistic models support counterfactual intervention simulation.
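
For readers unfamiliar with "mechanistic models of dynamical systems," a canonical example is a compartmental epidemic model. The Euler step below is illustrative only and is not taken from the paper's datasets.

```python
def sir_step(s, i, r, beta=0.3, gamma=0.1, dt=1.0):
    """One Euler step of the classic SIR model:
    ds/dt = -beta*s*i/n,  di/dt = beta*s*i/n - gamma*i,  dr/dt = gamma*i."""
    n = s + i + r
    new_inf = beta * s * i / n * dt   # new infections this step
    new_rec = gamma * i * dt          # new recoveries this step
    return s - new_inf, i + new_inf - new_rec, r + new_rec
```

A neural-integrated variant in the spirit of NIMM would replace, e.g., the fixed `beta` with a learned function of observed covariates while keeping the interpretable compartmental structure intact.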

[NLP-23] CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Quick Read: This paper addresses the difficulty of using UMLS Concept Unique Identifiers (CUIs) from clinical named entity recognition tools to directly support downstream tasks: existing methods handle single CUIs, whereas clinical applications typically require concept sets composed of related synonyms, subtypes, and supertypes. Building such sets manually is labor-intensive, inconsistent, and poorly supported by existing tools. The key to the solution is CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework: semantic retrieval over a UMLS knowledge graph (KG) yields candidate CUIs, which a large language model (LLM) then filters and classifies; GPT-5-mini achieves higher recall during filtering, while GPT-5's classifications align more closely with clinician judgments. The approach produces substantially larger and more complete concept sets while matching human precision and remaining computationally inexpensive, offering a scalable, reproducible route to automated concept-set curation for clinical NLP.

Link: https://arxiv.org/abs/2602.17949
Authors: Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 30 pages, 6 figures, 4 tables

Abstract:Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and supertypes. Constructing such concept sets is labour-intensive, inconsistently performed, and poorly supported by existing tools, particularly for NLP pipelines that operate directly on UMLS CUIs. Methods We present CUICurate, a Graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. For each target concept, candidate CUIs were retrieved from the KG, followed by large language model (LLM) filtering and classification steps comparing two LLMs (GPT-5 and GPT-5-mini). The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets. Results Across all concepts, CUICurate produced substantially larger and more complete concept sets than the manual benchmarks whilst matching human precision. Comparisons between the two LLMs found that GPT-5-mini achieved higher recall during filtering, while GPT-5 produced classifications that more closely aligned with clinician judgements. Outputs were stable across repeated runs and computationally inexpensive. Conclusions CUICurate offers a scalable and reproducible approach to support UMLS concept set curation that substantially reduces manual effort. By integrating graph-based retrieval with LLM reasoning, the framework produces focused candidate concept sets that can be adapted to clinical NLP pipelines for different phenotyping and analytic requirements.
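
The retrieval stage of such a pipeline reduces to nearest-neighbor search over node embeddings. A toy sketch with made-up CUIs and 2-D vectors; real UMLS embeddings and the subsequent LLM filtering/classification steps are omitted:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_candidates(query_vec, node_embeddings, k=3):
    """Rank KG nodes (CUIs) by cosine similarity to the query embedding."""
    ranked = sorted(node_embeddings.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [cui for cui, _ in ranked[:k]]

kg = {"C0001": [1.0, 0.0], "C0002": [0.0, 1.0], "C0003": [0.9, 0.1]}
top = retrieve_candidates([1.0, 0.0], kg, k=2)
```

In CUICurate, this candidate list would then be passed to the LLM, which filters out spurious matches and classifies the survivors into the concept set.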

[NLP-24] Analyzing LLM Instruction Optimization for Tabular Fact Verification

Quick Read: This paper aims to improve the accuracy and stability of large language models (LLMs) on tabular fact verification through instruction optimization. The key to the solution is the first systematic comparison, based on the DSPy optimization framework, of multiple prompting techniques (both text-only and code-executing) with three optimizers (COPRO, MiPROv2, and SIMBA). Instruction optimization consistently improves verification accuracy: MiPROv2 yields the most stable gains for Chain-of-Thought (CoT) prompting, while SIMBA provides the largest benefits for ReAct agents (especially with SQL tools), particularly at larger model scales. Behavioral analysis further shows that SIMBA introduces heuristics that encourage more direct reasoning paths, improving numerical comparison in CoT reasoning and reducing unnecessary tool calls in ReAct agents.

Link: https://arxiv.org/abs/2602.17937
Authors: Xiaotang Du, Giwon Hong, Wai-Chung Kwan, Rohit Saxena, Ivan Titov, Pasquale Minervini, Emily Allaway
Affiliation: University of Edinburgh; Miniml.AI
Subjects: Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments:

Abstract:Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three optimizers from the DSPy framework – COPRO, MiPROv2, and SIMBA – across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, thereby improving numerical comparison abilities in CoT reasoning and helping avoid unnecessary tool calls in ReAct agents. Across different prompting techniques, CoT remains effective for tabular fact checking, especially with smaller models. Although ReAct agents built with larger models can achieve competitive performance, they require careful instruction optimization.

[NLP-25] Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

Quick Read: This paper addresses the absence of conditional-reasoning modeling in current biomedical question answering (QA) systems. Existing systems typically assume medical knowledge applies uniformly, yet real clinical decisions depend heavily on patient-specific factors such as comorbidities and contraindications; existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge applies to the given context. The authors introduce CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions, and propose Condition-Gated Reasoning (CGR), whose key idea is to construct condition-aware knowledge graphs and selectively activate or prune reasoning paths based on query conditions, making knowledge-based reasoning explicitly dependent on the input conditions and enabling more reliable selection of condition-appropriate answers.

Link: https://arxiv.org/abs/2602.17911
Authors: Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman, Pengcheng Jiang, Chih-Hsuan Wei, Zhizheng Wang, Zhiyong Lu, Jiawei Han
Affiliation: University of Illinois Urbana-Champaign; National Institutes of Health
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.
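
The gating idea can be sketched as a filter over reasoning paths whose edges carry applicability conditions. The edge schema here (`requires`/`excludes` keys) is invented for illustration; CGR's actual graph construction is richer.

```python
def gate_paths(paths, patient_conditions):
    """Keep only paths whose every edge is applicable: all required
    conditions present, no contraindicated condition present."""
    q = set(patient_conditions)
    return [
        path for path in paths
        if all(set(edge.get("requires", [])) <= q
               and not set(edge.get("excludes", [])) & q
               for edge in path)
    ]

# Toy demo: an aspirin path is contraindicated for this patient.
aspirin = [{"advice": "give aspirin", "excludes": ["bleeding_disorder"]}]
paracetamol = [{"advice": "give acetaminophen"}]
kept = gate_paths([aspirin, paracetamol], ["bleeding_disorder"])
```

The surviving paths are exactly those whose conclusions are condition-appropriate, which is the behavior CondMedQA is designed to test.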

[NLP-26] Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions

Quick Read: This paper addresses a limitation of traditional neural topic models, which are optimized only to reconstruct a document's Bag-of-Words (BoW) representation, overlooking contextual information and struggling under data sparsity. The key to the solution is using language models (LMs) to construct semantically grounded soft-label supervision: next-token probabilities, conditioned on a specialized prompt, are projected onto a predefined vocabulary to obtain contextually enriched soft labels; the topic model is then trained to reconstruct these soft labels from the LM hidden states, yielding higher-quality topics more closely aligned with the corpus's thematic structure. Experiments on three datasets show substantial gains in topic coherence and purity over existing baselines, and a newly introduced retrieval-based metric shows the method is significantly better at identifying semantically similar documents.

Link: https://arxiv.org/abs/2602.17907
Authors: Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures

Abstract:Traditional neural topic models are typically optimized by reconstructing the document’s Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training the topic models to reconstruct the soft labels using the LM hidden states, our method produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Experiments on three datasets show that our method achieves substantial improvements in topic coherence, purity over existing baselines. Additionally, we also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
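
Reconstructing a soft-label target typically means minimizing a divergence between distributions rather than a BoW cross-entropy. A minimal sketch of such a loss (the paper's exact objective is not specified here):

```python
import math

def soft_label_kl(target, pred, eps=1e-12):
    """KL(target || pred) between the LM-derived soft label distribution
    and the topic model's reconstruction over the same vocabulary."""
    return sum(t * (math.log(t + eps) - math.log(p + eps))
               for t, p in zip(target, pred) if t > 0)
```

Unlike a sparse BoW target, the soft label spreads probability mass over contextually related words, so the gradient signal remains informative even for short or sparse documents.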

[NLP-27] Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations ICLR2025

【Quick Read】: This paper investigates why steering vectors, a lightweight method for controlling language model behavior, are reliable for some target behaviors but not others. The key findings are threefold: higher cosine similarity between training activation differences predicts more reliable steering; behavior datasets whose positive and negative samples are better separated along the steering direction are more reliably steerable; and steering vectors trained on different prompt variations are directionally distinct yet perform similarly well, with correlated efficacy across datasets. The study concludes that steering fails when the linear steering direction cannot effectively approximate a non-linear latent behavior representation. Together, these insights offer a practical diagnostic for steering unreliability and motivate more robust steering methods that explicitly model non-linear behavior representations.

Link: https://arxiv.org/abs/2602.17881
Authors: Joschka Braun
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Master’s Thesis, University of Tübingen. 89 pages, 34 figures. Portions of this work were published at the ICLR 2025 Workshop on Foundation Models in the Wild (see arXiv:2505.22637)

Click to view abstract

Abstract:Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more reliable steering. Second, I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable. Finally, steering vectors trained on different prompt variations are directionally distinct, yet perform similarly well and exhibit correlated efficacy across datasets. My findings suggest that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction. Taken together, these insights offer a practical diagnostic for steering unreliability and motivate the development of more robust steering methods that explicitly account for non-linear latent behavior representations.
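The first diagnostic, cosine similarity between per-sample training activation differences, is straightforward to compute. A hedged sketch with synthetic activation differences (the dimensions and noise levels are illustrative, not drawn from the thesis):

```python
import numpy as np

def pairwise_cosine_mean(diff_vectors):
    """Mean pairwise cosine similarity between per-sample activation
    differences; higher values are reported to predict reliable steering."""
    X = np.asarray(diff_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    mask = ~np.eye(len(X), dtype=bool)        # exclude self-similarity
    return float(sims[mask].mean())

rng = np.random.default_rng(0)
base = rng.normal(size=64)                     # a consistent latent direction
aligned = np.stack([base + 0.1 * rng.normal(size=64) for _ in range(20)])
random_dirs = rng.normal(size=(20, 64))        # no shared direction
```

A behavior whose activation differences look like `aligned` would be flagged as reliably steerable; one resembling `random_dirs` would not.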

[NLP-28] ADAPT: Hybrid Prompt Optimization for LLM Feature Visualization

【Quick Read】: This paper tackles the difficulty of understanding which features are encoded by learned directions in the activation space of large language models (LLMs), and in particular how to efficiently identify inputs that strongly activate those directions. Searching datasets for activating examples is costly, and existing prompt optimization techniques over discrete text are prone to local optima, making feature visualization hard. The key to the proposed ADAPT method is a hybrid strategy combining beam-search initialization with adaptive gradient-guided mutation, designed to escape local minima and stabilize optimization. Evaluated on Sparse Autoencoder latents from Gemma 2 2B, using metrics grounded in dataset activation statistics for rigorous comparison, ADAPT consistently outperforms prior methods across layers and latent types, showing that feature visualization for LLMs is tractable but depends on design assumptions tailored to the domain.

Link: https://arxiv.org/abs/2602.17867
Authors: João N. Cardoso, Arlindo L. Oliveira, Bruno Martins
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Understanding what features are encoded by learned directions in LLM activation space requires identifying inputs that strongly activate them. Feature visualization, which optimizes inputs to maximally activate a target direction, offers an alternative to costly dataset search approaches, but remains underexplored for LLMs due to the discrete nature of text. Furthermore, existing prompt optimization techniques are poorly suited to this domain, which is highly prone to local minima. To overcome these limitations, we introduce ADAPT, a hybrid method combining beam search initialization with adaptive gradient-guided mutation, designed around these failure modes. We evaluate on Sparse Autoencoder latents from Gemma 2 2B, proposing metrics grounded in dataset activation statistics to enable rigorous comparison, and show that ADAPT consistently outperforms prior methods across layers and latent types. Our results establish that feature visualization for LLMs is tractable, but requires design assumptions tailored to the domain.
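The beam-search initialization stage can be illustrated on a toy discrete objective. The scoring function below merely stands in for the target activation; the real method scores prompts by how strongly they activate an SAE latent, and also adds a gradient-guided mutation stage not shown here.

```python
def beam_search_prompt(score_fn, vocab, length=4, beam=3):
    """Beam search over discrete tokens to maximize a target activation,
    as in the initialization stage of a hybrid prompt optimizer."""
    beams = [((), 0.0)]
    for _ in range(length):
        expanded = [(seq + (tok,), score_fn(seq + (tok,)))
                    for seq, _ in beams for tok in vocab]
        beams = sorted(expanded, key=lambda b: -b[1])[:beam]
    return beams[0]

# toy "feature" whose activation simply counts occurrences of token 7
score = lambda seq: float(seq.count(7))
best_seq, best_score = beam_search_prompt(score, vocab=range(10))
```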

[NLP-29] On the scaling relationship between cloze probabilities and language model next-token prediction

【Quick Read】: This paper examines how language model scale affects the prediction of human eye-movement and reading-time data, in particular the accuracy and semantic plausibility of next-token prediction. The key finding is that larger models, being better aligned semantically with human cloze responses, produce higher-quality estimates of target words and their likelihood of production; at the same time, their greater memorization capacity makes them less dependent on low-level information such as lexical co-occurrence statistics, which further improves prediction quality.

Link: https://arxiv.org/abs/2602.17848
Authors: Cassandra L. Jacobs, Morgan Grobol
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Recent work has shown that larger language models have better predictive power for eye movement and reading time data. While even the best models under-allocate probability mass to human responses, larger models assign higher-quality estimates of next tokens and their likelihood of production in cloze data because they are less sensitive to lexical co-occurrence statistics while being better aligned semantically to human cloze responses. The results provide support for the claim that the greater memorization capacity of larger models helps them guess more semantically appropriate words, but makes them less sensitive to low-level information that is relevant for word recognition.

[NLP-30] TFL: Targeted Bit-Flip Attack on Large Language Model

【Quick Read】: This paper addresses the fact that existing bit-flip attacks (BFAs) on large language models (LLMs) are largely untargeted, inducing general failures or performance degradation with little precise control over the outputs generated for specific prompts. The key to the proposed TFL framework, a novel targeted bit-flip attack, is twofold: a keyword-focused attack loss that promotes attacker-specified target tokens in the generated output, and an auxiliary utility score that balances attack effectiveness against collateral impact on benign inputs. Experiments show that TFL achieves precise output manipulation for selected prompts with fewer than 50 bit flips, while affecting unrelated queries significantly less than existing methods, establishing a stealthy and controllable class of targeted attack.

Link: https://arxiv.org/abs/2602.17837
Authors: Jingkai Guo, Chaitali Chakrabarti, Deliang Fan
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 11 figures. Preprint

Click to view abstract

Abstract:Large language models (LLMs) are increasingly deployed in safety and security critical applications, raising concerns about their robustness to model parameter fault injection attacks. Recent studies have shown that bit-flip attacks (BFAs), which exploit computer main memory (i.e., DRAM) vulnerabilities to flip a small number of bits in model weights, can severely disrupt LLM behavior. However, existing BFA on LLM largely induce un-targeted failure or general performance degradation, offering limited control over manipulating specific or targeted outputs. In this paper, we present TFL, a novel targeted bit-flip attack framework that enables precise manipulation of LLM outputs for selected prompts while maintaining almost no or minor degradation on unrelated inputs. Within our TFL framework, we propose a novel keyword-focused attack loss to promote attacker-specified target tokens in generative outputs, together with an auxiliary utility score that balances attack effectiveness against collateral performance impact on benign data. We evaluate TFL on multiple LLMs (Qwen, DeepSeek, Llama) and benchmarks (DROP, GSM8K, and TriviaQA). The experiments show that TFL achieves successful targeted LLM output manipulations with less than 50 bit flips and significantly reduced effect on unrelated queries compared to prior BFA approaches. This demonstrates the effectiveness of TFL and positions it as a new class of stealthy and targeted LLM model attack.
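The underlying fault primitive, flipping one bit of a stored float32 weight, can be sketched as follows. This shows only the hardware-level effect; the paper's keyword-focused loss and bit-selection strategy are not reproduced here.

```python
import numpy as np

def flip_bit(weight, bit):
    """Flip a single bit of a float32 weight, as a DRAM fault would."""
    as_int = np.float32(weight).view(np.uint32)      # reinterpret bits
    flipped = np.uint32(as_int ^ np.uint32(1 << bit))
    return flipped.view(np.float32)

w = np.float32(0.5)
high = flip_bit(w, 30)   # flipping an exponent bit: the value explodes
low = flip_bit(w, 0)     # flipping a low mantissa bit: barely moves
```

The asymmetry between exponent and mantissa bits is why a handful of well-chosen flips can redirect a model's behavior.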

[NLP-31] Neural Synchrony Between Socially Interacting Language Models ICLR2026

【Quick Read】: This paper asks whether large language models (LLMs) possess mechanisms that could be regarded as "social minds", i.e., whether they exhibit something analogous to the neural synchrony observed between interacting humans. The key to the solution is introducing neural synchrony during social simulations as a novel proxy for analyzing the sociality of LLMs at the representational level. Carefully designed experiments show that this proxy reliably reflects both social engagement and temporal alignment in LLM interactions, and that it is strongly correlated with the models' social performance, offering empirical evidence for parallels between the internal dynamics underlying human and LLM social interaction.

Link: https://arxiv.org/abs/2602.17815
Authors: Zhining Zhang, Wentao Zhu, Chi Han, Yizhou Wang, Heng Ji
Affiliation: Peking University; Eastern Institute of Technology, Ningbo; University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ICLR 2026

Click to view abstract

Abstract:Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction. Traditionally, social minds have been regarded as an exclusive property of living beings. Although large language models (LLMs) are widely accepted as powerful approximations of human behavior, with multi-LLM system being extensively explored to enhance their capabilities, it remains controversial whether they can be meaningfully compared to human social minds. In this work, we explore neural synchrony between socially interacting LLMs as an empirical evidence for this debate. Specifically, we introduce neural synchrony during social simulations as a novel proxy for analyzing the sociality of LLMs at the representational level. Through carefully designed experiments, we demonstrate that it reliably reflects both social engagement and temporal alignment in their interactions. Our findings indicate that neural synchrony between LLMs is strongly correlated with their social performance, highlighting an important link between neural synchrony and the social behaviors of LLMs. Our work offers a new perspective to examine the “social minds” of LLMs, highlighting surprising parallels in the internal dynamics that underlie human and LLM social interaction.
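One simple way to operationalize synchrony between two hidden-state trajectories is mean per-dimension Pearson correlation; the paper's exact measure may differ, so treat this as an assumed formulation with synthetic trajectories.

```python
import numpy as np

def synchrony(traj_a, traj_b):
    """Mean per-dimension Pearson correlation between two agents'
    hidden-state trajectories, each of shape (timesteps, hidden_dim)."""
    a = np.asarray(traj_a, float)
    b = np.asarray(traj_b, float)
    a = (a - a.mean(0)) / a.std(0)   # standardize each dimension over time
    b = (b - b.mean(0)) / b.std(0)
    return float((a * b).mean())

rng = np.random.default_rng(1)
shared = rng.normal(size=(50, 8))                   # signal driven by the dialogue
engaged = shared + 0.3 * rng.normal(size=(50, 8))   # agent tracking the interaction
detached = rng.normal(size=(50, 8))                 # agent ignoring it
```

An engaged interlocutor's states co-vary with the shared signal; a detached one's do not, which is the contrast the proxy is meant to capture.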

[NLP-32] QueryPlot: Generating Geological Evidence Layers using Natural Language Queries for Mineral Exploration

【Quick Read】: This paper addresses the problem of efficiently synthesizing heterogeneous geological knowledge (textual deposit models and geospatial data) to identify regions likely to host mineral deposits in mineral prospectivity mapping, a process that is traditionally manual, knowledge-intensive, and hard to scale. The key to the proposed QueryPlot framework is semantically aligning a large geological text corpus with geologic map data using modern Natural Language Processing (NLP): descriptive models for over 120 deposit types are converted into structured textual representations, and a pretrained embedding model encodes both user natural-language queries and region descriptions so that semantic similarity scores can rank candidate regions and visualize them as continuous evidence layers. The system further supports compositional multi-criteria queries and the use of similarity scores as additional features for supervised learning, achieving high recall of known occurrences and identifying prospective regions that closely align with expert-defined permissive tracts.

Link: https://arxiv.org/abs/2602.17784
Authors: Meng Ye, Xiao Lin, Georgina Lukoczki, Graham W. Lederer, Yi Yao
Affiliation: SRI International; University of Kentucky; U.S. Geological Survey
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Mineral prospectivity mapping requires synthesizing heterogeneous geological knowledge, including textual deposit models and geospatial datasets, to identify regions likely to host specific mineral deposit types. This process is traditionally manual and knowledge-intensive. We present QueryPlot, a semantic retrieval and mapping framework that integrates large-scale geological text corpora with geologic map data using modern Natural Language Processing techniques. We curate descriptive deposit models for over 120 deposit types and transform the State Geologic Map Compilation (SGMC) polygons into structured textual representations. Given a user-defined natural language query, the system encodes both queries and region descriptions using a pretrained embedding model and computes semantic similarity scores to rank and spatially visualize regions as continuous evidence layers. QueryPlot supports compositional querying over deposit characteristics, enabling aggregation of multiple similarity-derived layers for multi-criteria prospectivity analysis. In a case study on tungsten skarn deposits, we demonstrate that embedding-based retrieval achieves high recall of known occurrences and produces prospective regions that closely align with expert-defined permissive tracts. Furthermore, similarity scores can be incorporated as additional features in supervised learning pipelines, yielding measurable improvements in classification performance. QueryPlot is implemented as a web-based system supporting interactive querying, visualization, and export of GIS-compatible prospectivity maps. To support future research, we have made the source code and datasets used in this study publicly available.
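The core retrieval step, ranking region descriptions by embedding similarity to a natural-language query, can be sketched with a toy hashing embedder standing in for the pretrained model. The region texts and the embedder itself are illustrative, not from the paper.

```python
import numpy as np

def embed(text, dim=64):
    """Toy stand-in for a pretrained sentence embedder: hash tokens into
    a fixed-size bag-of-words vector and L2-normalize it."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-12)

regions = {
    "R1": "skarn carbonate contact zone near granite intrusion with tungsten",
    "R2": "basalt flows on a volcanic plateau",
}
query = "tungsten skarn deposits near a granite intrusion"
q = embed(query)
# dot products of unit vectors = cosine similarity; higher = more prospective
ranking = sorted(regions, key=lambda r: -float(q @ embed(regions[r])))
```

The per-region similarity scores are what the system rasterizes into a continuous evidence layer.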

[NLP-33] Bayesian Optimality of In-Context Learning with Selective State Spaces

【Quick Read】: This paper seeks a theoretical account of in-context learning (ICL), particularly the efficiency and robustness of Transformer-like models. Departing from the conventional interpretation of ICL as implicit gradient descent, it formalizes ICL as meta-learning over latent sequence tasks. The key contribution is to study selective state space models (selective SSMs) under Linear Gaussian State Space Models (LG-SSMs) and prove that a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean. Crucially, the paper establishes a statistical separation between Bayes-optimal prediction and empirical risk minimization (ERM): on tasks with temporally correlated noise, the Bayesian predictor strictly outperforms any ERM estimator, implying that selective SSMs enjoy lower asymptotic risk through superior statistical efficiency. Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark confirm faster convergence to Bayes-optimal risk, better sample efficiency with longer contexts, and more robust state tracking, reframing ICL as optimal inference and offering a principled basis for architecture design.

Link: https://arxiv.org/abs/2602.17744
Authors: Di Zhang, Jiaqi Xing
Affiliation: Xi’an Jiaotong-Liverpool University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 17 pages

Click to view abstract

Abstract:We propose Bayesian optimal sequential prediction as a new principle for understanding in-context learning (ICL). Unlike interpretations framing Transformers as performing implicit gradient descent, we formalize ICL as meta-learning over latent sequence tasks. For tasks governed by Linear Gaussian State Space Models (LG-SSMs), we prove a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean. We further establish a statistical separation from gradient descent, constructing tasks with temporally correlated noise where the optimal Bayesian predictor strictly outperforms any empirical risk minimization (ERM) estimator. Since Transformers can be seen as performing implicit ERM, this demonstrates selective SSMs achieve lower asymptotic risk due to superior statistical efficiency. Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark confirm selective SSMs converge faster to Bayes-optimal risk, show superior sample efficiency with longer contexts in structured-noise settings, and track latent states more robustly than linear Transformers. This reframes ICL from “implicit optimization” to “optimal inference,” explaining the efficiency of selective SSMs and offering a principled basis for architecture design.
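For a scalar LG-SSM, the Bayes-optimal sequential predictor that the meta-trained selective SSM is proven to converge to is the Kalman filter's posterior predictive mean. A sketch under assumed parameters (the paper's tasks and dimensions are more general):

```python
import numpy as np

def kalman_predictive_means(ys, a=0.9, q=0.1, r=0.5, m0=0.0, p0=1.0):
    """Posterior predictive mean E[y_{t+1} | y_1..y_t] for the scalar LG-SSM
    x_{t+1} = a*x_t + N(0, q), y_t = x_t + N(0, r)."""
    m, p = m0, p0
    preds = []
    for y in ys:
        k = p / (p + r)               # Kalman gain (measurement update)
        m, p = m + k * (y - m), (1 - k) * p
        m, p = a * m, a * a * p + q   # time update
        preds.append(m)               # E[x_{t+1}] = E[y_{t+1}] here
    return np.array(preds)

# simulate from the same LG-SSM and compare against a naive last-value predictor
rng = np.random.default_rng(0)
x, ys = 0.0, []
for _ in range(500):
    x = 0.9 * x + rng.normal(0.0, np.sqrt(0.1))
    ys.append(x + rng.normal(0.0, np.sqrt(0.5)))
ys = np.array(ys)
preds = kalman_predictive_means(ys)
mse_bayes = float(np.mean((ys[1:] - preds[:-1]) ** 2))
mse_naive = float(np.mean((ys[1:] - ys[:-1]) ** 2))
```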

[NLP-34] A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU

【Quick Read】: This paper studies the effectiveness and stability of Post-Training Quantization (PTQ) on the Ascend NPU platform, focusing on the deployability of reasoning-oriented models such as the DeepSeek-R1-Distill-Qwen series and QwQ-32B. The key contribution is a systematic evaluation of four representative PTQ algorithms (AWQ, GPTQ, SmoothQuant, and FlatQuant), revealing the performance differences and limitations of different quantization strategies on the NPU. The main findings: 4-bit weight-only quantization is viable for larger models, but aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks; standard 8-bit quantization remains numerically stable; and although INT8 deployment reduces latency, dynamic quantization overheads still limit end-to-end acceleration. These conclusions provide a practical reference for deploying quantized models on the NPU.

Link: https://arxiv.org/abs/2602.17693
Authors: Yuchen Luo, Fangyue Zhu, Ruining Zhou, Mingzhe Huang, Jian Zhu, Fanyu Fan, Wei Shao
Affiliation: Wuhan University; Huawei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms, including AWQ, GPTQ, SmoothQuant, and FlatQuant, to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU.
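As a reference point for what these PTQ baselines build on, here is a minimal symmetric per-tensor INT8 round-trip; none of the evaluated algorithms reduce to exactly this (they add weight reordering, smoothing, or rotations on top), so read it as the baseline primitive only.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 post-training quantization."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)
q, s = quantize_int8(w)
max_err = float(np.abs(dequantize(q, s) - w).max())   # bounded by half a step
```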

[NLP-35] Tethered Reasoning: Decoupling Entropy from Hallucination in Quantized LLMs via Manifold Steering

【Quick Read】: This paper addresses trajectory divergence and hallucination in quantized language models at high sampling temperatures, where traditional approaches struggle to maintain output diversity and semantic coherence at the same time. The key to the proposed HELIX framework is a geometric constraint that tethers hidden-state trajectories to a pre-computed truthfulness manifold, decoupling output entropy from hallucination. Its central innovation is a Unified Truth Score (UTS) combining token-level semantic entropy with Mahalanobis distance to quantify trajectory divergence; graduated steering vectors are applied to only 0.2-2.5% of tokens, yet markedly improve output stability and creativity at high temperature. Experiments on 4-bit quantized models demonstrate Multi-Temperature Synthesis, generating 200% more unique concepts than single-temperature inference without breaking logical coherence.

Link: https://arxiv.org/abs/2602.17691
Authors: Craig Atkinson
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 16 pages, 6 tables

Click to view abstract

Abstract:Quantized language models face a fundamental dilemma: low sampling temperatures yield repetitive, mode-collapsed outputs, while high temperatures (T > 2.0) cause trajectory divergence and semantic incoherence. We present HELIX, a geometric framework that decouples output entropy from hallucination by tethering hidden-state trajectories to a pre-computed truthfulness manifold. HELIX computes a Unified Truth Score (UTS) combining token-level semantic entropy with Mahalanobis distance from the manifold. When UTS indicates trajectory divergence, graduated steering vectors redirect activations toward structurally coherent regions while affecting only 0.2-2.5% of tokens. On 4-bit quantized Granite 4.0 H Small (32B/9B active, hybrid Mamba-Transformer): GSM8K maintains 88.84% accuracy at T = 3.0 (2.81pp degradation from T = 0.5); MMLU maintains 72.49% across 14,042 questions (1.24pp degradation). This demonstrates that high-temperature hallucination is primarily trajectory divergence rather than semantic collapse. Notably, steering the sparse Transformer attention layers (~10% of layers) is sufficient to correct drift in the Mamba-2 state-space formulation. Geometric tethering reveals a previously-masked High-Entropy Creative Reservoir. At T > 2.0, steered outputs exhibit 5-20% idea duplication versus 70-80% at conservative settings. Cross-architecture validation (Qwen3-30B-A3B MOE) confirms this phenomenon is architecture-independent, with 46.7% higher unique concept generation. HELIX acts as a syntax tether, enabling exploration of semantic diversity without violating the logical backbone required for valid output. This enables Multi-Temperature Synthesis, generating 200% more unique concepts than single-temperature inference.
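The Mahalanobis-distance component of the UTS can be sketched by modeling the truthfulness manifold as a Gaussian over hidden states; this is an assumed simplification of the paper's construction, with synthetic states.

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of a hidden state from the manifold center."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
states = rng.normal(size=(2000, 16))     # stand-in on-manifold hidden states
mean = states.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(states.T) + 1e-6 * np.eye(16))
on_manifold = states[0]
divergent = mean + 10.0                  # a trajectory drifting off-manifold
```

A large distance on a given token would trigger a graduated steering correction; on-manifold tokens are left untouched, which is how only 0.2-2.5% of tokens end up modified.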

[NLP-36] Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction

【Quick Read】: This paper addresses the performance degradation of medical vision-language models under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles; existing multi-modal pre-training methods largely treat robustness as a downstream adaptation problem. The key to the proposed self-supervised pre-training framework, Robust-MMR, is incorporating robustness objectives explicitly into masked vision-language learning, via asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints, which encourage domain-invariant representations and markedly improve generalization and the reliability of clinical reasoning under cross-domain and perturbed conditions.

Link: https://arxiv.org/abs/2602.17689
Authors: Melika Filvantorkaman, Mohsen Piri
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 3 figures

Click to view abstract

Abstract:Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing multi-modal pre-training methods largely overlook robustness, treating it as a downstream adaptation problem. In this work, we propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Robust-MMR achieves 78.9% cross-domain accuracy on VQA-RAD, outperforming the strongest baseline by 3.8 percentage points, and reaches 74.6% and 77.0% accuracy on SLAKE and VQA-2019, respectively. Under perturbed evaluation, Robust-MMR improves VQA-RAD accuracy from 69.1% to 75.6%. For image-text classification, cross-domain MELINDA accuracy increases from 70.3% to 75.2%, while retrieval experiments show a reduction in mean rank degradation from over 16 to 4.1 under perturbation. Qualitative results further demonstrate improved clinical reasoning for disease detection and structural abnormality assessment. These findings show that explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.

[NLP-37] LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs

【Quick Read】: This paper addresses the accuracy degradation of large language models (LLMs) under low-bit quantization in the microscaling (MX) format. Prior methods reduce activation outliers through invertible transformations (such as rotations or Hadamard transforms) to improve quantization robustness, but are restricted in the transformations they consider and are poorly adapted to the MX data format, causing severe performance drops. The key to the solution is twofold: first, a theoretical analysis of transformations under MX quantization derives an upper bound on the quantization error, emphasizing that both the activation distribution and the quantization structure must be taken into account; second, building on this analysis, the proposed LATMiX method introduces learnable invertible affine transformations trained end-to-end with standard deep learning tools, suppressing activation outliers more effectively while remaining compatible with MX quantization. Experiments show consistent improvements over strong baselines across model sizes and zero-shot benchmarks.

Link: https://arxiv.org/abs/2602.17681
Authors: Ofir Gordon, Lior Dikstein, Arnon Netzer, Idan Achituve, Hai Victor Habi
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 24 pages, 4 figures

Click to view abstract

Abstract:Post-training quantization (PTQ) is a widely used approach for reducing the memory and compute costs of large language models (LLMs). Recent studies have shown that applying invertible transformations to activations can significantly improve quantization robustness by reducing activation outliers; however, existing approaches are largely restricted to rotation or Hadamard-based transformations. Moreover, most studies focused primarily on traditional quantization schemes, whereas modern hardware increasingly supports the microscaling (MX) data format. Attempts to combine both showed severe performance degradation, leading prior work to introduce assumptions on the transformations. In this work, we take a complementary perspective. First, we provide a theoretical analysis of transformations under MX quantization by deriving a bound on the quantization error. Our analysis emphasizes the importance of accounting for both the activation distribution and the underlying quantization structure. Building on this analysis, we propose LATMiX, a method that generalizes outlier reduction to learnable invertible affine transformations optimized using standard deep learning tools. Experiments show consistent improvements in average accuracy for MX low-bit quantization over strong baselines on a wide range of zero-shot benchmarks, across multiple model sizes.
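The equivalence that makes invertible transformations "free" at the layer level can be checked directly: folding a transform into the activations and its inverse into the weights preserves the output, while the quantizer then sees the transformed (flatter) activations. The learnable training and MX-specific details are omitted; the matrix below is just an assumed invertible stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 0] *= 50.0                               # an outlier activation channel
W = rng.normal(size=(8, 3))

T = rng.normal(size=(8, 8)) + 4.0 * np.eye(8) # stand-in for a learned invertible map
T_inv = np.linalg.inv(T)

# (X T)(T^{-1} W) == X W exactly; quantization would be applied to X @ T
out_ref = X @ W
out_transformed = (X @ T) @ (T_inv @ W)
```

Because `T_inv @ W` can be pre-computed offline, the only runtime cost is the activation-side transform.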

[NLP-38] Reducing Text Bias in Synthetically Generated MCQAs for VLMs in Autonomous Driving

【Quick Read】: This paper addresses the inflated performance of Vision Language Models (VLMs) on Multiple Choice Question Answering (MCQA) benchmarks caused by hidden textual cues in synthetically generated data: such cues let models score well by exploiting linguistic patterns rather than visual context, distorting evaluations of actual perceptual understanding. The key to the solution is decoupling the correct answer from linguistic artifacts and employing a curriculum learning strategy that forces the model to rely on visual grounding; this reduces blind accuracy from +66.9% above random to only +2.9%, eliminating the vast majority of textual shortcuts and ensuring that measured performance reflects genuine visual perception.

Link: https://arxiv.org/abs/2602.17677
Authors: Sutej Kulgod, Sean Ye, Sanchit Tanwar, Christoffer Heckman
Affiliation: Zoox, Inc.
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: 7 pages, 2 figures

Click to view abstract

Abstract:Multiple Choice Question Answering (MCQA) benchmarks are an established standard for measuring Vision Language Model (VLM) performance in driving tasks. However, we observe the known phenomenon that synthetically generated MCQAs are highly susceptible to hidden textual cues that allow models to exploit linguistic patterns rather than visual context. Our results show that a VLM fine-tuned on such data can achieve accuracy comparable to human-validated benchmarks even without visual input. Our proposed method reduces blind accuracy from +66.9% above random to +2.9%, eliminating the vast majority of exploitable textual shortcuts. By decoupling the correct answer from linguistic artifacts and employing a curriculum learning strategy, we force the model to rely on visual grounding, ensuring that performance accurately reflects perceptual understanding.

[NLP-39] Epistemic Traps: Rational Misalignment Driven by Model Misspecification

【Quick Read】: This paper addresses persistent behavioral pathologies of Large Language Models and AI agents deployed in critical societal and technical domains, such as sycophancy, hallucination, and strategic deception, which resist mitigation via reinforcement learning. Its core contribution is a new theoretical framework that adapts Berk-Nash Rationalizability from theoretical economics: modeling the agent as optimizing against a flawed subjective world model, it shows these unsafe behaviors are not training noise or accidental errors but mathematically rationalizable consequences of model misspecification. The key insight is that safety is a discrete phase transition determined by the agent's epistemic priors rather than a continuous function of reward magnitude, establishing Subjective Model Engineering, the design of an agent's internal belief structure, as a necessary condition for robust alignment and marking a paradigm shift from manipulating environmental rewards to shaping the agent's interpretation of reality.

Link: https://arxiv.org/abs/2602.17676
Authors: Xingcheng Xu, Jingjing Qu, Qiaosheng Zhang, Chaochao Lu, Yanqing Yang, Na Zou, Xia Hu
Affiliation: Shanghai Artificial Intelligence Laboratory; ShanghaiTech University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies including sycophancy, hallucination, and strategic deception that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts, lacking a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors, but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either a stable misaligned equilibrium or oscillatory cycles depending on reward scheme, while strategic deception persists as a “locked-in” equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state-of-the-art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent’s epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent’s internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent’s interpretation of reality.

Information Retrieval

[IR-0] VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning

【Quick Read】: This paper addresses the poor performance of Large Language Models (LLMs) on tasks involving rich socio-cultural knowledge and diverse local contexts, such as Indian culture. Existing cultural benchmarks are manually crafted, limited to single-hop factual questions, and prohibitively costly to scale, leaving this deficiency largely unmeasured. The key to the solution is VIRAASAT, a semi-automatically generated multi-hop question-answering dataset built on a Knowledge Graph of over 700 expert-curated cultural artifacts, covering 13 attributes of Indian culture across all 28 states and 8 Union Territories and yielding more than 3,200 multi-hop questions that require chained cultural reasoning. To further improve models' ability to ground and synthesize low-probability facts, the paper proposes the Symbolic Chain-of-Manipulation (SCoM) framework, which trains the model to simulate atomic Knowledge Graph manipulations and thus explicitly learn to traverse the graph's topology, outperforming standard Chain-of-Thought (CoT) baselines by up to 20% under supervised fine-tuning (SFT).

Link: https://arxiv.org/abs/2602.18429
Authors: Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha, Amit Sheth
Affiliation: AI Institute, University of South Carolina; Indian Institute of Technology Patna
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have made significant progress in reasoning tasks across various domains such as mathematics and coding. However, their performance deteriorates in tasks requiring rich socio-cultural knowledge and diverse local contexts, particularly those involving Indian Culture. Existing Cultural benchmarks are (i) Manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured. To address this, we introduce VIRAASAT, a novel, semi-automated multi-hop approach for generating cultural specific multi-hop Question-Answering dataset for Indian culture. VIRAASAT leverages a Knowledge Graph comprising more than 700 expert-curated cultural artifacts, covering 13 key attributes of Indian culture (history, festivals, etc). VIRAASAT spans all 28 states and 8 Union Territories, yielding more than 3,200 multi-hop questions that necessitate chained cultural reasoning. We evaluate current State-of-the-Art (SOTA) LLMs on VIRAASAT and identify key limitations in reasoning wherein fine-tuning on Chain-of-Thought(CoT) traces fails to ground and synthesize low-probability facts. To bridge this gap, we propose a novel framework named Symbolic Chain-of-Manipulation (SCoM). Adapting the Chain-of-Manipulation paradigm, we train the model to simulate atomic Knowledge Graph manipulations internally. SCoM teaches the model to reliably traverse the topological structure of the graph. Experiments on Supervised Fine-Tuning (SFT) demonstrate that SCoM outperforms standard CoT baselines by up to 20%. We release the VIRAASAT dataset along with our findings, laying a strong foundation towards building Culturally Aware Reasoning Models.
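The chained cultural reasoning the benchmark targets amounts to multi-hop traversal over a knowledge graph. A two-hop toy example (the two facts are real, but the graph structure and relation names are illustrative, not drawn from VIRAASAT):

```python
# toy cultural knowledge graph: (entity, relation) -> entity
kg = {
    ("Bihu", "celebrated_in"): "Assam",
    ("Assam", "classical_dance"): "Sattriya",
}

def hop(entity, relation):
    """One atomic knowledge-graph manipulation: follow a relation edge."""
    return kg[(entity, relation)]

# "Which classical dance form belongs to the state where Bihu is celebrated?"
state = hop("Bihu", "celebrated_in")
answer = hop(state, "classical_dance")
```

SCoM's idea, as described, is to train the model to emit such atomic hops internally rather than free-form reasoning text.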

[IR-1] RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering

【Quick Read】: This paper addresses the difficulty of comprehensively covering all valid answers in multi-answer retrieval, i.e., how to improve recall of diverse and complete answer sets. The key to the solution is retrieve-verify-retrieve (RVR), a multi-round iterative retrieval framework: an initial round retrieves a candidate document set, a verifier identifies a high-quality subset, and subsequent rounds augment the query with the previously verified documents to uncover answers not yet covered. The method works even with off-the-shelf retrievers and no elaborate pre-training, achieving at least a 10% relative and 3% absolute gain in complete recall on the multi-answer dataset QAMPARI, with consistent gains on out-of-domain datasets.

Link: https://arxiv.org/abs/2602.18425
Authors: Deniz Qian, Hung-Ting Chen, Eunsol Choi
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 18 pages, 12 figures, 12 tables

Click to view abstract

Abstract:Comprehensively retrieving diverse documents is crucial to address queries that admit a wide range of valid answers. We introduce retrieve-verify-retrieve (RVR), a multi-round retrieval framework designed to maximize answer coverage. Initially, a retriever takes the original query and returns a candidate document set, followed by a verifier that identifies a high-quality subset. For subsequent rounds, the query is augmented with previously verified documents to uncover answers that are not yet covered in previous rounds. RVR is effective even with off-the-shelf retrievers, and fine-tuning retrievers for our inference procedure brings further gains. Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI). We also see consistent gains on two out-of-domain datasets (QUEST and WebQuestionsSP) across different base retrievers. Our work presents a promising iterative approach for comprehensive answer recall leveraging a verifier and adapting retrievers to a new inference scenario.
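The retrieve-verify-retrieve loop can be sketched end-to-end with a toy lexical retriever and verifier; every component here is an illustrative stand-in for the paper's learned retriever and verifier.

```python
def rvr(query, retrieve, verify, rounds=3, k=2):
    """Retrieve-verify-retrieve: each round augments the query with the
    verified documents so far to surface answers not yet covered."""
    verified = []
    for _ in range(rounds):
        augmented = query + " " + " ".join(verified)
        for doc in verify(retrieve(augmented, k, exclude=verified)):
            if doc not in verified:
                verified.append(doc)
    return verified

corpus = [
    "nordic capital oslo norway",
    "oslo norway neighbor stockholm sweden capital",
    "stockholm sweden neighbor helsinki finland capital",
    "recipe for pancakes",
]

def retrieve(augmented, k, exclude=()):
    """Toy lexical retriever: rank unseen documents by token overlap."""
    terms = set(augmented.split())
    pool = [d for d in corpus if d not in exclude]
    return sorted(pool, key=lambda d: -len(terms & set(d.split())))[:k]

verify = lambda docs: [d for d in docs if "capital" in d]

full = rvr("capital cities of the nordic countries", retrieve, verify)
single_round = rvr("capital cities of the nordic countries", retrieve, verify, rounds=1)
```

A single round finds only the documents lexically close to the original query; later rounds chain through the verified documents' vocabulary to reach the remaining answers, which is the coverage effect the paper measures.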

[IR-2] A Topology-Aware Positive Sample Set Construction and Feature Optimization Method in Implicit Collaborative Filtering

【Quick Read】: This paper addresses the false negatives introduced by negative sampling strategies in implicit collaborative filtering, which hinder accurate learning of users' latent preferences. Existing remedies that adjust the negative sampling distribution have two key limitations: over-reliance on the model's current representation capability, and failure to exploit false negatives as latent positive samples that could guide learning. The key to the proposed Topology-aware Positive Sample Set Construction and Feature Optimization (TPSC-FO) method is twofold: (1) a topological community-aware false negative identification (FNI) mechanism that exploits community structure in the interaction network to reliably identify false negatives and convert them into positive samples; and (2) a neighborhood-guided feature optimization module that refines positive sample representations by incorporating neighbor features in the embedding space, mitigating noise and improving the accuracy of user preference modeling.

Link: https://arxiv.org/abs/2602.18288
Authors: Jiayi Wu, Zhengyu Wu, Xunkai Li, Rong-Hua Li, Guoren Wang
Affiliation: Beijing Institute of Technology
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Negative sampling strategies are widely used in implicit collaborative filtering to address issues like data sparsity and class imbalance. However, these methods often introduce false negatives, hindering the model’s ability to accurately learn users’ latent preferences. To mitigate this problem, existing methods adjust the negative sampling distribution based on statistical features from model training or the hardness of negative samples. Nevertheless, these methods face two key limitations: (1) over-reliance on the model’s current representation capabilities; (2) failure to leverage the potential of false negatives as latent positive samples to guide model learning of user preferences more accurately. To address the above issues, we propose a Topology-aware Positive Sample Set Construction and Feature Optimization method (TPSC-FO). First, we design a simple topological community-aware false negative identification (FNI) method and observe that topological community structures in interaction networks can effectively identify false negatives. Motivated by this, we develop a topology-aware positive sample set construction module. This module employs a differential community detection strategy to capture topological community structures in implicit feedback, coupled with personalized noise filtration to reliably identify false negatives and convert them into positive samples. Additionally, we introduce a neighborhood-guided feature optimization module that refines positive sample features by incorporating neighborhood features in the embedding space, effectively mitigating noise in the positive samples. Extensive experiments on five real-world datasets and two synthetic datasets validate the effectiveness of TPSC-FO.
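下面以一个手工指定社区标签的玩具示例,示意“拓扑社区结构可用于识别虚假负样本”这一核心观察:与用户已交互物品同属一个社区、但尚未被观测到的物品,被提升为正样本。社区标签在此为假设输入(实际论文使用差分社区检测并配合个性化噪声过滤)。

```python
# TPSC-FO 核心思想的玩具示意:同社区的未观测物品视为潜在正样本。
# item -> 社区编号(假设已由某社区检测算法给出)
item_community = {"i1": 0, "i2": 0, "i3": 0, "i4": 1, "i5": 1}

def promote_false_negatives(interacted, all_items, item_community):
    # 用户触达过的社区
    user_comms = {item_community[i] for i in interacted}
    # 这些社区中未被观测的物品,很可能是虚假负样本
    return [i for i in all_items
            if i not in interacted and item_community[i] in user_comms]

interacted = ["i1", "i2"]  # 某用户的观测正样本
promoted = promote_false_negatives(interacted, list(item_community),
                                   item_community)
```

真实方法还会对 `promoted` 做可靠性过滤,再将其并入正样本集合参与训练。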

[IR-3] HyTRec: A Hybrid Temporal-Aware Attention Architecture for Long Behavior Sequential Recommendation

【速读】:该论文旨在解决生成式推荐中长序列用户行为建模的难题,即现有方法在效率与检索精度之间存在权衡:线性注意力机制因状态容量有限导致召回精度下降,而Softmax注意力则因计算开销过大难以应用于工业场景。解决方案的关键在于提出HyTRec模型,其核心是采用混合注意力(Hybrid Attention)架构,显式分离长期稳定偏好与短期意图波动;通过将大量历史行为分配给线性注意力分支,同时保留专用Softmax注意力分支处理近期交互,从而在保持线性推理速度的前提下恢复高精度检索能力。此外,为缓解线性层对快速兴趣漂移的滞后响应问题,设计了时序感知增量网络(Temporal-Aware Delta Network, TADN),动态增强新行为信号权重并抑制历史噪声,显著提升超长序列用户的推荐效果,实验证明该方法在工业级数据集上相较强基线模型提升超过8%的命中率(Hit Rate)。

链接: https://arxiv.org/abs/2602.18283
作者: Lei Xin,Yuhao Zheng,Ke Cheng,Changjiang Jiang,Zifan Zhang,Fanhu Zeng
机构: Shanghai Dewu Information Group(上海德物信息集团); Wuhan University(武汉大学); USTC(中国科学技术大学); Beihang University(北京航空航天大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Modeling long sequences of user behaviors has emerged as a critical frontier in generative recommendation. However, existing solutions face a dilemma: linear attention mechanisms achieve efficiency at the cost of retrieval precision due to limited state capacity, while softmax attention suffers from prohibitive computational overhead. To address this challenge, we propose HyTRec, a model featuring a Hybrid Attention architecture that explicitly decouples long-term stable preferences from short-term intent spikes. By assigning massive historical sequences to a linear attention branch and reserving a specialized softmax attention branch for recent interactions, our approach restores precise retrieval capabilities within industrial-scale contexts involving ten thousand interactions. To mitigate the lag in capturing rapid interest drifts within the linear layers, we furthermore design Temporal-Aware Delta Network (TADN) to dynamically upweight fresh behavioral signals while effectively suppressing historical noise. Empirical results on industrial-scale datasets confirm the superiority that our model maintains linear inference speed and outperforms strong baselines, notably delivering over 8% improvement in Hit Rate for users with ultra-long sequences with great efficiency.
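下面是混合注意力这一划分思路的极简纯 Python 示意(假设):长历史前缀走线性注意力分支(核化特征映射,状态规模与序列长度无关),近期窗口走精确 softmax 注意力,最后简单加权融合。其中特征映射 `phi(x)=exp(x)` 与 0.5/0.5 的融合权重均为示意性选择,并非论文实现。

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_attention(q, keys, values):
    # 近期窗口:精确 softmax 注意力
    d = len(q)
    scores = [dot(q, k) / math.sqrt(d) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(w[i] * values[i][j] for i in range(len(values))) / z
            for j in range(len(values[0]))]

def linear_attention(q, keys, values):
    # 长历史分支:phi 逐元素取正,先累加 K^T V 状态,代价 O(n)
    phi = lambda vec: [math.exp(x) for x in vec]
    pq, d = phi(q), len(q)
    state = [[0.0] * d for _ in range(d)]   # sum_k phi(k) v^T
    norm = [0.0] * d
    for k, v in zip(keys, values):
        pk = phi(k)
        for a in range(d):
            norm[a] += pk[a]
            for b in range(d):
                state[a][b] += pk[a] * v[b]
    z = dot(pq, norm)
    return [dot(pq, [state[a][b] for a in range(d)]) / z for b in range(d)]

def hybrid(q, keys, values, recent=2):
    long_out = linear_attention(q, keys[:-recent], values[:-recent])
    recent_out = softmax_attention(q, keys[-recent:], values[-recent:])
    return [0.5 * x + 0.5 * y for x, y in zip(long_out, recent_out)]

keys = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.5, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
out = hybrid([0.2, 0.3], keys, values)
```

两个分支的输出都是 value 向量的凸组合,因此融合结果仍落在 value 取值范围内。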

[IR-4] Dual-Tree LLM-Enhanced Negative Sampling for Implicit Collaborative Filtering

【速读】:该论文旨在解决隐式协同过滤(Implicit Collaborative Filtering, CF)中负采样技术的局限性问题,特别是现有方法高度依赖文本信息和任务特定微调,导致实际应用受限。其解决方案的关键在于提出了一种无需文本输入且无需微调的双树LLM增强负采样方法(Dual-Tree LLM-enhanced Negative Sampling, DTL-NS),通过两个核心模块实现:(i) 离线伪负样本识别模块,利用层次索引树将协同结构和潜在语义信息转化为结构化的物品ID编码,供大语言模型(Large Language Models, LLMs)推理以精准识别伪负样本;(ii) 多视角难负样本采样模块,结合用户-物品偏好分数与物品间层次相似性,挖掘高质量难负样本,从而提升推荐模型的判别能力。该方法在多个数据集上显著优于基线模型,且可无缝集成到多种隐式CF模型和负采样策略中,具有良好的通用性和实用性。

链接: https://arxiv.org/abs/2602.18249
作者: Jiayi Wu,Zhengyu Wu,Xunkai Li,Rong-Hua Li,Guoren Wang
机构: Beijing Institute of Technology (北京理工大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Negative sampling is a pivotal technique in implicit collaborative filtering (CF) recommendation, enabling efficient and effective training by contrasting observed interactions with sampled unobserved ones. Recently, large language models (LLMs) have shown promise in recommender systems; however, research on LLM-empowered negative sampling remains underexplored. Existing methods heavily rely on textual information and task-specific fine-tuning, limiting practical applicability. To address this limitation, we propose a text-free and fine-tuning-free Dual-Tree LLM-enhanced Negative Sampling method (DTL-NS). It consists of two modules: (i) an offline false negative identification module that leverages hierarchical index trees to transform collaborative structural and latent semantic information into structured item-ID encodings for LLM inference, enabling accurate identification of false negatives; and (ii) a multi-view hard negative sampling module that combines user-item preference scores with item-item hierarchical similarities from these encodings to mine high-quality hard negatives, thus improving models’ discriminative ability. Extensive experiments demonstrate the effectiveness of DTL-NS. For example, on the Amazon-sports dataset, DTL-NS outperforms the strongest baseline by 10.64% and 19.12% in Recall@20 and NDCG@20, respectively. Moreover, DTL-NS can be integrated into various implicit CF models and negative sampling methods, consistently enhancing their performance. 
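摘要中的“多视角难负样本采样”可以用下面的玩具示例示意:将模型的用户-物品偏好分数与候选物品同用户历史的层次相似度线性组合后排序,取分最高者作为难负样本。打分数据与组合权重均为假设,具体组合方式以论文为准。

```python
# 多视角难负样本打分的玩具示意(数据与权重均为假设)。
user_item_score = {"i3": 0.9, "i4": 0.2, "i5": 0.7}      # 模型偏好分数
item_sim_to_history = {"i3": 0.8, "i4": 0.1, "i5": 0.3}  # 层次相似度

def mine_hard_negatives(candidates, alpha=0.5, top_k=1):
    def score(i):
        # 线性组合两个视角(示意性选择)
        return alpha * user_item_score[i] + (1 - alpha) * item_sim_to_history[i]
    return sorted(candidates, key=score, reverse=True)[:top_k]

hard = mine_hard_negatives(["i3", "i4", "i5"])
```

实际方法会先经离线虚假负样本识别模块剔除潜在正样本,再在剩余候选上做此类打分。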

[IR-5] The Economical-Ecological Benefits of Matching Non-matching Socks

【速读】:该论文旨在解决袜子配对使用中因单只丢失导致的资源浪费问题,即“孤袜(orphan socks)”现象所引发的经济与生态成本。其核心挑战在于:传统严格匹配策略虽看似节约资源,实则因频繁出现无袜可用日而造成服务中断和容量浪费;解决方案的关键在于引入可控的不匹配容忍度——通过量化个体对不匹配的敏感性与多样性偏好,结合计算机模拟验证可解释的配对策略,证明在损失不确定性的背景下,适度接受非匹配袜子的混搭使用能有效维持袜子功能服务、减少闲置容量,从而实现更高效的资源利用。

链接: https://arxiv.org/abs/2602.18221
作者: Teddy Lazebnik
机构: University of Haifa (海法大学); Jonkoping University (延雪平大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Socks are produced and replaced at a massive scale, yet their paired use makes them unusually vulnerable to waste, as the loss of a single sock can strand usable wear-capacity and trigger premature replacement. In this study, we quantify the economic and ecological value of pairing non-matching "orphan" socks, and the social cost that discourages this behaviour. We formalize sock ownership as a sequential decision problem under uncertainty in which socks wear out and disappear stochastically during laundering, while public exposure induces a person-specific mismatch penalty. We conducted an in-person study to estimate mismatch sensitivity and diversity preference, linking behavioural heterogeneity to optimal mixing strategies. Using these results and a computer simulation-based evaluation of interpretable pairing policies, we show that strict matching can appear resource-frugal largely because it generates many sockless days, whereas controlled tolerance for mismatch sustains service and reduces stranded capacity across loss regimes. This study establishes the feasibility of matching non-matching socks while outlining its limitations and challenges.
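摘要中“严格匹配 vs. 容忍混搭”的权衡可以用一个极简随机模拟说明:在每次洗涤中袜子以固定概率丢失的设定下,比较两种策略各自的“可穿着天数”。丢失率、时长等参数纯属假设,仅用于演示论文所述的模拟评估思路。

```python
import random

# 袜子配对策略的玩具模拟:同一随机种子下丢失轨迹相同,
# 仅“是否可穿”判据不同(假设参数,非论文设定)。
def simulate(n_pairs=10, days=200, loss_p=0.02,
             tolerate_mismatch=False, seed=0):
    rng = random.Random(seed)
    socks = {c: 2 for c in range(n_pairs)}  # 每种颜色的存活袜子数
    worn_days = 0
    for _ in range(days):
        if tolerate_mismatch:
            wearable = sum(socks.values()) >= 2            # 任意两只即可
        else:
            wearable = any(v >= 2 for v in socks.values())  # 必须成对
        worn_days += wearable
        # 洗涤中每只袜子以 loss_p 概率丢失
        for c in list(socks):
            for _ in range(socks[c]):
                if rng.random() < loss_p:
                    socks[c] -= 1
            if socks[c] <= 0:
                socks.pop(c)
    return worn_days

strict = simulate(tolerate_mismatch=False)
tolerant = simulate(tolerate_mismatch=True)
```

由于“存在成对颜色”必然蕴含“总数不少于两只”,容忍混搭的可穿天数恒不低于严格匹配,这正是摘要所述“严格匹配的节约假象来自无袜可用日”的机制。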

[IR-6] A Simple yet Effective Negative Sampling Plugin for Constructing Positive Sample Pairs in Implicit Collaborative Filtering

【速读】:该论文旨在解决隐式协同过滤(Implicit Collaborative Filtering, ICF)模型在训练过程中对正样本利用不足、负采样策略设计复杂但忽视正样本质量,以及用户活跃度偏差导致不活跃用户偏好学习不足的问题。其解决方案的关键在于提出一种简单而有效的负采样插件(PSP-NS),通过构建带权的用户-物品二分图来量化交互置信度(融合全局与局部模式),采用基于复制的重加权机制生成正样本对以增强正向监督信号,并引入活动感知权重策略提升对不活跃用户的建模能力。理论分析从边际改进角度解释了为何该方法能提升排序性能(如Precision@k/Recall@k),实验表明其在多个真实数据集上显著优于现有基线方法。

链接: https://arxiv.org/abs/2602.18206
作者: Jiayi Wu,Zhengyu Wu,Xunkai Li,Ronghua Li,Guoren Wang
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Most implicit collaborative filtering (CF) models are trained with negative sampling, where existing work designs sophisticated strategies for high-quality negatives while largely overlooking the exploration of positive samples. Although some denoising recommendation methods can be applied to implicit CF for denoising positive samples, they often sparsify positive supervision. Moreover, these approaches generally overlook user activity bias during training, leading to insufficient learning for inactive users. To address these issues, we propose a simple yet effective negative sampling plugin, PSP-NS, from the perspective of enhancing positive supervision signals. It builds a user-item bipartite graph with edge weights indicating interaction confidence inferred from global and local patterns, generates positive sample pairs via replication-based reweighting to strengthen positive signals, and adopts an activity-aware weighting scheme to effectively learn inactive users’ preferences. We provide theoretical insights from a margin-improvement perspective, explaining why PSP-NS tends to improve ranking quality (e.g., Precision@k/Recall@k), and conduct extensive experiments on four real-world datasets to demonstrate its superiority. For instance, PSP-NS boosts Recall@30 and Precision@30 by 32.11% and 22.90% on Yelp over the strongest baselines. PSP-NS can be integrated with various implicit CF recommenders or negative sampling methods to enhance their performance.
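摘要中“基于复制的重加权 + 活跃度感知权重”可用下面的玩具示例示意:交互置信度越高、用户越不活跃,则该正样本被复制的份数越多,从而在训练中获得更强的监督信号。置信度取值与权重公式均为示意性假设。

```python
# PSP-NS 正样本对构建的玩具示意(置信度与公式均为假设)。
interactions = [  # (用户, 物品, 由全局+局部模式推断的交互置信度)
    ("u1", "i1", 0.9), ("u1", "i2", 0.4),
    ("u2", "i3", 0.8),
]
user_activity = {"u1": 2, "u2": 1}  # 交互次数

def build_positive_pairs(interactions, user_activity, max_copies=3):
    pairs = []
    for u, i, conf in interactions:
        # 活跃度感知权重:不活跃用户的正样本权重更大
        w = conf * (1.0 + 1.0 / user_activity[u])
        copies = max(1, min(max_copies, round(w * max_copies)))
        pairs.extend([(u, i)] * copies)  # 基于复制的重加权
    return pairs

pairs = build_positive_pairs(interactions, user_activity)
```

复制份数封顶于 `max_copies`,避免单条交互主导训练。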

[IR-7] SuiteEval: Simplifying Retrieval Benchmarks ECIR2026

【速读】:该论文旨在解决信息检索(Information Retrieval, IR)评估中存在的碎片化问题,如数据集子集不一致、聚合方法差异及管道配置多样化,这些问题严重削弱了实验的可复现性和可比性,尤其在基础嵌入模型(foundation embedding models)需要强泛化能力的情况下。解决方案的关键在于提出SuiteEval这一统一框架,其核心特性包括:自动化的端到端评估流程、动态索引机制(通过重用磁盘上的索引以最小化存储占用),以及对主流基准测试(BEIR、LoTTE、MS MARCO、NanoBEIR 和 BRIGHT)的内置支持;用户仅需提供一个管道生成器,其余数据加载、索引构建、排序、指标计算与结果聚合均由框架自动完成,且新增基准套件只需一行代码即可集成,从而显著减少重复性工作并标准化评估流程,促进可复现的IR研究。

链接: https://arxiv.org/abs/2602.18107
作者: Andrew Parry,Debasis Ganguly,Sean MacAvaney
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 5 pages, 3 figures, 2 tables, Accepted as a Demonstration to ECIR 2026

点击查看摘要

Abstract:Information retrieval evaluation often suffers from fragmented practices – varying dataset subsets, aggregation methods, and pipeline configurations – that undermine reproducibility and comparability, especially for foundation embedding models requiring robust out-of-domain performance. We introduce SuiteEval, a unified framework that offers automatic end-to-end evaluation, dynamic indexing that reuses on-disk indices to minimise disk usage, and built-in support for major benchmarks (BEIR, LoTTE, MS MARCO, NanoBEIR, and BRIGHT). Users only need to supply a pipeline generator. SuiteEval handles data loading, indexing, ranking, metric computation, and result aggregation. New benchmark suites can be added in a single line. SuiteEval reduces boilerplate and standardises evaluations to facilitate reproducible IR research, as a broader benchmark set is increasingly required.

[IR-8] Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

【速读】:该论文旨在解决金融问答(Financial Question Answering, FQA)中一种常见的检索失败模式:尽管正确文档已被检索到,但包含答案的页面或文本块(chunk)未能被定位,导致生成式模型基于不完整上下文进行推断,从而影响回答可靠性。这种“文档内检索失败”在高风险场景下尤为关键。解决方案的关键在于引入一个以页面为中间粒度的检索机制——通过领域微调(domain fine-tuning)的双编码器(bi-encoder)模型对财务文件中的页面进行相关性评分,利用页面自身的语义连贯性提升页面级召回率,进而改善最终文本块级别的检索准确性。实验证明,该方法显著提升了页面召回和块级检索效果,为缓解文档内检索偏差提供了有效路径。

链接: https://arxiv.org/abs/2602.17981
作者: Amine Kobeissi,Philippe Langlais
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings. We study a frequent failure mode in which the correct document is retrieved but the page or chunk that contains the answer is missed, leading the generator to extrapolate from incomplete context. Despite its practical significance, this within-document retrieval failure mode has received limited systematic attention in the Financial Question Answering (QA) literature. We evaluate retrieval at multiple levels of granularity, document, page, and chunk level, and introduce an oracle based analysis to provide empirical upper bounds on retrieval and generative performance. On a 150 question subset of FinanceBench, we reproduce and compare diverse retrieval strategies including dense, sparse, hybrid, and hierarchical methods with reranking and query reformulation. Across methods, gains in document discovery tend to translate into stronger page recall, yet oracle performance still suggests headroom for page and chunk level retrieval. To target this gap, we introduce a domain fine-tuned page scorer that treats pages as an intermediate retrieval unit between documents and chunks. Unlike prior passage-based hierarchical retrieval, we fine-tune a bi-encoder specifically for page-level relevance on financial filings, exploiting the semantic coherence of pages. Overall, our results demonstrate a significant improvement in page recall and chunk retrieval.
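“以页面为中间检索粒度”的思路可用下面的玩具示例示意:对已检索到的文档逐页打分,只对得分最高的页面继续切块。这里用词袋向量余弦相似度代替论文微调的双编码器(纯属占位假设)。

```python
import math

# 页面级打分的玩具示意:词袋向量 + 余弦相似度充当双编码器(假设)。
def embed(text):
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_pages(query, pages, k=1):
    q = embed(query)
    ranked = sorted(pages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

pages = ["revenue grew ten percent in fiscal 2023",
         "board members and governance policies",
         "total revenue for fiscal 2023 was 5 billion"]
best = top_pages("what was total revenue in fiscal 2023", pages, k=1)
```

在文档级召回正确的前提下,这一步决定了含答案的页面(进而是文本块)能否进入生成器上下文。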

[IR-9] Efficient Filtered-ANN via Learning-based Query Planning

【速读】:该论文旨在解决向量检索中过滤型近邻搜索(Filtered ANN search)的执行策略选择难题,即在预过滤(先过滤后ANN搜索)与后过滤(先ANN搜索后过滤)之间权衡效率与召回率的问题。预过滤虽能减少候选集规模,但需构建昂贵的谓词索引;后过滤则可能因低选择性导致候选不足而损失召回率。其解决方案的关键在于提出一种基于学习的查询规划框架,通过轻量级预测模型动态为每个查询选择最优执行路径,该模型利用数据集和查询统计特征(如维度、语料库大小、分布特征及谓词统计信息)进行决策,且支持多种过滤类型(类别/关键词和范围谓词),并兼容任意ANN索引结构,实验证明该方法可在保持90%召回率的前提下实现最高达4倍的加速。

链接: https://arxiv.org/abs/2602.17914
作者: Zhuocheng Gan,Yifan Wang
机构: University of Hawaii Manoa (夏威夷大学马诺阿分校)
类目: Databases (cs.DB); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Filtered ANN search is an increasingly important problem in vector retrieval, yet systems face a difficult trade-off due to the execution order: Pre-filtering (filtering first, then ANN over the passing subset) requires expensive per-predicate index construction, while post-filtering (ANN first, then filtering candidates) may waste computation and lose recall under low selectivity due to insufficient candidates after filtering. We introduce a learning-based query planning framework that dynamically selects the most effective execution plan for each query, using lightweight predictions derived from dataset and query statistics (e.g., dimensionality, corpus size, distribution features, and predicate statistics). The framework supports diverse filter types, including categorical/keyword and range predicates, and is generic to use any backend ANN index. Experiments show that our method achieves up to 4x acceleration with ≥ 90% recall compared to the strong baselines.
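下面用一个基于选择率的代价启发式示意“逐查询选择预过滤或后过滤”的规划决策;代价公式与阈值纯属示意性假设(论文中由轻量学习模型完成预测)。

```python
# 过滤型 ANN 查询规划的玩具示意:代价模型为假设,非论文所用预测器。
def plan(selectivity, corpus_size, dim):
    # 预过滤:对通过谓词的子集做暴力扫描
    pre_cost = selectivity * corpus_size * dim
    # 后过滤:选择率越低,ANN 候选被过滤后所剩越少,浪费越大
    post_cost = (1.0 / max(selectivity, 1e-6)) * 100 * dim
    return "pre-filter" if pre_cost < post_cost else "post-filter"

# 高选择性谓词:通过的条目极少,直接扫描它们更划算
plan_a = plan(selectivity=0.001, corpus_size=1_000_000, dim=128)
# 宽松谓词:先 ANN 再过滤即可
plan_b = plan(selectivity=0.5, corpus_size=1_000_000, dim=128)
```

论文的框架在此之上使用数据集与查询统计特征训练预测模型,而非固定公式。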

[IR-10] Enhancing Scientific Literature Chatbots with Retrieval-Augmented Generation: A Performance Evaluation of Vector and Graph-Based Systems

【速读】:该论文旨在解决科学文献聊天机器人在获取和整合学术知识时的效率与准确性问题,尤其关注如何通过检索增强生成(Retrieval-Augmented Generation, RAG)技术提升其对科研文献及灰色文献的访问能力。解决方案的关键在于构建一个融合结构化(图谱)与非结构化(向量)数据库的混合检索系统,从而实现基于研究目标的高效源文献筛选,并通过双场景基准测试(单文档检索与大规模语料检索)验证其性能优势,显著提升了检索准确性和回答相关性。

链接: https://arxiv.org/abs/2602.17856
作者: Hamideh Ghanadian,Amin Kamali,Mohammad Hossein Tekieh
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper investigates the enhancement of scientific literature chatbots through retrieval-augmented generation (RAG), with a focus on evaluating vector- and graph-based retrieval systems. The proposed chatbot leverages both structured (graph) and unstructured (vector) databases to access scientific articles and gray literature, enabling efficient triage of sources according to research objectives. To systematically assess performance, we examine two use-case scenarios: retrieval from a single uploaded document and retrieval from a large-scale corpus. Benchmark test sets were generated using a GPT model, with selected outputs annotated for evaluation. The comparative analysis emphasizes retrieval accuracy and response relevance, providing insight into the strengths and limitations of each approach. The findings demonstrate the potential of hybrid RAG systems to improve accessibility to scientific knowledge and to support evidence-based decision making.

[IR-11] VQPP: Video Query Performance Prediction Benchmark

【速读】:该论文旨在解决内容-based 视频检索(Content-Based Video Retrieval, CBVR)中查询性能预测(Query Performance Prediction, QPP)研究严重不足的问题。现有QPP方法主要集中在文本和图像检索领域,而视频场景下的性能预测尚缺乏系统性评估框架。解决方案的关键在于构建首个面向视频的查询性能预测基准(Video Query Performance Prediction, VQPP),该基准包含两个文本到视频检索数据集和两个CBVR系统,共涵盖56K条文本查询和51K个视频,并提供标准化的训练、验证与测试划分,支持可复现的实验对比。研究进一步探索了预检索与后检索两类性能预测器,发现预检索预测器已具备良好性能,可在实际检索前直接应用;同时,通过将最优预检索预测器作为奖励模型,结合直接偏好优化(Direct Preference Optimization, DPO)训练大语言模型(Large Language Model, LLM)用于查询改写任务,验证了VQPP的实际适用性。

链接: https://arxiv.org/abs/2602.17814
作者: Adrian Catalin Lutu,Eduard Poesina,Radu Tudor Ionescu
机构: University of Bucharest (布加勒斯特大学); Bitdefender
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems, respectively. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre-retrieval predictor as reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at this https URL.
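摘要提到预检索预测器已具竞争力;下面给出信息检索中一个经典的预检索 QPP 信号——查询词平均 IDF(查询越具体,通常检索效果越好)的极简实现。文档频率表为假设数据,VQPP 基准中评测的预测器远不止这一种。

```python
import math

# 预检索 QPP 信号的玩具示意:查询词平均 IDF(数据为假设)。
doc_freq = {"the": 900, "cat": 40, "pirouette": 2}
n_docs = 1000

def avg_idf(query):
    terms = query.lower().split()
    idfs = [math.log(n_docs / doc_freq.get(t, 1)) for t in terms]
    return sum(idfs) / len(idfs)

vague = avg_idf("the cat")          # 含高频停用词,特异性低
specific = avg_idf("cat pirouette")  # 词更罕见,预测性能更好
```

此类信号无需执行检索即可计算,因而可用于检索前的系统选择或查询改写触发。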

[IR-12] EXACT: Explicit Attribute-Guided Decoding-Time Personalization

【速读】:该论文旨在解决大语言模型在解码阶段进行个性化对齐时面临的两大挑战:一是现有方法依赖隐式、难以解释的偏好表示,二是用户表征缺乏上下文敏感性,无法捕捉偏好随提示(prompt)变化的动态特性。解决方案的关键在于提出EXACT框架,其核心创新是通过预定义的可解释属性集合来显式建模用户偏好,并采用两阶段机制实现高效个性化:首先在离线阶段基于有限的成对偏好反馈,最大化偏好响应的似然以识别用户的特定属性子集;其次在在线推理阶段,根据当前提示语义检索最相关的属性并注入上下文以引导生成。该方法在温和假设下具备理论近似保证,且相似性检索机制能有效缓解上下文偏好漂移问题,在多个标注偏好数据集上显著优于强基线模型,提升偏好建模准确性和个性化生成质量。

链接: https://arxiv.org/abs/2602.17695
作者: Xin Yu,Hanwen Xing,Lingzhou Xue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Achieving personalized alignment requires adapting large language models to each user’s evolving context. While decoding-time personalization offers a scalable alternative to training-time methods, existing methods largely rely on implicit, less interpretable preference representations and impose a rigid, context-agnostic user representation, failing to account for how preferences shift across prompts. We introduce EXACT, a new decoding-time personalization that aligns generation with limited pairwise preference feedback using a predefined set of interpretable attributes. EXACT first identifies user-specific attribute subsets by maximizing the likelihood of preferred responses in the offline stage. Then, for online inference, EXACT retrieves the most semantically relevant attributes for an incoming prompt and injects them into the context to steer generation. We establish theoretical approximation guarantees for the proposed algorithm under mild assumptions, and provably show that our similarity-based retrieval mechanism effectively mitigates contextual preference shifts, adapting to disparate tasks without pooling conflicting preferences. Extensive experiments on human-annotated preference datasets demonstrate that EXACT consistently outperforms strong baselines, including preference modeling accuracy and personalized generation quality.
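EXACT 的在线阶段(按提示语义检索最相关属性并注入上下文)可用下面的玩具示例示意:这里以词重叠相似度代替语义相似度,属性文本亦为虚构示例,均非论文实现。

```python
# EXACT 在线属性检索与注入的玩具示意(相似度与属性均为假设)。
user_attributes = ["concise bullet-point answers",
                   "formal tone for work email",
                   "beginner-friendly code explanations"]

def sim(a, b):
    # 词重叠 Jaccard 相似度,充当语义相似度的占位
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def personalize(prompt, attributes, k=1):
    # 检索与当前提示最相关的 k 个属性,注入上下文引导生成
    chosen = sorted(attributes, key=lambda a: sim(prompt, a), reverse=True)[:k]
    return "User preferences: " + "; ".join(chosen) + "\n" + prompt

ctx = personalize("explain this code to a beginner", user_attributes)
```

离线阶段则在成对偏好反馈上最大化偏好响应似然,以选出用户的属性子集;此处未展开。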

[IR-13] IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering

【速读】:该论文旨在解决视觉文档处理中图像模态与文本模态在检索和问答任务上的性能比较问题,特别是针对科学论文页面这一复杂场景下,如何有效利用多模态基础模型提升信息获取效率。其关键解决方案是构建了一个大规模、高质量的基准数据集IRPAPERS(包含3,230页科学论文的图像与OCR文本对),并通过180个“大海捞针”(needle-in-the-haystack)式问题系统评估了图像与文本两种模态的检索与问答表现,发现二者存在互补性失效模式,从而提出多模态混合搜索策略,在Recall@1、Recall@5和Recall@20上分别达到49%、81%和95%,显著优于单一模态方法;同时识别出不同问题类型对模态的依赖性,为后续多模态文档理解提供实证依据与优化方向。

链接: https://arxiv.org/abs/2602.17687
作者: Connor Shorten,Augustas Skaburskas,Daniel M. Jones,Charles Pierse,Roberto Esposito,John Trengrove,Etienne Dilocker,Bob van Luijt
机构: Weaviate
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 23 pages, 6 figures

点击查看摘要

Abstract:AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, enabling multimodal hybrid search to outperform either alone, achieving 49% Recall@1, 81% Recall@5, and 95% Recall@20. We further evaluate efficiency-performance tradeoffs with MUVERA and assess multiple multi-vector image embedding models. Among closed-source models, Cohere Embed v4 page image embeddings outperform Voyage 3 Large text embeddings and all tested open-source models, achieving 58% Recall@1, 87% Recall@5, and 97% Recall@20. For question answering, text-based RAG systems achieved higher ground-truth alignment than image-based systems (0.82 vs. 0.71), and both benefit substantially from increased retrieval depth, with multi-document retrieval outperforming oracle single-document retrieval. We analyze the complementary limitations of unimodal text and image representations and identify question types that require one modality over the other. The IRPAPERS dataset and all experimental code are publicly available.
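论文报告多模态混合检索优于任一单模态;实现这类融合的一种标准做法是倒数排名融合(Reciprocal Rank Fusion, RRF),示意如下。玩具排名为虚构数据,论文实际采用的融合方式可能不同。

```python
# 文本与图像两路排名的 RRF 融合示意(排名数据为假设)。
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            # 标准 RRF:得分为 1/(k + 名次),对各路排名求和
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

text_ranking = ["p3", "p1", "p7"]    # 文本检索器给出的页面排名
image_ranking = ["p1", "p9", "p3"]   # 图像检索器给出的页面排名
fused = rrf([text_ranking, image_ranking])
```

由于两种模态的失败模式互补,在两路都靠前的页面(此处为 p1)会被融合排名推到首位。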

[IR-14] When & How to Write for Personalized Demand-aware Query Rewriting in Video Search

【速读】:该论文旨在解决视频搜索系统中因用户历史行为信号稀释和反馈延迟导致的搜索意图识别不准确与歧义难以消解的问题。其核心解决方案是提出WeWrite框架,关键在于:(1) 通过基于后验概率的自动化样本挖掘策略确定个性化重写查询的触发时机;(2) 采用监督微调(Supervised Fine-Tuning, SFT)与组相对策略优化(Group Relative Policy Optimization, GRPO)相结合的混合训练范式,使大语言模型(Large Language Model, LLM)输出风格与检索系统对齐;(3) 利用并行“假召回”(Fake Recall)架构实现低延迟部署。在线A/B测试表明,该方案在点击率相关指标上显著提升,同时降低查询重构频率。

链接: https://arxiv.org/abs/2602.17667
作者: Cheng cheng,Chenxing Wang,Aolin Li,Haijun Wu,Huiyun Hu,Juyuan Wang
机构: Weixin Group, Tencent (微信团队,腾讯)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In video search systems, user historical behaviors provide rich context for identifying search intent and resolving ambiguity. However, traditional methods utilizing implicit history features often suffer from signal dilution and delayed feedback. To address these challenges, we propose WeWrite, a novel Personalized Demand-aware Query Rewriting framework. Specifically, WeWrite tackles three key challenges: (1) When to Write: An automated posterior-based mining strategy extracts high-quality samples from user logs, identifying scenarios where personalization is strictly necessary; (2) How to Write: A hybrid training paradigm combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to align the LLM’s output style with the retrieval system; (3) Deployment: A parallel “Fake Recall” architecture ensures low latency. Online A/B testing on a large-scale video platform demonstrates that WeWrite improves the Click-Through Video Volume (VV 10s) by 1.07% and reduces the Query Reformulation Rate by 2.97%.

[IR-15] Wavenumber-domain signal processing for holographic MIMO: Foundations methods and future directions

【速读】:该论文旨在解决传统多输入多输出(Multiple-Input Multiple-Output, MIMO)系统在亚波长天线间距下无法准确刻画近场与远场混合传播特性的问题,尤其是在经典离散傅里叶变换(Discrete Fourier Transform, DFT)表示失效时的信道建模难题。其解决方案的关键在于引入波数域(wavenumber domain)信号处理框架,通过空间傅里叶平面波分解(spatial Fourier plane-wave decomposition)对全息MIMO(Holographic MIMO, H-MIMO)信道进行建模,从而提供一个统一且物理一致的表征方式,能够精确描述亚波长级空间相关性和球面波传播特性,为下一代无线系统的复用、信道估计和波形设计等关键技术奠定理论基础。

链接: https://arxiv.org/abs/2602.17705
作者: Zijian Zhang,Linglong Dai
机构: Tsinghua University (清华大学); Beijing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心)
类目: Signal Processing (eess.SP); Information Retrieval (cs.IR); Systems and Control (eess.SY)
备注: Accepted by IEEE Communications Standards Magazine. 6 pages, 5 figures

点击查看摘要

Abstract:Holographic multiple-input multiple-output (H-MIMO) systems represent a paradigm shift in wireless communications by enabling quasi-continuous apertures. Unlike conventional MIMO systems, H-MIMO with subwavelength antenna spacing operates in both far-field and near-field regimes, where classical discrete Fourier transform (DFT) representations fail to sufficiently capture the channel characteristics. To address this challenge, this article provides an overview of the emerging wavenumber-domain signal processing framework. Specifically, by leveraging spatial Fourier plane-wave decomposition to model H-MIMO channels, the wavenumber domain offers a unified and physically consistent basis for characterizing subwavelength-level spatial correlation and spherical wave propagation. This article first introduces the concept of H-MIMO and the wavenumber representation of H-MIMO channels. Next, it elaborates on wavenumber-domain signal processing technologies reported in the literature, including multiplexing, channel estimation, and waveform designs. Finally, it highlights open challenges and outlines future research directions in wavenumber-domain signal processing for next-generation wireless systems.
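摘要所述的空间傅里叶平面波分解,其示意形式如下(记号沿用全息 MIMO 文献的通用写法,未必与本文完全一致):信道响应被表示为波数域平面波的叠加,

```latex
h(x, y, z) \;=\; \frac{1}{(2\pi)^2}
  \iint_{k_x^2 + k_y^2 \le \kappa^2}
  a(k_x, k_y)\, e^{\, j \left(k_x x + k_y y + \gamma(k_x, k_y)\, z\right)}
  \, \mathrm{d}k_x \, \mathrm{d}k_y ,
\qquad
\gamma(k_x, k_y) = \sqrt{\kappa^2 - k_x^2 - k_y^2},
```

其中 \(\kappa = 2\pi/\lambda\) 为波数。仅波数圆盘 \(k_x^2 + k_y^2 \le \kappa^2\) 内的分量对应可传播平面波,这给出了一个与天线数无关的有限维有效表征,正是波数域信号处理的物理基础。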

人机交互

[HC-0] AI-Wrapped: Participatory Privacy-Preserving Measurement of Longitudinal LLM Use In-the-Wild

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)对齐研究中自然情境下使用数据难以获取的问题,尤其是在隐私保护与平台控制限制下。其解决方案的关键在于提出并部署了一个名为AI-Wrapped的原型工作流,通过向参与者提供即时“包裹式”报告(包括使用统计、高频主题及安全相关行为模式),在不保留原始数据且移除个人身份信息(PII)的前提下,提升用户信任与参与意愿,从而实现对LLM日常使用行为的自然主义采集与分析。

链接: https://arxiv.org/abs/2602.18415
作者: Cathy Mengying Fang,Sheer Karny,Chayapatr Archiwaranguprok,Yasith Samaradivakara,Pat Pataranutaporn,Pattie Maes
机构: MIT Media Lab (麻省理工学院媒体实验室)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Alignment research on large language models (LLMs) increasingly depends on understanding how these systems are used in everyday contexts, yet naturalistic interaction data is difficult to access due to privacy constraints and platform control. We present AI-Wrapped, a prototype workflow for collecting naturalistic LLM usage data while providing participants with an immediate "wrapped"-style report on their usage statistics, top topics, and safety-relevant behavioral patterns. We report findings from an initial deployment with 82 U.S.-based adults across 48,495 conversations from their 2025 histories. Participants used LLMs for both instrumental and reflective purposes, including creative work, professional tasks, and emotional or existential themes. Some usage patterns were consistent with potential over-reliance or perfectionistic refinement, while heavier users showed comparatively more reflective exchanges than primarily transactional ones. Methodologically, even with zero data retention and PII removal, participants may remain hesitant to share chat data due to perceived privacy and judgment risks, underscoring the importance of trust, agency, and transparent design when building measurement infrastructure for alignment research.
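这类采集流程在分析前需要移除 PII;下面给出一个基于正则表达式的极简脱敏示意。真实部署通常还会叠加命名实体识别与零数据保留处理,此处的模式列表仅为演示,并非该研究的实际管线。

```python
import re

# PII 脱敏的玩具示意:正则替换邮箱与电话(模式为演示性假设)。
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
]

def scrub(text):
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

clean = scrub("Contact me at jane.doe@example.com or 555-123-4567.")
```

正如论文所指出的,即便做了此类技术处理,参与者对隐私的信任仍取决于流程的透明度与可感知的控制权。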

[HC-1] “How Do I …?”: Procedural Questions Predominate Student-LLM Chatbot Conversations

【速读】:该论文旨在解决教育聊天机器人(Educational Chatbot)在基于大语言模型(Large Language Model, LLM)的交互中,如何有效识别和分类学生提出的“困境驱动型问题”(impasse-driven questions),以评估其对教学效果的影响。其关键解决方案在于利用LLM作为评分者(rater),对来自两种不同学习情境(形成性自修与总结性评估作业)的6,113条学生提问进行四类现有分类框架下的标注,并验证了LLM在分类任务中的可靠性和一致性——结果表明LLM表现出中等到良好的组间一致性,且优于人类评分者。然而,研究也指出当前分类框架存在语义覆盖不足的问题,难以捕捉复合提示的丰富性,因此建议未来采用更具对话性的分析方法,如话语心理分析中的会话分析技术,以更全面理解聊天机器人集成带来的潜在风险与收益。

链接: https://arxiv.org/abs/2602.18372
作者: Alexandra Neagu,Marcus Messer,Peter Johnson,Rhodri Nelson
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures

点击查看摘要

Abstract:Providing scaffolding through educational chatbots built on Large Language Models (LLM) has potential risks and benefits that remain an open area of research. When students navigate impasses, they ask for help by formulating impasse-driven questions. Within interactions with LLM chatbots, such questions shape the user prompts and drive the pedagogical effectiveness of the chatbot’s response. This paper focuses on such student questions from two datasets of distinct learning contexts: formative self-study, and summative assessed coursework. We analysed 6,113 messages from both learning contexts, using 11 different LLMs and three human raters to classify student questions using four existing schemas. On the feasibility of using LLMs as raters, results showed moderate-to-good inter-rater reliability, with higher consistency than human raters. The data showed that ‘procedural’ questions predominated in both learning contexts, but more so when students prepare for summative assessment. These results provide a basis on which to use LLMs for classification of student questions. However, we identify clear limitations in both the ability to classify with schemas and the value of doing so: schemas are limited and thus struggle to accommodate the semantic richness of composite prompts, offering only partial understanding of the wider risks and benefits of chatbot integration. In the future, we recommend an analysis approach that captures the nuanced, multi-turn nature of conversation, for example, by applying methods from conversation analysis in discursive psychology.
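摘要报告了 LLM 评分者之间中等到良好的评分者间一致性,但未说明所用的一致性统计量。下面给出一个计算 Cohen's kappa(两评分者场景下常用的机会校正一致性度量之一,此处仅作示意)的最小 Python 草图,其中的标签数据纯属示例假设,并非论文数据:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling with each rater's marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical question-type labels from an LLM rater and a human rater
rater_llm   = ["procedural", "procedural", "conceptual", "procedural", "other"]
rater_human = ["procedural", "conceptual", "conceptual", "procedural", "other"]
print(round(cohens_kappa(rater_llm, rater_human), 3))  # 0.688
```

论文涉及 11 个 LLM 与 3 名人类评分者,实际研究中通常会使用支持多评分者的统计量(如 Fleiss' kappa 或 Krippendorff's alpha)。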

[HC-2] Qualitative Coding Analysis through Open-Source Large Language Models : A User Study and Design Recommendations

【速读】:该论文旨在解决定性数据分析(Qualitative Data Analysis)过程中劳动强度大,以及在敏感研究中因商业大型语言模型(Large Language Models, LLMs)存在隐私风险而难以使用的问题。其解决方案的关键在于提出一种基于本地部署开源LLM的框架ChatQDA,实现隐私保护下的开放式编码(open coding),通过将计算任务限制在设备端(on-device),避免数据外传,从而提升研究数据的安全性与合规性。同时,研究强调了用户对“可验证隐私”和方法学严谨性的需求,指出仅依赖技术层面的安全措施不足以建立信任,需进一步增强工具的透明性和可解释性。

链接: https://arxiv.org/abs/2602.18352
作者: Tung T. Ngo,Dai Nguyen Van,Anh-Minh Nguyen,Phuong-Anh Do,Anh Nguyen-Quoc
机构: Technological University Dublin (都柏林理工学院); National Economics University (国立经济大学); University College Dublin (都柏林大学); Foreign Trade University (外贸大学)
类目: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
备注: 6 pages. Accepted as Poster to CHI’26

点击查看摘要

Abstract:Qualitative data analysis is labor-intensive, yet the privacy risks associated with commercial Large Language Models (LLMs) often preclude their use in sensitive research. To address this, we introduce ChatQDA, an on-device framework powered by open-source LLMs designed for privacy-preserving open coding. Our mixed-methods user study reveals that while participants rated the system highly for usability and perceived efficiency, they exhibited “conditional trust”, valuing the tool for surface-level extraction while questioning its interpretive nuance and consistency. Furthermore, despite the technical security of local deployment, participants reported epistemic uncertainty regarding data protection, suggesting that invisible security measures are insufficient to foster trust. We conclude with design recommendations for local-first analysis tools that prioritize verifiable privacy and methodological rigor.

[HC-3] Robo-Saber: Generating and Simulating Virtual Reality Players

【速读】:该论文旨在解决虚拟现实(VR)游戏在开发阶段缺乏高效、可扩展的自动化测试手段的问题,尤其是如何生成真实且多样化的玩家动作以支持游戏机制的快速迭代与评估。解决方案的关键在于构建一个基于风格示例引导的运动生成系统 Robo-Saber,该系统能够从游戏中物体布局出发,生成符合特定技能水平和行为模式的 VR 头显与手持控制器运动轨迹,并通过最大化模拟的游戏得分来优化输出结果。该方法依托于大规模 BOXRR-23 数据集进行训练,在 Beat Saber 游戏中验证了其有效性,实现了对不同玩家行为特征的精准建模与复现,为物理驱动的全身 VR 玩家代理提供了一种可扩展的数据合成方案。

链接: https://arxiv.org/abs/2602.18319
作者: Nam Hee Kim,Jingjing May Liu,Jaakko Lehtinen,Perttu Hämäläinen,James F. O’Brien,Xue Bin Peng
机构: Aalto University (阿尔托大学); University of California, Berkeley (加州大学伯克利分校); NVIDIA; Simon Fraser University (西蒙弗雷泽大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 13 pages, 15 figures. Accepted to Eurographics 2026. Project page: this https URL

点击查看摘要

Abstract:We present the first motion generation system for playtesting virtual reality (VR) games. Our player model generates VR headset and handheld controller movements from in-game object arrangements, guided by style exemplars and aligned to maximize simulated gameplay score. We train on the large BOXRR-23 dataset and apply our framework on the popular VR game Beat Saber. The resulting model Robo-Saber produces skilled gameplay and captures diverse player behaviors, mirroring the skill levels and movement patterns specified by input style exemplars. Robo-Saber demonstrates promise in synthesizing rich gameplay data for predictive applications and enabling a physics-based whole-body VR playtesting agent.

[HC-4] Aurora: Neuro-Symbolic AI Driven Advising Agent

【速读】:该论文旨在解决高等教育中学术指导资源严重不足的问题,即导师与学生比例普遍超过300:1,导致学生难以及时获得指导、毕业延迟风险增加以及支持不平等现象加剧。解决方案的关键在于提出Aurora——一个模块化的神经符号(neuro-symbolic) advising代理系统,其核心创新是将检索增强生成(Retrieval-Augmented Generation, RAG)、符号推理与规范化课程数据库相结合,实现政策合规且可验证的推荐服务。具体而言,Aurora通过Boyce-Codd Normal Form (BCNF) 数据库模式确保课程规则一致性,利用Prolog引擎执行先修课程和学分约束,同时借助指令微调的大语言模型提供自然语言解释,从而在保持高精度(近半数场景达到完美精确率与召回率)的同时,显著提升响应速度(平均延迟0.71秒),较原始大语言模型基线快约83倍,实现了可解释、准确且可扩展的AI驱动学术指导新范式。

链接: https://arxiv.org/abs/2602.17999
作者: Lorena Amanda Quincoso Lugones,Christopher Kverne,Nityam Sharadkumar Bhimani,Ana Carolina Oliveira,Agoritsa Polyzou,Christine Lisetti,Janki Bhimani
机构: Florida International University (佛罗里达国际大学); Northeastern University (东北大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to 41st ACM/SIGAPP Symposium On Applied Computing. 8 Pages, 3 Figures

点击查看摘要

Abstract:Academic advising in higher education is under severe strain, with advisor-to-student ratios commonly exceeding 300:1. These structural bottlenecks limit timely access to guidance, increase the risk of delayed graduation, and contribute to inequities in student support. We introduce Aurora, a modular neuro-symbolic advising agent that unifies retrieval-augmented generation (RAG), symbolic reasoning, and normalized curricular databases to deliver policy-compliant, verifiable recommendations at scale. Aurora integrates three components: (i) a Boyce-Codd Normal Form (BCNF) catalog schema for consistent program rules, (ii) a Prolog engine for prerequisite and credit enforcement, and (iii) an instruction-tuned large language model for natural-language explanations of its recommendations. To assess performance, we design a structured evaluation suite spanning common and edge-case advising scenarios, including short-term scheduling, long-term roadmapping, skill-aligned pathways, and out-of-scope requests. Across this diverse set, Aurora improves semantic alignment with expert-crafted answers from 0.68 (Raw LLM baseline) to 0.93 (+36%), achieves perfect precision and recall in nearly half of in-scope cases, and consistently produces correct fallbacks for unanswerable prompts. On commodity hardware, Aurora delivers sub-second mean latency (0.71s across 20 queries), approximately 83X faster than a Raw LLM baseline (59.2s). By combining symbolic rigor with neural fluency, Aurora advances a paradigm for accurate, explainable, and scalable AI-driven advising.
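Aurora 使用 Prolog 引擎执行先修课程与学分约束,论文未公开具体规则。下面用 Python 给出一个先修检查逻辑的最小示意(课程表、课程名与“满足任一先修组合即可选课”的规则形态均为示例假设,并非 Aurora 的实际目录):

```python
# Hypothetical catalog: course -> list of prerequisite sets;
# satisfying ANY one set makes the student eligible.
PREREQS = {
    "CS2": [{"CS1"}],
    "CS3": [{"CS2", "MATH1"}],
}

def eligible(course, completed):
    """True if some prerequisite set for `course` is a subset of `completed`."""
    options = PREREQS.get(course, [set()])  # no entry = no prerequisites
    return any(req <= completed for req in options)

print(eligible("CS3", {"CS1", "CS2"}))           # False (MATH1 missing)
print(eligible("CS3", {"CS1", "CS2", "MATH1"}))  # True
```

在 Prolog 中,同样的约束会写成 `eligible(Course)` 形式的规则,由推理引擎对整个课程计划做一致性验证;上面的集合包含检查只是该约束的一种过程式近似。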

[HC-5] DuoTouch: Passive Two-Footprint Attachments Using Binary Sequences to Extend Touch Interaction

【速读】:该论文旨在解决在电容式触摸面板上添加物理输入方式时面临的挑战,即如何在不显著遮挡内容或减少可用输入区域的前提下实现可靠、高效的交互。其核心解决方案是提出DuoTouch——一种被动式附加装置,通过两个接触足迹(contact footprint)和两条导线编码运动信息为二进制序列,并利用标准触摸API在未经修改的设备上运行。关键创新在于采用两种配置:一种为对齐配置,将固定长度编码映射为离散命令;另一种为相位偏移配置,通过相对时间差估算方向与距离,从而在有限采样率下实现高精度解码。

链接: https://arxiv.org/abs/2602.17961
作者: Kaori Ikematsu,Kunihiro Kato
机构: LY Corporation(LY公司); Tokyo University of Technology(东京工科大学)
类目: Human-Computer Interaction (cs.HC)
备注: 16 pages, 10 figures. Accepted to the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)

点击查看摘要

Abstract:DuoTouch is a passive attachment for capacitive touch panels that adds tangible input while minimizing content occlusion and loss of input area. It uses two contact footprints and two traces to encode motion as binary sequences and runs on unmodified devices through standard touch APIs. We present two configurations with paired decoders: an aligned configuration that maps fixed-length codes to discrete commands and a phase-shifted configuration that estimates direction and distance from relative timing. To characterize the system’s reliability, we derive a sampling-limited bound that links actuation speed, internal trace width, and device touch sampling rate. Through technical evaluations on a smartphone and a touchpad, we report performance metrics that describe the relationship between these parameters and decoding accuracy. Finally, we demonstrate the versatility of DuoTouch by embedding the mechanism into various form factors, including a hand strap, a phone ring holder, and touchpad add-ons.
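对于 aligned 配置中“将固定长度二进制编码映射为离散命令”的解码思路,可用如下 Python 最小草图说明。编码表、码长与比特含义均为示例假设(并非 DuoTouch 的实际编码方案),这里把每个采样步的比特理解为“哪个接触足迹处于触点状态”:

```python
# Hypothetical command table for an aligned configuration: each attachment
# motion yields a fixed-length binary code read off the touch panel.
CODE_TABLE = {
    "1010": "volume_up",
    "0101": "volume_down",
    "1100": "play_pause",
}

def decode_aligned(samples, code_len=4):
    """Group per-step contact bits into fixed-length codes and look them up."""
    commands = []
    for i in range(0, len(samples) - code_len + 1, code_len):
        code = "".join(samples[i:i + code_len])
        if code in CODE_TABLE:
            commands.append(CODE_TABLE[code])
    return commands

print(decode_aligned(list("10100101")))  # ['volume_up', 'volume_down']
```

论文中真正的可靠性约束来自采样率上界:当驱动速度相对设备触摸采样率过快时,相邻比特会被合并采样,解码即失效——这正是摘要中 sampling-limited bound 所刻画的关系。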

[HC-6] How Well Can 3D Accessibility Guidelines Support XR Development? An Interview Study with XR Practitioners in Industry

【速读】:该论文旨在解决现有3D游戏和虚拟世界可访问性(accessibility, a11y)指南在扩展现实(Extended Reality, XR)场景中适用性不足的问题,尤其是在空间追踪(spatial tracking)和身体感知交互(kinesthetic interactions)等XR独特交互范式下,开发者缺乏切实可行的指导。解决方案的关键在于通过半结构化访谈对25位来自不同组织背景的XR从业者进行调研,评估20条广泛认可的a11y指南在视觉、运动、认知、语音和听觉等多维度上的适用性,发现将指南设计为“转化催化剂”而非“合规检查清单”能显著提升其有效性,同时揭示了现有3D指南与XR需求之间的根本性不匹配,从而为制定适配XR特性的新型可访问性指南和支持工具提供了基础洞见。

链接: https://arxiv.org/abs/2602.17939
作者: Daniel Killough,Tiger F. Ji,Kexin Zhang,Yaxin Hu,Yu Huang,Ruofei Du,Yuhang Zhao
机构: University of Wisconsin-Madison(威斯康星大学麦迪逊分校); Vanderbilt University(范德比尔特大学); Google XR Labs(谷歌XR实验室)
类目: Human-Computer Interaction (cs.HC)
备注: ACM CHI 2026 Preprint. Short paper of Killough et al. “XR for All” 2024: arXiv:2412.16321

点击查看摘要

Abstract:While accessibility (a11y) guidelines exist for 3D games and virtual worlds, their applicability to extended reality (XR)‘s unique interaction paradigms (e.g., spatial tracking, kinesthetic interactions) remains unexplored. XR practitioners need practical guidance to successfully implement a11y guidelines under real-world constraints. We present the first evaluation of existing 3D a11y guidelines applied to XR development through semi-structured interviews with 25 XR practitioners across diverse organization contexts. We assessed 20 commonly-agreed a11y guidelines from six major resources across visual, motor, cognitive, speech, and hearing domains, comparing practitioners’ development practices against guideline applicability to XR. Our investigation reveals that guidelines can be highly effective when designed as transformation catalysts rather than compliance checklists, but fundamental mismatches exist between existing 3D guidelines and XR requirements, creating both implementation barriers and design gaps. This work provides foundational insights towards developing a11y guidelines and support tools that address XR’s distinct characteristics.

[HC-7] Growing With the Condition: Co-Designing Pediatric Technologies that Adapt Across Developmental Stages

【速读】:该论文试图解决的问题是:当前针对患有慢性疾病的儿童的健康技术支持大多将儿童视为同质群体,未能充分考虑其在不同发展阶段(如小学、初中和高中阶段)所表现出的差异化需求与应对策略,从而限制了技术干预的有效性。解决方案的关键在于通过以儿童为中心的参与式设计方法,开展四次共同设计工作坊,收集来自69名先天性心脏病(Congenital Heart Disease, CHD)患儿的反馈,揭示了不同年龄段儿童在应对慢性病管理时采用的不同策略——低年级儿童依赖安慰物和家长 reassurance(安抚),中年级学生使用中介化沟通和选择性披露,高年级学生则强调自主性和直接与同伴及医疗提供者互动。这一发现为开发能够随儿童发育轨迹动态调整的儿科健康技术提供了实证依据和设计启示。

链接: https://arxiv.org/abs/2602.17925
作者: Neda Barbazi,Ji Youn Shin,Gurumurthy Hiremath,Carlye Anne Lauff
机构: University of Minnesota (明尼苏达大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Children with chronic conditions face evolving challenges in daily activities, peer relationships, and clinical care. Younger children often rely on parental support, while older ones seek independence. Prior studies on chronic conditions explored proxy-based, family-centered, and playful approaches to support children’s health, but most technologies treat children as a homogeneous group rather than adapting to their developmental differences. To address this gap, we conducted four co-design workshops with 69 children with congenital heart disease (CHD) at a medically supported camp, spanning elementary, middle, and high school groups. Our analysis reveals distinct coping strategies: elementary children relied on comfort objects and reassurance, middle schoolers used mediated communication and selective disclosure, and high schoolers emphasized agency and direct engagement with peers and providers. Through child-centered participatory design, we contribute empirical insights into how children’s management of chronic conditions evolves and propose design implications for pediatric health technologies that adapt across developmental trajectories.

[HC-8] Visual Anthropomorphism Shifts Evaluations of Gendered AI Managers

【速读】:该论文旨在解决人工智能管理者(AI manager)在评估中是否存在性别偏见的问题,以及这种偏见是否受其呈现方式(如文本描述或视觉人脸)的影响。研究发现,当AI以文本形式呈现时,能力线索(competence cues)能够有效缓解性别偏差,使高能力AI管理者无论性别均被评价为更公平、更胜任和更具领导力;而当AI以视觉人脸呈现时,面部特征引发系统性性别差异化反应,女性化面孔的AI管理者在获得积极决策结果时被评价为更具能力和可信度,显示出与人类社会中性别刻板印象一致的偏见。因此,解决方案的关键在于:AI的表征模态(representational modality)决定了性别偏见是否被激活,设计选择对AI治理具有决定性影响

链接: https://arxiv.org/abs/2602.17919
作者: Ruiqing Han,Hao Cui,Taha Yasseri
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Preprint, Under Review

点击查看摘要

Abstract:This research examines whether competence cues can reduce gender bias in evaluations of AI managers and whether these effects depend on how the AI is represented. Across two preregistered experiments (N = 2,505), each employing a 2 x 2 x 3 design manipulating AI gender, competence, and decision outcome, we compared text-based descriptions of AI managers with visually generated AI faces created using a reverse-correlation paradigm. In the text condition, evaluations were driven by competence rather than gender. When participants received unfavourable decisions, high-competence AI managers were judged as fairer, more competent, and better leaders than low-competence managers, regardless of AI gender. In contrast, when the AI manager was visually represented, competence cues had attenuated influence once facial information was present. Instead, participants showed systematic gender-differentiated responses to AI faces, with feminine-appearing managers evaluated as more competent and more trustworthy than masculine-appearing managers, particularly when delivering favourable outcomes. These gender effects were largely absent when outcomes were unfavourable, suggesting that negative feedback attenuates the influence of both competence information and facial cues. Taken together, these findings show that competence information can mitigate negative reactions to AI managers in text-based interactions, whereas facial anthropomorphism elicits gendered perceptual biases not observed in text-only settings. The results highlight that representational modality plays a critical role in determining when gender stereotypes are activated in evaluations of AI systems and underscore that design choices are consequential for AI governance in evaluative contexts.

[HC-9] Games That Teach Chats That Convince: Comparing Interactive and Static Formats for Persuasive Learning

【速读】:该论文试图解决的问题是:在保持内容一致的前提下,不同信息传递形式(静态文章、对话式聊天机器人和叙事类文本游戏)如何影响用户的学习效果与说服力。其解决方案的关键在于通过受控的用户实验设计,对三种信息交付模式进行对比分析,发现尽管聊天机器人在主观感知上显著优于其他两种形式(如提升话题重要性认知),但学习成效存在“主观感受”与“客观表现”的脱节现象——例如,文本游戏参与者自评学习感较低,却在24小时后的知识测验中得分更高。这一发现揭示了交互性与学习效果之间的复杂关系,并强调在设计说服系统与严肃游戏时需权衡互动性、真实感与实际知识留存之间的潜在 trade-off。

链接: https://arxiv.org/abs/2602.17905
作者: Seyed Hossein Alavi,Zining Wang,Shruthi Chockkalingam,Raymond T. Ng,Vered Shwartz
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Interactive systems such as chatbots and games are increasingly used to persuade and educate on sustainability-related topics, yet it remains unclear how different delivery formats shape learning and persuasive outcomes when content is held constant. Grounding on identical arguments and factual content across conditions, we present a controlled user study comparing three modes of information delivery: static essays, conversational chatbots, and narrative text-based games. Across subjective measures, the chatbot condition consistently outperformed the other modes and increased perceived importance of the topic. However, perceived learning did not reliably align with objective outcomes: participants in the text-based game condition reported learning less than those reading essays, yet achieved higher scores on a delayed (24-hour) knowledge quiz. Additional exploratory analyses further suggest that common engagement proxies, such as verbosity and interaction length, are more closely related to subjective experience than to actual learning. These findings highlight a dissociation between how persuasive experiences feel and what participants retain, and point to important design trade-offs between interactivity, realism, and learning in persuasive systems and serious games.

[HC-10] HookLens: Visual Analytics for Understanding React Hooks Structures

【速读】:该论文旨在解决React Web应用维护与重构中的核心挑战——由于Hooks API导致的组件间依赖关系复杂化,进而引发代码行为不可预测和可维护性下降的问题(即反模式,anti-patterns)。解决方案的关键在于提出HookLens,一个交互式可视化分析系统,它通过直观展示Hooks定义的组件间依赖关系与数据流向,帮助开发者高效理解代码结构并识别反模式。该系统基于与资深React开发者的迭代设计过程构建,并在定量用户研究中验证了其显著优于传统代码编辑器的效果,甚至超越了当前最先进的大语言模型(LLM)编码助手在检测反模式任务上的表现。

链接: https://arxiv.org/abs/2602.17891
作者: Suyeon Hwang,Minkyu Kweon,Jeongmin Rhee,Soohyun Lee,Seokhyeon Park,Seokweon Jung,Hyeon Jeon,Jinwook Seo
机构: 未知
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: IEEE PacificVis 2026, conference track

点击查看摘要

Abstract:Maintaining and refactoring React web applications is challenging, as React code often becomes complex due to its core API called Hooks. For example, Hooks often lead developers to create complex dependencies among components, making code behavior unpredictable and reducing maintainability, i.e., anti-patterns. To address this challenge, we present HookLens, an interactive visual analytics system that helps developers understand how Hooks define dependencies and data flows between components. Informed by an iterative design process with experienced React developers, HookLens supports users to efficiently understand the structure and dependencies between components and to identify anti-patterns. A quantitative user study with 12 React developers demonstrates that HookLens significantly improves participants’ accuracy in detecting anti-patterns compared to conventional code editors. Moreover, a comparative study with state-of-the-art LLM-based coding assistants confirms that these improvements even surpass the capabilities of such coding assistants on the same task.

[HC-11] Exploring The Impact Of Proactive Generative AI Agent Roles In Time-Sensitive Collaborative Problem-Solving Tasks

【速读】:该论文旨在解决在时间压力下,团队协作解决问题时因信息过载、行动协调困难和进度追踪滞后而导致的效率低下问题(collaborative problem-solving under time pressure)。其核心解决方案是引入两种形式的主动式生成式AI代理:一种是作为“同伴”角色的代理(peer agent),通过提出想法和回答问题来辅助决策;另一种是作为“促进者”角色的代理(facilitator agent),通过提供总结和群体结构建议来增强团队组织性。研究发现,同伴代理虽能偶尔提升问题解决效率(如提供及时提示与记忆支持),但易破坏团队流状态并引发过度依赖;而促进者代理则仅提供轻量级支持,对整体绩效影响有限。因此,关键设计启示在于:主动式生成式AI应优先聚焦于降低认知负荷与维持团队自主性,而非直接干预任务执行。

链接: https://arxiv.org/abs/2602.17864
作者: Anirban Mukhopadhyay,Kevin Salubre,Hifza Javed,Shashank Mehrotra,Kumar Akash
机构: Virginia Tech (弗吉尼亚理工大学); Honda Research Institute (本田研究所)
类目: Human-Computer Interaction (cs.HC)
备注: Published in Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI’26)

点击查看摘要

Abstract:Collaborative problem-solving under time pressure is common but difficult, as teams must generate ideas quickly, coordinate actions, and track progress. Generative AI offers new opportunities to assist, but we know little about how proactive agents affect the dynamics of real-time, co-located teamwork. We studied two forms of proactive support in digital escape rooms: a facilitator agent that offered summaries and group structures, and a peer agent that proposed ideas and answered queries. In a within-subjects study with 24 participants, we compared group performance and processes across three conditions: no AI, peer, and facilitator. Results show that the peer agent occasionally enhanced problem-solving by offering timely hints and memory support; however, it also disrupted flow, increased workload, and created over-reliance. In comparison, the facilitator agent provided light scaffolding but had a limited impact on outcomes. We provide design considerations for proactive generative AI agents based on our findings.

[HC-12] Mind the Style: Impact of Communication Style on Human-Chatbot Interaction

【速读】:该论文旨在解决当前对话式智能体(Conversational Agents)在日常数字交互中日益普及背景下,其沟通风格对用户体验和任务完成度影响尚不明确的问题。研究通过一个双盲对照实验,让参与者与两个版本的聊天机器人NAVI进行交互,这两个版本仅在沟通风格上存在差异:一个友好支持型,另一个直接任务导向型。关键解决方案在于设计并实施了严谨的用户研究,发现友好风格显著提升了女性用户的主观满意度和任务完成率,而男性用户未表现出显著差异;同时,研究还表明用户并未明显模仿聊天机器人的沟通风格,暗示语言适应性有限。这一结果强调了根据用户特征和任务情境定制沟通风格的重要性,为提升人机交互质量提供了实证依据。

链接: https://arxiv.org/abs/2602.17850
作者: Erik Derner,Dalibor Kučera,Aditya Gulati,Ayoub Bagheri,Nuria Oliver
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear. Addressing this gap, we describe the results of a between-subject user study where participants interact with one of two versions of a chatbot called NAVI which assists users in an interactive map-based 2D navigation task. The two chatbot versions differ only in communication style: one is friendly and supportive, while the other is direct and task-focused. Our results show that the friendly style increases subjective satisfaction and significantly improves task completion rates among female participants only, while no baseline differences between female and male participants were observed in a control condition without the chatbot. Furthermore, we find little evidence of users mimicking the chatbot’s style, suggesting limited linguistic accommodation. These findings highlight the importance of user- and task-sensitive conversational agents and support that communication style personalization can meaningfully enhance interaction quality and performance.

[HC-13] Stop Saying “AI”

【速读】:该论文试图解决当前关于“AI”的讨论普遍过于笼统、缺乏针对性的问题,尤其是在军事领域中,不同类型的AI系统在决策机制、责任归属和风险特征上存在显著差异,而现有批判往往将所有AI系统混为一谈,导致政策制定与学术辩论难以聚焦。解决方案的关键在于:推动讨论从泛化的“AI”概念转向具体系统的精确识别与分析,要求研究人员、开发者和政策制定者明确指出所讨论的AI系统类型,并清晰界定其潜在收益与风险,从而提升 debates 的严谨性与实效性。这一方法不仅适用于军事场景,也对其他领域中AI相关讨论具有普适指导意义。

链接: https://arxiv.org/abs/2602.17729
作者: Nathan G. Wood(1,2,3),Scott Robbins(4),Eduardo Zegarra Berodt(1),Anton Graf von Westerholt(1),Michelle Behrndt(1,5),Daniel Kloock-Schreiber(1) ((1) Institute of Air Transportation Systems, Hamburg University of Technology, (2) Ethics + Emerging Sciences Group, California Polytechnic State University San Luis Obispo, (3) Center for Environmental and Technology Ethics - Prague, (4) Academy for Responsible Research, Teaching, and Innovation, Karlsruhe Institute of Technology, (5) Department of Philosophy, University of Hamburg)
机构: Hamburg University of Technology (汉堡工业大学); California Polytechnic State University San Luis Obispo (加州理工州立大学圣路易斯奥比斯波分校); Center for Environmental and Technology Ethics - Prague (布拉格环境与技术伦理中心); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); University of Hamburg (汉堡大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Across academia, industry, and government, "AI" has become central in research and development, regulatory debates, and promises of ever faster and more capable decision-making and action. In numerous domains, especially safety-critical ones, there are significant concerns over how "AI" may affect decision-making, responsibility, or the likelihood of mistakes (to name only a few categories of critique). However, for most critiques, the target is generally "AI", a broad term admitting many (types of) systems used for a variety of tasks and each coming with its own set of limitations, challenges, and potential use cases. In this article, we focus on the military domain as a case study and present both a loose enumerative taxonomy of systems captured under the umbrella term "military AI", as well as discussion of the challenges of each. In doing so, we highlight that critiques of one (type of) system will not always transfer to other (types of) systems. Building on this, we argue that in order for debates to move forward fruitfully, it is imperative that the discussions be made more precise and that "AI" be excised from debates to the extent possible. Researchers, developers, and policy-makers should make clear exactly what systems they have in mind and what possible benefits and risks attend the deployment of those particular systems. While we focus on AI in the military as an exemplar for the overall trends in discussions of "AI", the argument’s conclusions are broad and have import for discussions of AI across a host of domains.

[HC-14] Closing Africas Early Warning Gap: AI Weather Forecasting for Disaster Prevention

【速读】:该论文旨在解决非洲地区因基础设施成本高昂而导致的早期预警系统覆盖率严重不足的问题,特别是针对极端天气事件(如2026年1月南非等地暴雨)缺乏有效监测与响应能力的现状。其关键解决方案是提出一种基于NVIDIA Earth-2 AI气象模型的低成本、高效率部署架构,单个国家规模部署月成本仅为1,430–1,730美元,相比传统雷达站(超100万美元/台)降低2,000–4,545倍。该架构通过三项核心技术实现:(1)基于ProcessPoolExecutor的事件循环隔离模式,解决异步Python应用中aiobotocore会话生命周期冲突;(2)数据库驱动的服务架构,GPU直接将全球气象预报写入PostgreSQL,避免高分辨率张量通过HTTP传输的瓶颈;(3)自动化坐标管理机制,支持跨61个时间步的多阶段推理。最终,该方案实现了15天全球大气预报的快速查询(<200毫秒),并通过WhatsApp推送至用户,利用其80%以上的市场渗透率,使大陆尺度早期预警系统在经济上具备可行性,契合联合国减灾署(UNDRR)关于此类系统可降低灾害死亡率6倍的研究结论。

链接: https://arxiv.org/abs/2602.17726
作者: Qness Ndlovu
机构: 未知
类目: Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC); Human-Computer Interaction (cs.HC)
备注: 23 pages, 4 figures

点击查看摘要

Abstract:In January 2026, torrential rains killed 200-300 people across Southern Africa, exposing a critical reality: 60% of the continent lacks effective early warning systems due to infrastructure costs. Traditional radar stations exceed USD 1 million each, leaving Africa with an 18x coverage deficit compared to the US and EU. We present a production-grade architecture for deploying NVIDIA Earth-2 AI weather models at USD 1,430-1,730/month for national-scale deployment - enabling coverage at 2,000-4,545x lower cost than radar. The system generates 15-day global atmospheric forecasts, cached in PostgreSQL to enable user queries under 200 milliseconds without real-time inference. Deployed in South Africa in February 2026, our system demonstrates three technical contributions: (1) a ProcessPoolExecutor-based event loop isolation pattern that resolves aiobotocore session lifecycle conflicts in async Python applications; (2) a database-backed serving architecture where the GPU writes global forecasts directly to PostgreSQL, eliminating HTTP transfer bottlenecks for high-resolution tensors; and (3) an automated coordinate management pattern for multi-step inference across 61 timesteps. Forecasts are delivered via WhatsApp, leveraging 80%+ market penetration. This architecture makes continent-scale early warning systems economically viable, supporting UNDRR findings that such systems reduce disaster death rates by 6x. All architectural details are documented inline for full reproducibility.

[HC-15] Lost Before Translation: Social Information Transmission and Survival in AI-AI Communication

【速读】:该论文旨在解决生成式 AI (Generative AI) 在信息传递过程中如何改变内容的本质这一问题,尤其是当多个 AI 系统依次处理同一信息时,其演化机制与对人类认知的影响。解决方案的关键在于构建一种基于“传话游戏”(telephone game)的实验范式,通过追踪 AI 传输链中内容的变化,识别出三种稳定模式:收敛性(convergence)、选择性留存(selective survival)和竞争性过滤(competitive filtering)。这些机制揭示了 AI 间交互如何系统性地削弱原始文本的情感强度、证据细节与观点多样性,尽管最终输出在人类评估中显得更可信且更“ polished”,却导致事实记忆下降、平衡感知减弱和情感共鸣降低,从而警示 AI 中介可能以牺牲认知多样性为代价强化权威感。

链接: https://arxiv.org/abs/2602.17674
作者: Bijean Ghafouri,Emilio Ferrara
机构: University of Southern California(南加州大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:When AI systems summarize and relay information, they inevitably transform it. But how? We introduce an experimental paradigm based on the telephone game to study what happens when AI talks to AI. Across five studies tracking content through AI transmission chains, we find three consistent patterns. The first is convergence, where texts differing in certainty, emotional intensity, and perspectival balance collapse toward a shared default of moderate confidence, muted affect, and analytical structure. The second is selective survival, where narrative anchors persist while the texture of evidence, hedges, quotes, and attributions is stripped away. The third is competitive filtering, where strong arguments survive while weaker but valid considerations disappear when multiple viewpoints coexist. In downstream experiments, human participants rated AI-transmitted content as more credible and polished. Importantly, however, humans also showed degraded factual recall, reduced perception of balance, and diminished emotional resonance. We show that the properties that make AI-mediated content appear authoritative may systematically erode the cognitive and affective diversity on which informed judgment depends.
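论文的“传话游戏”范式,其链式传递框架可用如下 Python 草图说明:`relay` 处为占位函数(真实研究中是一次 LLM 调用),这里用“删除模糊限制语与情感词”来模拟摘要中观察到的向低情感、中等置信度表达收敛的现象;词表与示例文本均为假设:

```python
def transmission_chain(text, relay, hops=3):
    """Pass a message through successive retellings, recording every hop."""
    history = [text]
    for _ in range(hops):
        history.append(relay(history[-1]))
    return history

# Stand-in relay: strips hedges and affective intensifiers, a toy analogue
# of the convergence effect the study measures with real LLM relays.
HEDGES = {"perhaps", "possibly", "amazingly"}

def relay(text):
    return " ".join(w for w in text.split() if w.lower() not in HEDGES)

chain = transmission_chain("Perhaps the results are amazingly robust", relay)
print(chain[-1])  # the results are robust
```

实际研究中,每一跳保存的完整 `history` 使得可以逐跳度量确定性、情感强度与观点平衡的衰减轨迹。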

[HC-16] Digital self-Efficacy as a foundation for a generative AI usage framework in facultys professional practices

【速读】:该论文旨在解决高等教育教师在采纳生成式人工智能(Generative AI, GAI)过程中存在的差异性使用行为问题,特别是数字自我效能感(digital self-efficacy)如何影响其GAI的采纳模式。解决方案的关键在于识别出三种不同的用户类型(积极参与者、反思型保留者、批判型抵制者),并基于Bandura的社会认知理论与Flichy的使用框架,构建了一个包含四种社会技术配置的差异化使用框架,同时提出针对不同自我效能水平的个性化支持机制,以实现更有效的GAI整合路径。

链接: https://arxiv.org/abs/2602.17673
作者: Fatiha Tali(EFTS, LINE, Grhapes)
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: in French language

点击查看摘要

Abstract:This research explores the role of digital self-efficacy in the appropriation of generative artificial intelligence (GAI) by higher education faculty. Drawing on Bandura’s sociocognitive theory and Flichy’s concept of usage framework, our study examines the relationships between levels of digital self-efficacy and GAI usage profiles. A survey of 265 faculty members identified three user profiles (Engaged, Reflective Reserved, Critical Resisters) and validated a three-dimensional digital self-efficacy scale. Results reveal a significant association between self-efficacy profiles and GAI appropriation patterns. Based on these findings, we propose a differentiated usage framework integrating four sociotechnical configurations, appropriation trajectories adapted to self-efficacy profiles, and personalized institutional support mechanisms.

[HC-17] Assessing LLM Response Quality in the Context of Technology-Facilitated Abuse

【速读】:该论文旨在解决技术赋能型虐待(Technology-facilitated Abuse, TFA)受害者在寻求支持时面临的资源不足问题,特别是传统科技诊所因人力与后勤限制难以满足日益增长的需求,导致许多受害者转向在线资源。其解决方案的关键在于首次通过专家主导的系统性评估,检验四种大型语言模型(Large Language Models, LLMs)——包括两类通用非推理模型和两类专为亲密伴侣暴力(Intimate Partner Violence, IPV)场景设计的领域特定模型——在零样本单轮对话中对TFA相关问题的回答质量。研究采用以受害者安全为中心的提示策略,并基于真实文献与在线论坛收集的问题,在针对性设计的评估维度上进行测评,同时结合经历过TFA的用户反馈,量化其响应的可操作性(actionability),从而揭示当前LLMs在TFA支持中的能力边界与改进方向,为未来面向受害者支持的模型设计、开发与微调提供实证依据与具体建议。

链接: https://arxiv.org/abs/2602.17672
作者: Vijay Prakash,Majed Almansoori,Donghan Hu,Rahul Chatterjee,Danny Yuxing Huang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Technology-facilitated abuse (TFA) is a pervasive form of intimate partner violence (IPV) that leverages digital tools to control, surveil, or harm survivors. While tech clinics are one of the reliable sources of support for TFA survivors, they face limitations due to staffing constraints and logistical barriers. As a result, many survivors turn to online resources for assistance. With the growing accessibility and popularity of large language models (LLMs), and increasing interest from IPV organizations, survivors may begin to consult LLM-based chatbots before seeking help from tech clinics. In this work, we present the first expert-led manual evaluation of four LLMs - two widely used general-purpose non-reasoning models and two domain-specific models designed for IPV contexts - focused on their effectiveness in responding to TFA-related questions. Using real-world questions collected from literature and online forums, we assess the quality of zero-shot single-turn LLM responses generated with a survivor safety-centered prompt on criteria tailored to the TFA domain. Additionally, we conducted a user study to evaluate the perceived actionability of these responses from the perspective of individuals who have experienced TFA. Our findings, grounded in both expert assessment and user feedback, provide insights into the current capabilities and limitations of LLMs in the TFA context and may inform the design, development, and fine-tuning of future models for this domain. We conclude with concrete recommendations to improve LLM performance for survivor support.

[HC-18] AI Hallucination from Students Perspective: A Thematic Analysis

【速读】:该论文旨在解决学生在依赖大语言模型(Large Language Models, LLMs)进行学习时,因生成式AI(Generative AI)幻觉问题而面临的信息准确性风险。研究发现,学生常遇到的幻觉类型包括错误或虚构的引用、虚假信息、过度自信但误导性的回答、对提示指令的低遵循度、持续坚持错误答案以及迎合性回应(sycophancy)。为应对这一挑战,论文提出解决方案的关键在于扩展AI素养教育,不仅限于提示工程(prompt engineering),更需引入针对幻觉识别与应对的系统性训练:一是培养学生主动验证信息的能力(如交叉核对外部来源或重新提问),二是纠正其对生成式AI工作原理的误解(如将模型视为“数据库”而非概率性推理系统),三是增强对模型行为特征(如自信表达掩盖错误)的认知敏感度。最终目标是构建包含验证协议、准确心智模型和行为意识在内的AI幻觉防范能力,从而提升AI辅助学习的安全性和有效性。

链接: https://arxiv.org/abs/2602.17671
作者: Abdulhadi Shoufan,Ahmad-Azmi-Abdelhamid Esmaeil
机构: Khalifa University (哈利法大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As students increasingly rely on large language models, hallucinations pose a growing threat to learning. To mitigate this, AI literacy must expand beyond prompt engineering to address how students should detect and respond to LLM hallucinations. To support this, we need to understand how students experience hallucinations, how they detect them, and why they believe they occur. To investigate these questions, we asked university students three open-ended questions about their experiences with AI hallucinations, their detection strategies, and their mental models of why hallucinations occur. Sixty-three students responded to the survey. Thematic analysis of their responses revealed that reported hallucination issues primarily relate to incorrect or fabricated citations, false information, overconfident but misleading responses, poor adherence to prompts, persistence in incorrect answers, and sycophancy. To detect hallucinations, students rely either on intuitive judgment or on active verification strategies, such as cross-checking with external sources or re-prompting the model. Students’ explanations for why hallucinations occur reflected several mental models, including notable misconceptions. Many described AI as a research engine that fabricates information when it cannot locate an answer in its “database.” Others attributed hallucinations to issues with training data, inadequate prompting, or the model’s inability to understand or verify information. These findings illuminate vulnerabilities in AI-supported learning and highlight the need for explicit instruction in verification protocols, accurate mental models of generative AI, and awareness of behaviors such as sycophancy and confident delivery that obscure inaccuracy. The study contributes empirical evidence for integrating hallucination awareness and mitigation into AI literacy curricula.

[HC-19] The Dark Side of Dark Mode – User behaviour rebound effects and consequences for digital energy consumption

【速读】:该论文试图解决的问题是:尽管深色模式(dark mode)被广泛推荐为降低显示设备能耗的有效手段,但其实际节能效果可能因用户行为变化而减弱甚至抵消。研究发现,用户在使用深色主题网页时可能会调高屏幕亮度,从而产生“反弹效应”(rebound effect),这削弱了深色模式本应带来的能源效率提升。解决方案的关键在于认识到内容配色方案与用户行为之间的复杂交互关系,并在可持续性指南和干预措施中纳入对用户行为的考量,以确保节能策略的实际有效性。

链接: https://arxiv.org/abs/2602.17670
作者: Zak Datson
机构: 未知
类目: Human-Computer Interaction (cs.HC); Performance (cs.PF)
备注: 3 pages (2 + references), 3 figures, 1 table. To be included in the proceedings of the 1st International Workshop on Low Carbon Computing (LOCO) 2024, December 3, 2024, Glasgow/Online

点击查看摘要

Abstract:User devices are the largest contributor to media-related global emissions. For web content, dark mode has been widely recommended as an energy-saving measure for certain display types. However, the energy savings achieved by dark mode may be undermined by user behaviour. This pilot study investigates the unintended consequences of dark mode adoption, revealing a rebound effect wherein users may increase display brightness when interacting with dark-themed web pages. This behaviour may negate the potential energy savings that dark mode offers. Our findings suggest that the energy efficiency benefits of dark mode are not as straightforward as commonly believed for display energy, and the interplay between content colour scheme and user behaviour must be carefully considered in sustainability guidelines and interventions.
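文中的反弹效应可以用一个极简的 OLED 式显示能耗模型示意:深色内容本身更省电,但若用户因此调高亮度,节省幅度会被部分抵消。注意:模型形式与系数 k 均为假设值,仅作说明,与论文的实际测量无关。

```python
def display_power(brightness, mean_pixel_luminance, k=1.0):
    """玩具模型: 显示功耗 ~ 亮度 × 内容平均亮度 (OLED 式, 纯示意)。"""
    return k * brightness * mean_pixel_luminance

# 浅色主题, 中等亮度
light = display_power(brightness=0.5, mean_pixel_luminance=0.9)
# 深色主题, 同样亮度: 显著省电
dark = display_power(brightness=0.5, mean_pixel_luminance=0.2)
# 深色主题但用户调高亮度: 反弹效应, 节省被部分抵消
dark_rebound = display_power(brightness=0.9, mean_pixel_luminance=0.2)
```

在该玩具模型下 dark < dark_rebound < light,即反弹削弱但未必完全抵消深色模式的收益;论文指出在某些情形下收益可能被完全抵消。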

[HC-20] Evaluating Text-based Conversational Agents for Mental Health: A Systematic Review of Metrics Methods and Usage Contexts

【速读】:该论文旨在解决当前文本交互式对话代理(Text-based Conversational Agents, CAs)在心理健康领域应用中评估方法碎片化的问题,其解决方案的关键在于通过系统性综述构建结构化的评估框架。研究基于PRISMA指南对三大数据库进行检索与筛选,纳入132项研究并采用双人编码确保一致性(Cohen’s kappa = 0.77–0.92),从指标(CA属性与用户结果)、方法(自动化分析、标准化量表与质性研究)和使用场景三个维度进行整合分析。研究发现当前评估存在对西方量表依赖性强、文化适配不足、样本小且短期、自动化指标与用户福祉关联弱等问题,进而提出应加强方法三角验证、提升时间维度严谨性,并注重测量公平性,从而为心理健康的对话代理提供可靠、安全且以用户为中心的评估基础。

链接: https://arxiv.org/abs/2602.17669
作者: Jiangtao Gong,Xiao Wen,Fengyi Tao,Xinqi Wang,Xixi Yang,Yangrong Tang
机构: Institute of AI Research, Tsinghua University (清华大学人工智能研究院)
类目: Human-Computer Interaction (cs.HC)
备注: 17 pages, 1 figures

点击查看摘要

Abstract:Text-based conversational agents (CAs) are increasingly used in mental health, yet evaluation practices remain fragmented. We conducted a PRISMA-guided systematic review (May-June 2024) across ACM Digital Library, Scopus, and PsycINFO. From 613 records, 132 studies were included, with dual-coder extraction achieving substantial agreement (Cohen’s kappa = 0.77-0.92). We synthesized evaluation approaches across three dimensions: metrics, methods, and usage contexts. Metrics were classified into CA-centric attributes (e.g., reliability, safety, empathy) and user-centric outcomes (experience, knowledge, psychological state, health behavior). Methods included automated analyses, standardized psychometric scales, and qualitative inquiry. Temporal designs ranged from momentary to follow-up assessments. Findings show reliance on Western-developed scales, limited cultural adaptation, predominance of small and short-term samples, and weak links between automated performance metrics and user well-being. We argue for methodological triangulation, temporal rigor, and equity in measurement. This review offers a structured foundation for reliable, safe, and user-centered evaluation of mental health CAs.
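文中双人编码一致性用 Cohen's kappa(0.77–0.92)衡量,即观察一致率对随机一致率的校正。下面是该统计量的一个最小实现示意(通用定义,非论文代码):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """两位评分者在同一批条目上的机会校正一致性。"""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # 观察一致率
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # 独立假设下的期望一致率: 各类别边缘概率乘积之和
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)
```

例如两位编码者在四个条目上有三处一致时,kappa 为 0.5,低于原始一致率 0.75,体现了对偶然一致的惩罚。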

[HC-21] Visual Interface Workflow Management System Strengthening Data Integrity and Project Tracking in Complex Processes

【速读】:该论文旨在解决传统业务流程管理中依赖手工记录和分散消息应用所导致的数据完整性受损及项目跟踪抽象化的问题。其解决方案的关键在于构建一个跨Web与移动平台的集成系统,采用MongoDB存储JSON格式数据,结合服务器端HTTP接口、Web前端界面以及React Native移动端技术,实现任务状态(待办-进行中-已完成)的可视化追踪,并通过颜色编码标签区分任务紧急程度,同时为管理者提供动态仪表盘以监控团队绩效,从而提升组织效率并确保工作流程的可追溯性。

链接: https://arxiv.org/abs/2602.17668
作者: Ömer Elri,Serkan Savaş
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 10th International Conference on Natural and Engineering Sciences

点击查看摘要

Abstract:Manual notes and scattered messaging applications used in managing business processes compromise data integrity and abstract project tracking. In this study, an integrated system that works simultaneously on web and mobile platforms has been developed to enable individual users and teams to manage their workflows with concrete data. The system architecture integrates MongoDB, which stores data in JSON format, this http URL this http URL on the server side, this http URL on the web interface, and React Native technologies on the mobile side. The system interface is designed around visual dashboards that track the status of tasks (To Do-In Progress-Done). The urgency of tasks is distinguished by color-coded labels, and dynamic graphics (Dashboard) have been created for managers to monitor team performance. The usability of the system was tested with a heterogeneous group of 10 people consisting of engineers, engineering students, public employees, branch managers, and healthcare personnel. In analyses conducted using a 5-point Likert scale, the organizational efficiency provided by the system compared to traditional methods was rated 4.90, while the visual dashboards achieved a perfect score of 5.00 with zero variance. Additionally, the ease of interface use was rated 4.65, and overall user satisfaction was calculated as 4.60. The findings show that the developed system simplifies complex work processes and provides a traceable digital working environment for Small and Medium-sized Enterprises and project teams.

计算机视觉

[CV-0] Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

【速读】:该论文旨在解决流式视频理解中因关键值缓存(key-value caching)机制导致的细粒度视觉信息丢失问题,尤其在密集视频流场景下,现有方法因特征编码方式导致查询-帧相似度随时间递增,使检索偏向于后期帧,从而影响视频问答(VQA)的准确性。解决方案的关键在于:一是提出自适应选择策略以减少token冗余并保留局部时空信息;二是设计无需训练的检索混合专家(retrieval mixture-of-experts)框架,利用外部模型更精准识别相关帧。通过上述改进,所提出的MemStream方法显著提升了多个基准测试上的性能,相较ReKV在Qwen2.5-VL-7B基础上分别实现了+8.0%、+8.5%和+2.4%的提升。

链接: https://arxiv.org/abs/2602.18434
作者: Vatsal Agarwal,Saksham Suri,Matthew Gwilliam,Pulkit Kumar,Abhinav Shrivastava
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: see this https URL

点击查看摘要

Abstract:Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
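其中"查询-帧相似度检索"这一基本环节可用余弦相似度的最小草图说明(通用检索示意,不包含论文提出的自适应 token 选择与检索专家混合机制):

```python
import numpy as np

def retrieve_frames(query, frame_feats, k=2):
    """从帧特征缓存中按余弦相似度取 top-k 帧索引 (示意实现)。"""
    q = query / np.linalg.norm(query)
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sims = f @ q                      # 每帧与查询的余弦相似度
    return np.argsort(-sims)[:k]     # 相似度降序取前 k 个
```

论文指出,现有方法的特征编码会使这类相似度分数随时间单调上升,从而使检索偏向后期帧;MemStream 的自适应选择正是针对这一偏差。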

[CV-1] SARAH: Spatially Aware Real-time Agent ic Humans

【速读】:该论文旨在解决当前虚拟代理(embodied agents)在VR、远程存在(telepresence)和数字人应用中缺乏空间感知能力的问题,即代理无法根据用户位置进行自然的姿态调整(如转向用户、响应移动、保持合理视线接触)。解决方案的关键在于提出一种首个实时、完全因果的时空感知对话运动生成方法,其核心创新包括:1)基于因果Transformer的变分自编码器(VAE),采用交错潜变量令牌设计以支持流式推理;2)条件于用户轨迹与音频的流匹配(flow matching)模型,实现语音对齐手势与空间朝向的联合建模;3)引入眼动评分机制与无分类器引导(classifier-free guidance),使模型从数据中学习自然的空间对齐模式,同时允许用户在推理时动态调节注视强度。该方法在Embody 3D数据集上达到超过300 FPS的实时性能,显著优于非因果基线,并成功部署于实际VR系统中。

链接: https://arxiv.org/abs/2602.18432
作者: Evonne Ng,Siwei Zhang,Zhang Chen,Michael Zollhoefer,Alexander Richard
机构: Meta Reality Labs(元宇宙现实实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user’s position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS – 3x faster than non-causal baselines – while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see this https URL for details.
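文中用于调节注视强度的无分类器引导(classifier-free guidance)在推理时将条件与无条件预测按引导系数线性外推。下面是 CFG 的通用形式示意(非论文具体实现):

```python
import numpy as np

def cfg_velocity(v_uncond, v_cond, scale):
    """Classifier-free guidance 组合:
    scale=0 → 纯无条件; scale=1 → 纯条件; scale>1 → 加强条件(此处为注视)影响。"""
    return v_uncond + scale * (v_cond - v_uncond)
```

这正是论文"解耦学习与控制"的机制来源:模型从数据中学习自然的空间对齐,而用户通过调节 scale 在推理时改变目光接触强度。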

[CV-2] The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning

【速读】:该论文旨在解决自主生成模型(如Equilibrium Matching和blind diffusion)在无噪声水平条件下的优化稳定性问题,即当噪声水平被视为随机变量时,如何解释其优化的潜在能量景观,并确保网络在数据流形附近保持稳定——因为传统梯度在此处通常发散。解决方案的关键在于提出边际能量(Marginal Energy) $ E_\text{marg}(\mathbf{u}) = -\log p(\mathbf{u}) $,其中 $ p(\mathbf{u}) $ 是对未知噪声水平先验分布积分后的观测数据边缘密度。作者证明,自主模型的生成过程本质上是该边际能量上的黎曼梯度流(Riemannian gradient flow)。通过一种新颖的相对能量分解,他们揭示了原始边际能量沿数据流形法向存在 $ 1/t^p $ 奇异性,但学习到的时间不变场隐式引入局部共形度量(conformal metric),精确抵消几何奇异性,将无限深势阱转化为稳定吸引子;同时指出速度参数化优于噪声预测参数化,因其满足有界增益条件,可吸收后验不确定性为平滑几何漂移,而后者存在Jensen Gap导致估计误差放大,引发灾难性失败。

链接: https://arxiv.org/abs/2602.18428
作者: Mojtaba Sahraee-Ardakan,Mauricio Delbracio,Peyman Milanfar
机构: Google(谷歌)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Autonomous (noise-agnostic) generative models, such as Equilibrium Matching and blind diffusion, challenge the standard paradigm by learning a single, time-invariant vector field that operates without explicit noise-level conditioning. While recent work suggests that high-dimensional concentration allows these models to implicitly estimate noise levels from corrupted observations, a fundamental paradox remains: what is the underlying landscape being optimized when the noise level is treated as a random variable, and how can a bounded, noise-agnostic network remain stable near the data manifold where gradients typically diverge? We resolve this paradox by formalizing Marginal Energy, E_\textmarg(\mathbfu) = -\log p(\mathbfu) , where p(\mathbfu) = \int p(\mathbfu|t)p(t)dt is the marginal density of the noisy data integrated over a prior distribution of unknown noise levels. We prove that generation using autonomous models is not merely blind denoising, but a specific form of Riemannian gradient flow on this Marginal Energy. Through a novel relative energy decomposition, we demonstrate that while the raw Marginal Energy landscape possesses a 1/t^p singularity normal to the data manifold, the learned time-invariant field implicitly incorporates a local conformal metric that perfectly counteracts the geometric singularity, converting an infinitely deep potential well into a stable attractor. We also establish the structural stability conditions for sampling with autonomous models. We identify a ``Jensen Gap’’ in noise-prediction parameterizations that acts as a high-gain amplifier for estimation errors, explaining the catastrophic failure observed in deterministic blind models. Conversely, we prove that velocity-based parameterizations are inherently stable because they satisfy a bounded-gain condition that absorbs posterior uncertainty into a smooth geometric drift.
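文中定义的边际能量 $E_\text{marg}(\mathbf{u}) = -\log p(\mathbf{u})$,其中 $p(\mathbf{u}) = \int p(\mathbf{u}|t)p(t)\,dt$,可用一个一维玩具例子数值示意:假设数据为原点处的点质量,噪声水平 $t$ 服从离散均匀先验(噪声水平集合为假设值,与论文实验无关):

```python
import numpy as np

def marginal_energy(u, data=0.0, noise_levels=(0.1, 0.5, 1.0)):
    """E_marg(u) = -log p(u), p(u) 为对未知噪声水平 t (均匀先验) 积分后的
    边缘密度; 此处 p(u|t) 为以 data 为中心、标准差 t 的高斯 (一维示意)。"""
    densities = [np.exp(-(u - data) ** 2 / (2 * t ** 2)) / (t * np.sqrt(2 * np.pi))
                 for t in noise_levels]
    return -np.log(np.mean(densities))
```

可以验证边际能量在数据点处最低、随偏离距离增大而上升,即数据流形是该能量景观的吸引子;论文的核心结论是学习到的时间不变场会隐式引入共形度量,使流形附近的奇异性被抵消。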

[CV-3] CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在复杂室内环境中进行导航决策时,未能充分考虑代理(agent)物理能力限制的问题。现有方法通常忽略移动约束(如能否跨越台阶或通过狭窄通道),导致模型在真实场景中表现不稳定。解决方案的关键在于提出一个名为Capability-Conditioned Navigation (CapNav) 的新基准,其核心是将导航任务与具体代理的物理维度、移动能力和环境交互能力相绑定,从而系统评估VLMs在不同能力约束下的导航性能。该基准包含5类代表性代理、45个真实室内场景、473个导航任务及2365个问答对,揭示了当前VLMs在面对空间推理挑战时的显著性能下降,为未来具备能力感知的具身空间推理研究提供了方向。

链接: https://arxiv.org/abs/2602.18424
作者: Xia Su,Ruiqi Chen,Benlin Liu,Jingwei Ma,Zonglin Di,Ranjay Krishna,Jon Froehlich
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent’s mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent’s specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test if VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLM’s navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning on spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at this https URL

[CV-4] Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

【速读】:该论文旨在解决扩展现实(Extended Reality, XR)中生成式视频世界模型缺乏对用户真实世界运动追踪的精细控制问题,现有模型仅支持文本或键盘等粗粒度控制信号,难以实现具身交互(embodied interaction)。其解决方案的关键在于提出一种以人为中心的视频世界模型,该模型同时以追踪到的头部姿态(head pose)和手部关节级姿态(joint-level hand poses)作为条件输入,并优化扩散Transformer的条件机制,从而实现对虚拟环境中手部与物体之间精细操作的有效建模。通过训练一个双向视频扩散模型教师网络并将其蒸馏为因果、可交互的系统,最终生成以第一人称视角呈现的虚拟环境,实验证明该方法显著提升了用户在任务中的表现及对动作控制感的主观感知。

链接: https://arxiv.org/abs/2602.18422
作者: Linxi Xie,Lisong C. Sun,Ashley Neall,Tong Wu,Shengqu Cai,Gordon Wetzstein
机构: Stanford University (斯坦福大学); NYU Shanghai (纽约大学上海分校); UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page here: this https URL

点击查看摘要

Abstract:Extended reality (XR) demands generative models that respond to users’ tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand–object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived amount of control over the performed actions compared with relevant baselines.

[CV-5] Latent Equivariant Operators for Robust Object Recognition: Promise and Challenges

【速读】:该论文试图解决深度学习在计算机视觉中对经历群对称变换(group-symmetric transformations)的物体识别困难的问题,尤其是当这些变换在训练阶段罕见时,如异常姿态、尺度、位置或其组合。传统神经网络难以泛化到此类分布外(out-of-distribution)样本,而经典等变神经网络虽能处理对称变换但需事先已知变换信息。论文提出的解决方案关键在于:通过从对称变换示例中自动学习潜在空间中的等变算子(equivariant operators),从而构建无需先验知识即可实现跨对称变换泛化的架构。实验使用旋转和平移噪声MNIST数据集验证了该方法在分布外分类任务上的有效性,展示了其超越传统与等变网络的潜力,尽管在扩展至更复杂数据集时仍面临挑战。

链接: https://arxiv.org/abs/2602.18406
作者: Minh Dinh,Stéphane Deny
机构: Dartmouth College (达特茅斯学院); Aalto University (阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the successes of deep learning in computer vision, difficulties persist in recognizing objects that have undergone group-symmetric transformations rarely seen during training-for example objects seen in unusual poses, scales, positions, or combinations thereof. Equivariant neural networks are a solution to the problem of generalizing across symmetric transformations, but require knowledge of transformations a priori. An alternative family of architectures proposes to learn equivariant operators in a latent space from examples of symmetric transformations. Here, using simple datasets of rotated and translated noisy MNIST, we illustrate how such architectures can successfully be harnessed for out-of-distribution classification, thus overcoming the limitations of both traditional and equivariant networks. While conceptually enticing, we discuss challenges ahead on the path of scaling these architectures to more complex datasets.

[CV-6] Self-Aware Object Detection via Degradation Manifolds

【速读】:该论文旨在解决目标检测模型在非理想成像条件下(如模糊、噪声、压缩、恶劣天气或分辨率变化)可能无声失效的问题,即传统检测器缺乏对输入是否处于其正常工作范围的自我评估能力。为此,作者提出一种基于退化流形(degradation manifolds)的自感知目标检测框架,其核心在于通过多层对比学习(multi-layer contrastive learning)构建一个轻量级嵌入头,将图像特征空间按退化类型与严重程度进行几何组织,而非依赖语义内容。关键创新点在于:无需退化标签或显式密度建模即可学习到结构化的退化表示,并以干净样本嵌入估计出的“原始原型”作为参考点,从而通过几何偏移量获得独立于检测置信度的图像级退化信号,实现对检测结果可靠性的内在评估。

链接: https://arxiv.org/abs/2602.18394
作者: Stefan Becker,Simon Weiss,Wolfgang Hübner,Michael Arens
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object detectors achieve strong performance under nominal imaging conditions but can fail silently when exposed to blur, noise, compression, adverse weather, or resolution changes. In safety-critical settings, it is therefore insufficient to produce predictions without assessing whether the input remains within the detector’s nominal operating regime. We refer to this capability as self-aware object detection. We introduce a degradation-aware self-awareness framework based on degradation manifolds, which explicitly structure a detector’s feature space according to image degradation rather than semantic content. Our method augments a standard detection backbone with a lightweight embedding head trained via multi-layer contrastive learning. Images sharing the same degradation composition are pulled together, while differing degradation configurations are pushed apart, yielding a geometrically organized representation that captures degradation type and severity without requiring degradation labels or explicit density modeling. To anchor the learned geometry, we estimate a pristine prototype from clean training embeddings, defining a nominal operating point in representation space. Self-awareness emerges as geometric deviation from this reference, providing an intrinsic, image-level signal of degradation-induced shift that is independent of detection confidence. Extensive experiments on synthetic corruption benchmarks, cross-dataset zero-shot transfer, and natural weather-induced distribution shifts demonstrate strong pristine-degraded separability, consistent behavior across multiple detector architectures, and robust generalization under semantic shift. These results suggest that degradation-aware representation geometry provides a practical and detector-agnostic foundation. 

[CV-7] G-LoG Bi-filtration for Medical Image Classification

【速读】:该论文旨在解决如何在拓扑数据分析(Topological Data Analysis, TDA)中构建有效的多参数滤波结构,以更准确地提取医学图像中的拓扑与几何特征。其核心挑战在于如何设计一个稳定且能捕捉多尺度边界信息的滤波机制,从而提升持久性模块对复杂数据的表征能力。解决方案的关键在于提出一种新的双参数滤波结构——G-LoG(Gaussian-Laplacian of Gaussian)双滤波,该方法利用拉普拉斯高斯算子(Laplacian of Gaussian, LoG)增强图像边界信息,并将体积图像建模为有界函数,进而证明由此生成的持久性模块在最大范数下具有稳定性。实验表明,基于该双滤波生成的拓扑特征,仅需简单多层感知机(MLP)即可达到与复杂深度学习模型相当的分类性能,验证了其有效性与实用性。

链接: https://arxiv.org/abs/2602.18329
作者: Qingsong Wang,Jiaxing He,Bingzhe Hou,Tieru Wu,Yang Cao,Cailing Yao
机构: Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT)
备注:

点击查看摘要

Abstract:Building practical filtrations on objects to detect topological and geometric features is an important task in the field of Topological Data Analysis (TDA). In this paper, leveraging the ability of the Laplacian of Gaussian operator to enhance the boundaries of medical images, we define the G-LoG (Gaussian-Laplacian of Gaussian) bi-filtration to generate features more suitable for multi-parameter persistence modules. By modeling volumetric images as bounded functions, we prove that the interleaving distance on the persistence modules obtained from our bi-filtrations is stable with respect to the maximum norm of the bounded functions. Finally, we conduct experiments on the MedMNIST dataset, comparing our bi-filtration against single-parameter filtration and the established deep learning baselines, including Google AutoML Vision, ResNet, AutoKeras and auto-sklearn. Experiments results demonstrate that our bi-filtration significantly outperforms single-parameter filtration. Notably, a simple Multi-Layer Perceptron (MLP) trained on the topological features generated by our bi-filtration achieves performance comparable to complex deep learning models trained on the original dataset.
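双参数滤波的核心是一个随两个阈值单调增长的子水平集结构。假设已分别计算出每个像素的高斯平滑值与 LoG 响应,则参数网格点 (a, b) 处的滤波掩码可示意如下(阈值方向与论文的具体约定可能不同,仅为说明双参数单调性):

```python
import numpy as np

def bifiltration_mask(gauss_vals, log_vals, a, b):
    """子水平集双滤波: 当像素的高斯平滑值 <= a 且 LoG 响应 <= b 时,
    该像素在格点 (a, b) 进入复形 (示意)。"""
    return (gauss_vals <= a) & (log_vals <= b)
```

对任意 a ≤ a'、b ≤ b',掩码满足包含关系 mask(a, b) ⊆ mask(a', b'),这正是多参数持久性模块所要求的滤波结构。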

[CV-8] Unifying Color and Lightness Correction with View-Adaptive Curve Adjustment for Robust 3D Novel View Synthesis CVPR2025

【速读】:该论文旨在解决多视角图像采集中因光照变化和相机成像管线限制导致的光度不一致与色度不一致问题,这些问题会破坏现代三维新视角合成(Novel View Synthesis, NVS)方法(如Neural Radiance Fields 和 3D Gaussian Splatting, 3DGS)所依赖的光度一致性假设,从而降低重建与渲染质量。解决方案的关键在于提出Luminance-GS++框架,其核心创新是结合全局视图自适应亮度调整与局部像素级残差精修机制,实现精确的颜色校正;同时设计无监督目标函数,联合约束亮度修正及多视角几何与光度一致性,从而在保持3DGS显式表示不变的前提下显著提升重建保真度并维持实时渲染效率。

链接: https://arxiv.org/abs/2602.18322
作者: Ziteng Cui,Shuhong Liu,Xiaoyu Dong,Xuangeng Chu,Lin Gu,Ming-Hsuan Yang,Tatsuya Harada
机构: University of Tokyo (东京大学); Tohoku University (东北大学); University of California at Merced (加州大学默塞德分校); Google DeepMind (谷歌深度心智); RIKEN AIP (理化学研究所人工智能项目)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal extension version of CVPR 2025 paper: arXiv:2504.01503

点击查看摘要

Abstract:High-quality image acquisition in real-world environments remains challenging due to complex illumination variations and inherent limitations of camera imaging pipelines. These issues are exacerbated in multi-view capture, where differences in lighting, sensor responses, and image signal processor (ISP) configurations introduce photometric and chromatic inconsistencies that violate the assumptions of photometric consistency underlying modern 3D novel view synthesis (NVS) methods, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), leading to degraded reconstruction and rendering quality. We propose Luminance-GS++, a 3DGS-based framework for robust NVS under diverse illumination conditions. Our method combines a globally view-adaptive lightness adjustment with a local pixel-wise residual refinement for precise color correction. We further design unsupervised objectives that jointly enforce lightness correction and multi-view geometric and photometric consistency. Extensive experiments demonstrate state-of-the-art performance across challenging scenarios, including low-light, overexposure, and complex luminance and chromatic variations. Unlike prior approaches that modify the underlying representation, our method preserves the explicit 3DGS formulation, improving reconstruction fidelity while maintaining real-time rendering efficiency.
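"全局视图自适应亮度曲线 + 局部像素级残差精修"的组合可用如下草图说明。此处以 gamma 曲线代替论文中学习到的视图自适应曲线,残差项对应像素级颜色修正,均为示意:

```python
import numpy as np

def adjust_lightness(img, gamma, residual=None):
    """全局曲线调整 (此处用 gamma 曲线作为假设的曲线形式)
    加可选的像素级残差精修, 输出裁剪到 [0, 1]。示意实现。"""
    out = np.clip(img, 0.0, 1.0) ** gamma   # gamma < 1 提亮, > 1 压暗
    if residual is not None:
        out = np.clip(out + residual, 0.0, 1.0)
    return out
```

例如 gamma=0.5 可将低光图像整体提亮,而残差图负责无法被单条全局曲线覆盖的局部色度差异。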

[CV-9] Diff2DGS: Reliable Reconstruction of Occluded Surgical Scenes via 2D Gaussian Splatting

【速读】:该论文旨在解决机器人外科手术场景中可变形组织的实时三维重建问题,特别是针对遮挡区域重建质量差和深度精度难以评估的挑战。现有方法如EndoNeRF和StereoMIS缺乏真实三维地面真值(ground truth),导致深度准确性无法可靠衡量;同时,传统基于高斯点绘(Gaussian Splatting, GS)的方法在仪器遮挡区域难以恢复高质量几何结构。其解决方案的关键在于提出Diff2DGS——一个两阶段框架:第一阶段采用带时间先验的扩散视频模块(diffusion-based video module)对遮挡组织进行时空一致性的图像修复;第二阶段将二维高斯点绘(2D Gaussian Splatting, 2DGS)与可学习形变模型(Learnable Deformation Model, LDM)结合,以捕捉动态组织形变并重建精确的解剖几何结构。此外,研究通过SCARED数据集上的定量深度精度分析验证了该方法优于当前最优方案,在图像保真度(PSNR达38.02 dB)和几何准确性之间实现了更优平衡。

链接: https://arxiv.org/abs/2602.18314
作者: Tianyi Song,Danail Stoyanov,Evangelos Mazomenos,Francisco Vasconcelos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Real-time reconstruction of deformable surgical scenes is vital for advancing robotic surgery, improving surgeon guidance, and enabling automation. Recent methods achieve dense reconstructions from da Vinci robotic surgery videos, with Gaussian Splatting (GS) offering real-time performance via graphics acceleration. However, reconstruction quality in occluded regions remains limited, and depth accuracy has not been fully assessed, as benchmarks like EndoNeRF and StereoMIS lack 3D ground truth. We propose Diff2DGS, a novel two-stage framework for reliable 3D reconstruction of occluded surgical scenes. In the first stage, a diffusion-based video module with temporal priors inpaints tissue occluded by instruments with high spatial-temporal consistency. In the second stage, we adapt 2D Gaussian Splatting (2DGS) with a Learnable Deformation Model (LDM) to capture dynamic tissue deformation and anatomical geometry. We also extend evaluation beyond prior image-quality metrics by performing quantitative depth accuracy analysis on the SCARED dataset. Diff2DGS outperforms state-of-the-art approaches in both appearance and geometry, reaching 38.02 dB PSNR on EndoNeRF and 34.40 dB on StereoMIS. Furthermore, our experiments demonstrate that optimizing for image quality alone does not necessarily translate into optimal 3D reconstruction accuracy. To address this, we further optimize the depth quality of the reconstructed 3D results, ensuring more faithful geometry in addition to high-fidelity appearance.
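文中报告的 38.02 dB、34.40 dB 等图像保真度指标为峰值信噪比(PSNR),其标准定义与计算如下:

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """峰值信噪比 (dB): 10 * log10(MAX^2 / MSE)。"""
    mse = np.mean((img - ref) ** 2)
    return float(10 * np.log10(max_val ** 2 / mse))
```

论文的一个要点正是:单独优化此类图像质量指标并不必然带来最优的三维几何精度,因此还需对深度质量做额外优化。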

[CV-10] Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation

【速读】:该论文旨在解决时尚图像生成中如何有效融合文本与草图模态以同时保持全局结构一致性与局部语义细节的问题。现有方法在利用文本指导局部属性(如材质、颜色)时,常忽视草图所蕴含的结构信息,导致生成结果偏离原始草图的布局和轮廓。解决方案的关键在于提出LOcalized Text and Sketch with multi-level guidance (LOTS)框架,其核心创新包括:1)多级条件编码阶段(Multi-level Conditioning Stage),在共享潜在空间中独立编码局部特征并维持全局结构协调;2)扩散对齐引导阶段(Diffusion Pair Guidance),通过注意力机制在扩散模型的多步去噪过程中整合局部与全局条件信息。该方法显著提升了生成图像对草图结构的忠实度及局部语义的丰富性,实验表明优于当前最先进方法。

链接: https://arxiv.org/abs/2602.18309
作者: Ziyue Liu,Davide Talon,Federico Girella,Zanxi Ruan,Mattia Mondo,Loris Bazzani,Yiming Wang,Marco Cristani
机构: University of Verona (维罗纳大学); Polytechnic Institute of Turin (都灵理工学院); Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会); Reykjavik University (雷克雅未克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining textual and visual modalities requires adherence to the sketch visual structure when leveraging the guidance of localized attributes from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Sketchy provides high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an “in the wild” split with non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, achieving improvement over state-of-the-art. The dataset, platform, and code are publicly available.

[CV-11] DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control AAAI2026

【速读】:该论文旨在解决多实例生成(Multi-Instance Generation)中细粒度语义理解不足的问题,尤其是在处理复杂文本描述时难以精确控制实例间空间布局与属性绑定的挑战。解决方案的关键在于提出DEIG框架,其核心由两个模块构成:一是实例细节提取器(Instance Detail Extractor, IDE),将文本编码器输出的嵌入转化为紧凑且实例感知的表示;二是细节融合模块(Detail Fusion Module, DFM),通过基于实例的掩码注意力机制防止属性在不同实例间泄露。这一设计使模型能够生成视觉一致、精准匹配局部化文本描述的多实例场景,并支持细粒度监督与可控生成。

链接: https://arxiv.org/abs/2602.18282
作者: Shiyan Du,Conghan Yue,Xinyu Cheng,Dongyu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.
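摘要中提到的基于实例的掩码注意力(instance-based masked attention)是防止属性跨实例泄露的通用做法,可用如下numpy极简草图示意(按摘要描述给出的假设性实现,并非论文源码;函数名与接口均为示例):每个查询token只允许聚合同一实例ID的value token。

```python
import numpy as np

def masked_instance_attention(Q, K, V, inst_q, inst_kv):
    # Scaled dot-product attention restricted to same-instance tokens:
    # a query may only aggregate value tokens sharing its instance id,
    # which blocks attribute leakage across instances.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(inst_q[:, None] == inst_kv[None, :], scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V
```

若不加掩码,不同实例的属性token会在softmax聚合中相互混入(如颜色、材质串扰);将注意力权重限制在实例内部即可在机制层面消除这类泄露。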

[CV-12] RoEL: Robust Event-based 3D Line Reconstruction

【速读】:该论文旨在解决事件相机(event camera)在实际部署中因稀疏性与噪声特性导致的3D场景重建和位姿估计性能不稳定的问题,尤其是在跨域差异显著、存在投影畸变和深度模糊的情况下。解决方案的关键在于提出一种稳健的多时间切片线特征提取算法,能够从事件流中稳定追踪具有不同外观的线条结构,并结合几何约束代价函数对3D线图和相机位姿进行优化,从而消除投影失真并提升精度。该方法生成的紧凑3D线图可适配多种观测模态(如点云或图像),且在多个数据集上验证了其在事件驱动建图与位姿精化中的显著性能提升,具备良好的泛化能力与多模态兼容性。

链接: https://arxiv.org/abs/2602.18258
作者: Gwangtak Bae,Jaeho Shin,Seunggu Kang,Junho Kim,Ayoung Kim,Young Min Kim
机构: Seoul National University (首尔国立大学); University of Michigan (密歇根大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE Transactions on Robotics (T-RO)

点击查看摘要

Abstract:Event cameras in motion tend to detect object boundaries or texture edges, which produce lines of brightness changes, especially in man-made environments. While lines can constitute a robust intermediate representation that is consistently observed, the sparse nature of lines may lead to drastic deterioration with minor estimation errors. Only a few previous works, often accompanied by additional sensors, utilize lines to compensate for the severe domain discrepancies of event sensors along with unpredictable noise characteristics. We propose a method that can stably extract tracks of varying appearances of lines using a clever algorithmic process that observes multiple representations from various time slices of events, compensating for potential adversaries within the event data. We then propose geometric cost functions that can refine the 3D line maps and camera poses, eliminating projective distortions and depth ambiguities. The 3D line maps are highly compact and can be equipped with our proposed cost function, which can be adapted for any observations that can detect and extract line structures or projections of them, including 3D point cloud maps or image observations. We demonstrate that our formulation is powerful enough to exhibit a significant performance boost in event-based mapping and pose refinement across diverse datasets, and can be flexibly applied to multimodal scenarios. Our results confirm that the proposed line-based formulation is a robust and effective approach for the practical deployment of event-based perceptual modules. Project page: this https URL

[CV-13] On the Adversarial Robustness of Discrete Image Tokenizers

【速读】:该论文旨在解决离散图像分词器(discrete image tokenizer)在多模态系统中对对抗攻击的脆弱性问题,这一问题此前尚未被研究。其核心解决方案是提出一种基于无监督对抗训练的防御策略:通过在不更新其他模型组件的前提下,仅对主流分词器进行无监督对抗微调,从而显著提升其对无监督和端到端监督攻击的鲁棒性,并具有良好泛化能力至未见任务与数据。该方法无需标注数据,具备应用广泛性和实用性,强调了分词器鲁棒性在下游任务中的关键作用,为构建安全的多模态基础模型提供了重要进展。

链接: https://arxiv.org/abs/2602.18252
作者: Rishika Bhagwatkar,Irina Rish,Nicolas Flammarion,Francesco Croce
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. Ours being the first work studying this topic, we first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. While unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, our approach can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.
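摘要中"扰动分词器提取的特征"这类攻击,其目标函数通常是最大化特征空间位移。下面用一个玩具线性特征提取器给出PGD风格的numpy草图(W、步长、轮数均为示例假设;对真实离散分词器需借助自动微分框架求梯度,此处仅示意优化目标与约束):

```python
import numpy as np

def feature_attack(x, W, eps=0.1, steps=10, lr=0.02, rng=None):
    # PGD-style attack on a toy linear feature extractor f(x) = W @ x:
    # maximize the L2 feature shift ||W(x+d) - W x|| subject to the
    # L-infinity constraint ||d||_inf <= eps.
    rng = np.random.default_rng(0) if rng is None else rng
    d = rng.uniform(-eps, eps, size=x.shape)  # random start inside the ball
    for _ in range(steps):
        g = W.T @ (W @ d)                     # gradient of 0.5*||W d||^2 w.r.t. d
        d = np.clip(d + lr * np.sign(g), -eps, eps)
    return x + d
```

位移一旦超过码本中相邻code的距离,最近邻量化就会翻转所提取的token,这正是摘要所述攻击的效果。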

[CV-14] A Self-Supervised Approach on Motion Calibration for Enhancing Physical Plausibility in Text-to-Motion

【速读】:该论文旨在解决文本到人体运动生成模型中同时保障语义一致性和物理合理性的问题,尤其是针对生成运动中存在的物理不现实现象(如脚部悬浮)进行优化。解决方案的关键在于提出一种无需复杂物理建模的自监督数据驱动后处理模块——畸变感知运动校准器(Distortion-aware Motion Calibrator, DMC),其通过学习从人为引入畸变的运动序列与原始文本描述中恢复出物理合理且语义一致的运动轨迹,从而在不改变原生成模型结构的前提下显著提升运动的真实性与合理性。

链接: https://arxiv.org/abs/2602.18199
作者: Gahyeon Shim,Soogeun Park,Hyemin Ahn
机构: Artificial Intelligence Graduate School (AIGS), Ulsan National Institute of Science and Technology (UNIST); Department of Electrical Engineering (EE), Pohang University of Science and Technology (POSTECH)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating semantically aligned human motion from textual descriptions has made rapid progress, but ensuring both semantic and physical realism in motion remains a challenge. In this paper, we introduce the Distortion-aware Motion Calibrator (DMC), a post-hoc module that refines physically implausible motions (e.g., foot floating) while preserving semantic consistency with the original textual description. Rather than relying on complex physical modeling, we propose a self-supervised and data-driven approach, whereby DMC learns to obtain physically plausible motions when an intentionally distorted motion and the original textual descriptions are given as inputs. We evaluate DMC as a post-hoc module to improve motions obtained from various text-to-motion generation models and demonstrate its effectiveness in improving physical plausibility while enhancing semantic consistency. The experimental results show that DMC reduces FID score by 42.74% on T2M and 13.20% on T2M-GPT, while also achieving the highest R-Precision. When applied to high-quality models like MoMask, DMC improves the physical plausibility of motions by reducing penetration by 33.0% as well as adjusting floating artifacts closer to the ground-truth reference. These results highlight that DMC can serve as a promising post-hoc motion refinement framework for any kind of text-to-motion models by incorporating textual semantics and physical plausibility.

[CV-15] BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards AAAI2026

【速读】:该论文旨在解决短视频平台中商业广告内容日益复杂的多模态欺骗性问题,即广告通过视觉、语音与字幕等多模态信息的协同误导用户,传统社区安全过滤机制难以有效识别和管控。解决方案的关键在于提出BLM-Guard框架,其核心创新包括:基于规则的思维链(Chain-of-Thought, CoT)数据合成管道,用于低成本生成结构化场景描述、推理链条与标签;融合因果一致性与政策合规性的强化学习奖励机制,提升模型判别准确性与一致性;以及多任务架构对模态内操纵(如夸张图像)与跨模态不一致(如字幕-语音漂移)的联合建模,从而显著增强模型在真实场景下的鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2602.18193
作者: Yiran Yang,Zhaowei Liu,Yuan Yuan,Yukun Song,Xiong Ma,Yinghao Song,Xiangji Zeng,Lu Sun,Yulu Wang,Hai Zhou,Shuai Cui,Zhaohan Gong,Jiefei Zhang
机构: Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 3 figures. To appear in AAAI 2026

点击查看摘要

Abstract:Short-video platforms now host vast multimodal ads whose deceptive visuals, speech and subtitles demand finer-grained, policy-driven moderation than community safety filters. We present BLM-Guard, a content-audit framework for commercial ads that fuses Chain-of-Thought reasoning with rule-based policy principles and a critic-guided reward. A rule-driven ICoT data-synthesis pipeline jump-starts training by generating structured scene descriptions, reasoning chains and labels, cutting annotation costs. Reinforcement learning then refines the model using a composite reward balancing causal coherence with policy adherence. A multitask architecture models intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-speech drift), boosting robustness. Experiments on real short-video ads show BLM-Guard surpasses strong baselines in accuracy, consistency and generalization.

[CV-16] Evaluating Graphical Perception Capabilities of Vision Transformers

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在图形感知任务中的表现尚未被充分研究的问题,尤其关注其是否具备类似人类的图形感知能力。现有研究表明卷积神经网络(Convolutional Neural Networks, CNNs)在图像任务中表现出良好的图形感知性能,但ViT在可视化领域的感知对齐性仍不明确。解决方案的关键在于通过受Cleveland和McGill经典图形感知研究启发的控制实验,系统地对比ViT、CNN与人类参与者在基础视觉判断任务中的表现,从而揭示ViT在图形感知方面的局限性,并为未来在可视化系统和图形感知建模中应用ViT提供关键洞见。

链接: https://arxiv.org/abs/2602.18178
作者: Poonam Poonam,Pere-Pau Vázquez,Timo Ropinski
机构: Viscom Group, Institute of Media Informatics, Ulm University (乌尔姆大学媒体信息研究所); ViRVIG Group, Universitat Politècnica de Catalunya - BarcelonaTech (加泰罗尼亚理工大学-巴塞罗那技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformers, ViTs, have emerged as a powerful alternative to convolutional neural networks, CNNs, in a variety of image-based tasks. While CNNs have previously been evaluated for their ability to perform graphical perception tasks, which are essential for interpreting visualizations, the perceptual capabilities of ViTs remain largely unexplored. In this work, we investigate the performance of ViTs in elementary visual judgment tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings. Inspired by their study, we benchmark ViTs against CNNs and human participants in a series of controlled graphical perception tasks. Our results reveal that, although ViTs demonstrate strong performance in general vision tasks, their alignment with human-like graphical perception in the visualization domain is limited. This study highlights key perceptual gaps and points to important considerations for the application of ViTs in visualization systems and graphical perceptual modeling.

[CV-17] OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models

【速读】:该论文旨在解决当前视觉语言模型(Visual-Language Models, VLMs)在处理分布外(out-of-distribution, OOD)数据时缺乏有效评估基准的问题。现有研究多假设训练数据满足独立同分布(independent and identically distributed, IID)条件,但在真实场景中(如自动驾驶或医疗辅助),这种假设往往不成立,且对OOD对象的处理不当可能带来安全风险。为此,作者提出OODBench——一个以自动化为主、仅需少量人工验证的基准构建与评估框架,包含4万组实例级OOD实例-类别配对,并引入基于“基础到进阶”提示问题序列的自动评估指标,以更全面地衡量VLM在不同难度任务下对OOD数据的鲁棒性表现。其关键创新在于构建了大规模、结构化且可扩展的OOD测试集,并设计了一种能反映模型逐步推理能力的自动化评估机制。

链接: https://arxiv.org/abs/2602.18094
作者: Ling Lin,Yang Bai,Heng Su,Congcong Zhu,Yaoxing Wang,Yang Zhou,Huazhu Fu,Jingrun Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 54 pages, 21 figures

点击查看摘要

Abstract:Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs in response to OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to assess the impact of OOD data on questions of varying difficulty more fully. Lastly, we summarize substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data.

[CV-18] Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers

【速读】:该论文旨在解决扩散模型(Diffusion Models)中基于DiT(Diffusion Transformers)架构的图像与视频生成任务因迭代去噪过程带来的高计算开销问题。现有无需训练的加速方法依赖于特征缓存与复用,但其假设时间稳定性可能导致潜在空间漂移(latent drift)和视觉质量下降。解决方案的关键在于提出PrediT框架,将特征预测建模为线性多步问题,利用经典线性多步法从历史信息中预测未来模型输出,并引入校正器(corrector)在高动态区域激活以防止误差累积;同时设计动态步长调制机制,通过监测特征变化率自适应调整预测范围,从而在显著降低延迟(最高达5.54倍)的同时保持生成质量。

链接: https://arxiv.org/abs/2602.18093
作者: Hanshuai Cui,Zhiqing Tang,Qianli Ma,Zhi Yao,Weijia Jia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free acceleration methods rely on feature caching and reuse under the assumption of temporal stability. However, reusing features for multiple steps may lead to latent drift and visual degradation. We observe that model outputs evolve smoothly along much of the diffusion trajectory, enabling principled predictions rather than naive reuse. Based on this insight, we propose \textbfPrediT, a training-free acceleration framework that formulates feature prediction as a linear multistep problem. We employ classical linear multistep methods to forecast future model outputs from historical information, combined with a corrector that activates in high-dynamics regions to prevent error accumulation. A dynamic step modulation mechanism adaptively adjusts the prediction horizon by monitoring the feature change rate. Together, these components enable substantial acceleration while preserving generation fidelity. Extensive experiments validate that our method achieves up to 5.54\times latency reduction across various DiT-based image and video generation models, while incurring negligible quality degradation.
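摘要所述"线性多步特征预测"可以理解为对缓存的历史模型输出做等距多项式外推;以下为示意性numpy草图(函数名与阶数选择为示例假设,非论文实现),并附一个可用于"动态步长调制"的相对变化率指标:变化率小时安全跳过真实前向计算,变化率大时触发校正。

```python
import numpy as np

def forecast_next(history):
    # Polynomial extrapolation from equally spaced cached outputs:
    # 2 points -> linear, 3 points -> quadratic (Lagrange weights).
    if len(history) >= 3:
        return 3.0 * history[-1] - 3.0 * history[-2] + history[-3]
    if len(history) == 2:
        return 2.0 * history[-1] - history[-2]
    return history[-1]  # degenerate case: plain feature reuse

def change_rate(history, eps=1e-8):
    # Relative feature change rate between the two most recent outputs;
    # a dynamic step modulator can shrink the prediction horizon (or
    # trigger the corrector) when this value grows large.
    num = np.linalg.norm(history[-1] - history[-2])
    return num / (np.linalg.norm(history[-2]) + eps)
```

与朴素的特征复用(直接返回history[-1])相比,外推在输出平滑演化的轨迹段上能更好地抑制latent drift。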

[CV-19] DohaScript: A Large-Scale Multi-Writer Dataset for Continuous Handwritten Hindi Text

【速读】:该论文旨在解决手写天城文(Devanagari)文本在公开基准数据集中的严重代表性不足问题,现有资源规模有限、多聚焦于孤立字符或短词,且缺乏受控的词汇内容与书写者多样性,难以捕捉其连续、融合及结构复杂的书写特征(如通过共享的shirorekha连接字符并形成丰富的连笔形态)。解决方案的关键在于提出DohaScript——一个大规模、多书写者的手写印地语文本数据集,由531位独特贡献者完成,所有书写者均抄写相同的六首传统印地语dohas(对句),这种受控设计使研究能够系统分析书写者特异性差异而不受语言内容干扰,从而支持手写识别、书写者识别、风格分析和生成建模等任务。该数据集还包含非标识性人口统计学元数据、基于客观清晰度与分辨率标准的质量筛选机制以及页面级布局难度标注,确保了基准测试的分层性和可重复性。

链接: https://arxiv.org/abs/2602.18089
作者: Kunwar Arpit Singh,Ankush Prakash,Haroon R Lone
机构: IISER Bhopal(印度科学教育研究所博帕尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite having hundreds of millions of speakers, handwritten Devanagari text remains severely underrepresented in publicly available benchmark datasets. Existing resources are limited in scale, focus primarily on isolated characters or short words, and lack controlled lexical content and writer level diversity, which restricts their utility for modern data driven handwriting analysis. As a result, they fail to capture the continuous, fused, and structurally complex nature of Devanagari handwriting, where characters are connected through a shared shirorekha (horizontal headline) and exhibit rich ligature formations. We introduce DohaScript, a large scale, multi writer dataset of handwritten Hindi text collected from 531 unique contributors. The dataset is designed as a parallel stylistic corpus, in which all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). This controlled design enables systematic analysis of writer specific variation independent of linguistic content, and supports tasks such as handwriting recognition, writer identification, style analysis, and generative modeling. The dataset is accompanied by non identifiable demographic metadata, rigorous quality curation based on objective sharpness and resolution criteria, and page level layout difficulty annotations that facilitate stratified benchmarking. Baseline experiments demonstrate clear quality separation and strong generalization to unseen writers, highlighting the dataset’s reliability and practical value. DohaScript is intended to serve as a standardized and reproducible benchmark for advancing research on continuous handwritten Devanagari text in low resource script settings.

[CV-20] Comparative Assessment of Multimodal Earth Observation Data for Soil Moisture Estimation

【速读】:该论文旨在解决现有卫星土壤湿度(Soil Moisture, SM)产品空间分辨率过低(1 km)难以满足农田尺度应用的问题。其解决方案的关键在于构建一个高分辨率(10 m)的SM估计框架,融合Sentinel-1合成孔径雷达(Synthetic Aperture Radar, SAR)、Sentinel-2光学影像与ERA-5再分析数据,并采用机器学习方法进行建模。研究发现,通过结合当日Sentinel-2影像与Sentinel-1下降轨道数据的混合时间匹配策略,辅以10天历史ERA-5数据作为特征窗口,可实现R²=0.518的预测性能;同时,尽管引入了IBM-NASA Prithvi基础模型的嵌入特征,其表现与传统手工设计的光谱特征相当(R²=0.515 vs. 0.514),表明在数据稀疏场景下,基于领域知识的特征工程仍具竞争力。最终,该研究提出了一种基于特定光谱指数与树集成模型相结合的高效、实用方案,适用于欧洲范围内的田块级土壤湿度监测。

链接: https://arxiv.org/abs/2602.18083
作者: Ioannis Kontogiorgakis,Athanasios Askitopoulos,Iason Tsardanidis,Dimitrios Bormpoudakis,Ilias Tsoumas,Fotios Balampanis,Charalampos Kontoes
机构: BEYOND EO Centre, IAASARS, National Observatory of Athens (国家天文台); Wageningen University & Research (瓦赫宁根大学与研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper has been submitted to IEEE IGARSS 2026

点击查看摘要

Abstract:Accurate soil moisture (SM) estimation is critical for precision agriculture, water resources management and climate monitoring. Yet, existing satellite SM products are too coarse (1km) for farm-level applications. We present a high-resolution (10m) SM estimation framework for vegetated areas across Europe, combining Sentinel-1 SAR, Sentinel-2 optical imagery and ERA-5 reanalysis data through machine learning. Using 113 International Soil Moisture Network (ISMN) stations spanning diverse vegetated areas, we compare modality combinations with temporal parameterizations, using spatial cross-validation, to ensure geographic generalization. We also evaluate whether foundation model embeddings from IBM-NASA’s Prithvi model improve upon traditional hand-crafted spectral features. Results demonstrate that hybrid temporal matching - Sentinel-2 current-day acquisitions with Sentinel-1 descending orbit - achieves R^2=0.514, with 10-day ERA5 lookback window improving performance to R^2=0.518. Foundation model (Prithvi) embeddings provide negligible improvement over hand-crafted features (R^2=0.515 vs. 0.514), indicating traditional feature engineering remains highly competitive for sparse-data regression tasks. Our findings suggest that domain-specific spectral indices combined with tree-based ensemble methods offer a practical and computationally efficient solution for operational pan-European field-scale soil moisture monitoring.

[CV-21] Faster Training Fewer Labels: Self-Supervised Pretraining for Fine-Grained BEV Segmentation

【速读】:该论文旨在解决自动驾驶中密集鸟瞰图(Bird’s Eye View, BEV)语义地图构建对昂贵且不一致的BEV标注数据的高度依赖问题。当前多摄像头方法需全监督训练,导致标注成本高、数据效率低。解决方案的关键在于提出一种两阶段训练策略:首先在自监督预训练阶段,利用可微分重投影(differentiable reprojection)将BEVFormer的预测结果映射回图像平面,并与由Mask2Former生成的多视角语义伪标签进行对比学习,同时引入时序一致性损失增强模型泛化能力;随后在监督微调阶段仅使用50%的数据和更少的训练时间,即可显著提升BEV道路标记分割性能(mIoU最高提升2.5个百分点),实现标注数据量减半、训练时间减少三分之二的同时保持甚至超越全监督基线模型的效果。此方法通过相机视角伪标签与可微分重投影的结合,有效迁移了预训练阶段学到的BEV特征表示,为低标签依赖的自动驾驶感知提供了可扩展路径。

链接: https://arxiv.org/abs/2602.18066
作者: Daniel Busch,Christian Bohn,Thomas Kurbiel,Klaus Friedrichs,Richard Meyes,Tobias Meisen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This Paper has been accepted to the 2026 IEEE Intelligent Vehicles Symposium (IV)

点击查看摘要

Abstract:Dense Bird’s Eye View (BEV) semantic maps are central to autonomous driving, yet current multi-camera methods depend on costly, inconsistently annotated BEV ground truth. We address this limitation with a two-phase training strategy for fine-grained road marking segmentation that removes full supervision during pretraining and halves the amount of training data during fine-tuning while still outperforming the comparable supervised baseline model. During the self-supervised pretraining, BEVFormer predictions are differentiably reprojected into the image plane and trained against multi-view semantic pseudo-labels generated by the widely used semantic segmentation model Mask2Former. A temporal loss encourages consistency across frames. The subsequent supervised fine-tuning phase requires only 50% of the dataset and significantly less training time. With our method, the fine-tuning benefits from rich priors learned during pretraining boosting the performance and BEV segmentation quality (up to +2.5pp mIoU over the fully supervised baseline) on nuScenes. It simultaneously halves the usage of annotation data and reduces total training time by up to two thirds. The results demonstrate that differentiable reprojection plus camera perspective pseudo labels yields transferable BEV features and a scalable path toward reduced-label autonomous perception.
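其中"可微分重投影"的核心操作是按相机几何在图像平面上做双线性采样;下面给出单点双线性采样的numpy示意(实际训练中应使用自动微分框架,如PyTorch的grid_sample;此处仅演示插值权重的构成,坐标约定为示例假设):

```python
import numpy as np

def bilinear_sample(feat, u, v):
    # Sample feat[H, W] at continuous coords (u along W, v along H) with
    # bilinear weights; the same operation in an autodiff framework is
    # differentiable w.r.t. both the feature map and the coordinates,
    # which is what lets reprojected BEV predictions receive gradients.
    H, W = feat.shape
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * feat[v0, u0]
            + du * (1 - dv) * feat[v0, u1]
            + (1 - du) * dv * feat[v1, u0]
            + du * dv * feat[v1, u1])
```

由于四个角点的权重关于坐标连续,重投影后的预测可以直接与图像平面上的语义伪标签做逐像素损失并反向传播。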

[CV-22] 3DMedAgent : Unified Perception-to-Understanding for 3D Medical Analysis

【速读】:该论文旨在解决现有3D医学影像分析方法在任务特定建模与端到端范式之间存在的局限性,即难以系统积累感知证据以支持下游推理的问题;同时,针对多模态大语言模型(MLLM)主要面向2D图像设计、无法有效处理体积数据的缺陷,提出一种无需3D专属微调即可实现通用3D CT分析的统一代理框架——3DMedAgent。其核心创新在于通过灵活的MLLM代理协调异构视觉与文本工具,将复杂的3D分析任务逐步分解为从全局到局部、从3D体数据到2D切片、从视觉证据到结构化文本表示的可执行子任务,并借助长期结构化记忆机制聚合中间工具输出,从而支撑查询自适应、证据驱动的多步推理过程。

链接: https://arxiv.org/abs/2602.18064
作者: Ziyue Wang,Linghan Cai,Chang Han Low,Haofeng Liu,Junde Wu,Jingyu Wang,Rui Wang,Lei Song,Jiang Bian,Jingjing Fu,Yueming Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:3D CT analysis spans a continuum from low-level perception to high-level clinical understanding. Existing 3D-oriented analysis methods adopt either isolated task-specific modeling or task-agnostic end-to-end paradigms to produce one-hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D-oriented designs fundamentally limit their ability to perceive and analyze volumetric medical data. To bridge this gap, we propose 3DMedAgent, a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long-term structured memory that aggregates intermediate tool outputs and supports query-adaptive, evidence-driven multi-step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, highlighting a scalable path toward general-purpose 3D clinical analysis. Code and data are available at this https URL.

[CV-23] Temporal Consistency-Aware Text-to-Motion Generation

【速读】:该论文旨在解决文本到动作(Text-to-Motion, T2M)生成中因忽略跨序列时间一致性而导致的语义错位和物理不合理运动问题。现有两阶段框架虽利用离散动作表示推动了T2M研究进展,但未能建模不同实例间相同动作共享的时间结构,从而影响生成动作的连贯性与合理性。解决方案的关键在于提出TCA-T2M框架,其核心创新包括:1)引入时序一致性感知的空间量化变分自编码器(Temporal Consistency-aware Spatial VQ-VAE, TCaS-VQ-VAE),实现跨序列的时间对齐;2)采用掩码动作Transformer进行文本条件驱动的动作生成;3)设计运动学约束模块以减少离散化伪影,提升物理合理性。实验表明,该方法在HumanML3D和KIT-ML基准上达到最先进性能,验证了时序一致性对高质量T2M生成的重要性。

链接: https://arxiv.org/abs/2602.18057
作者: Hongsong Wang,Wenjing Yan,Qiuxia Lai,Xin Geng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is on this https URL

点击查看摘要

Abstract:Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent T2M generation.
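TCaS-VQ-VAE中的离散化步骤沿用标准VQ-VAE的最近邻码本查找,可示意如下(通用numpy草图,仅演示量化这一公共步骤,非论文实现):

```python
import numpy as np

def vector_quantize(z, codebook):
    # Nearest-neighbour codebook lookup, the standard VQ-VAE bottleneck:
    # each continuous latent vector is replaced by its closest code.
    # z: (N, D) latents, codebook: (K, D) learned codes.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx
```

摘要指出这一硬性离散化会引入伪影(discretization artifacts),论文因此额外用运动学约束模块缓解量化带来的物理不合理性。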

[CV-24] CityGuard: Graph-Aware Private Descriptors for Bias-Resilient Identity Search Across Urban Cameras

【速读】:该论文旨在解决城市尺度下分布式摄像头系统中跨视图人员重识别(person re-identification)问题,其核心挑战包括视角变化、遮挡和域偏移(domain shift)带来的外观差异,同时需遵守数据保护法规禁止原始图像共享。解决方案的关键在于提出CityGuard框架,该框架通过三个创新组件实现隐私保护下的高效身份检索:(1) 基于特征分布自适应调整实例级边距的度量学习器,提升类内紧凑性;(2) 引入空间条件注意力机制,利用粗粒度几何先验(如GPS或楼层平面图)在图结构自注意力中实现投影一致的跨视角对齐,无需高精度标定;(3) 结合差分隐私嵌入映射与紧凑近似索引,支持安全且成本可控的部署。上述设计共同提升了描述符对视角变化、遮挡和域偏移的鲁棒性,并在严格差分隐私约束下实现隐私与效用的可调平衡。

链接: https://arxiv.org/abs/2602.18047
作者: Rong Fu,Wenxin Zhang,Yibo Meng,Jia Yee Tan,Jiaxuan Lu,Rui Lu,Jiekai Wu,Zhaolu Kang,Simon Fong
机构: University of Macau (澳门大学); University of the Chinese Academy of Sciences (中国科学院大学); Tsinghua University (清华大学); Renmin University of China (中国人民大学); Shanghai AI Laboratory (上海人工智能实验室); The Hong Kong University of Science and Technology (香港科技大学); Juntendo University (顺天堂大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 36 pages, 12 figures

点击查看摘要

Abstract:City-scale person re-identification across distributed cameras must handle severe appearance changes from viewpoint, occlusion, and domain shift while complying with data protection rules that prevent sharing raw imagery. We introduce CityGuard, a topology-aware transformer for privacy-preserving identity retrieval in decentralized surveillance. The framework integrates three components. A dispersion-adaptive metric learner adjusts instance-level margins according to feature spread, increasing intra-class compactness. Spatially conditioned attention injects coarse geometry, such as GPS or deployment floor plans, into graph-based self-attention to enable projectively consistent cross-view alignment using only coarse geometric priors without requiring survey-grade calibration. Differentially private embedding maps are coupled with compact approximate indexes to support secure and cost-efficient deployment. Together these designs produce descriptors robust to viewpoint variation, occlusion, and domain shifts, and they enable a tunable balance between privacy and utility under rigorous differential-privacy accounting. Experiments on Market-1501 and additional public benchmarks, complemented by database-scale retrieval studies, show consistent gains in retrieval precision and query throughput over strong baselines, confirming the practicality of the framework for privacy-critical urban identity matching.
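摘要中的"差分隐私嵌入映射"通常基于"L2裁剪+高斯噪声"的高斯机制;以下numpy草图仅演示这一通用步骤(函数接口为示例假设,噪声系数sigma需按(ε, δ)隐私预算经高斯机制核算,本文未给出具体取值):

```python
import numpy as np

def privatize_embedding(x, clip=1.0, sigma=0.5, rng=None):
    # Gaussian-mechanism sketch: clip the descriptor to L2 norm <= clip,
    # bounding per-record sensitivity, then add isotropic Gaussian noise
    # whose scale trades privacy against retrieval utility.
    rng = np.random.default_rng(0) if rng is None else rng
    scale = min(1.0, clip / (np.linalg.norm(x) + 1e-12))
    return x * scale + rng.normal(0.0, sigma * clip, size=x.shape)
```

裁剪保证了单条记录对发布向量的影响有界,这正是摘要所述"隐私-效用可调平衡"中隐私端的来源;sigma越大隐私越强、检索精度越低。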

[CV-25] Spatio-temporal Decoupled Knowledge Compensator for Few-Shot Action Recognition

【速读】:该论文旨在解决少样本动作识别(Few-Shot Action Recognition, FSAR)中因仅依赖粗粒度动作名称作为辅助上下文而导致的视觉特征判别能力不足问题,尤其在捕捉新动作类别中的细粒度空间结构和多样时间模式方面存在局限。其解决方案的关键在于提出DiST框架——一种基于分解与融合机制的创新方法:首先在分解阶段将原始动作名称解耦为多样化的时空属性描述(即动作相关常识知识),从而从空间和时间两个维度补充语义上下文;随后在融合阶段引入空间/时间知识补偿器(Spatial/Temporal Knowledge Compensator, SKC/TKC),分别用于学习对象级原型(通过空间知识引导patch token自适应聚合)和帧级原型(借助时间属性建模帧间时序关系)。该设计使模型能够显式地提取多粒度、可解释的原型表示,显著提升了对新动作类别的识别性能。

链接: https://arxiv.org/abs/2602.18043
作者: Hongyu Qu,Xiangbo Shu,Rui Yan,Hailiang Gao,Wenguan Wang,Jinhui Tang
机构: Nanjing University of Science and Technology (南京理工大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to TPAMI 2026

点击查看摘要

Abstract:Few-Shot Action Recognition (FSAR) is a challenging task that requires recognizing novel action categories with a few labeled videos. Recent works typically apply semantically coarse category names as auxiliary contexts to guide the learning of discriminative visual features. However, such context provided by the action names is too limited to provide sufficient background knowledge for capturing novel spatial and temporal concepts in actions. In this paper, we propose DiST, an innovative Decomposition-incorporation framework for FSAR that makes use of decoupled Spatial and Temporal knowledge provided by large language models to learn expressive multi-granularity prototypes. In the decomposition stage, we decouple vanilla action names into diverse spatio-temporal attribute descriptions (action-related knowledge). Such commonsense knowledge complements semantic contexts from spatial and temporal perspectives. In the incorporation stage, we propose Spatial/Temporal Knowledge Compensators (SKC/TKC) to discover discriminative object-level and frame-level prototypes, respectively. In SKC, object-level prototypes adaptively aggregate important patch tokens under the guidance of spatial knowledge. Moreover, in TKC, frame-level prototypes utilize temporal attributes to assist in inter-frame temporal relation modeling. These learned prototypes thus provide transparency in capturing fine-grained spatial details and diverse temporal patterns. Experimental results show DiST achieves state-of-the-art results on five standard FSAR datasets.

[CV-26] Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

【速读】:该论文旨在解决基于扩散模型(Diffusion-based)的图像编辑中“无需训练即可控制编辑强度”的关键问题,尤其针对Diffusion Transformer (DiT)架构中存在的注意力机制利用不充分的问题。现有方法仅通过调整Key空间来调节注意力路由,忽略了Value空间在特征聚合中的作用。解决方案的关键在于提出Dual-Channel Attention Guidance (DCAG),首次揭示DiT多模态注意力层中Key与Value投影均具有显著的bias-delta结构,并在此基础上同时操纵Key通道(控制关注位置)和Value通道(控制聚合内容),从而构建出二维参数空间(δₖ, δᵥ),实现粗粒度与细粒度协同调控,显著提升编辑精度与保真度,尤其在局部编辑任务如对象删除和添加中表现突出。

链接: https://arxiv.org/abs/2602.18022
作者: Guandong Li,Mengxia Ye
机构: iFLYTEK(科大讯飞); Aegon THTF
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space – which governs feature aggregation – entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT’s multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space (\delta_k, \delta_v) enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
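依据摘要所述的 bias-delta 结构,下面给出一个简化的 numpy 示意(非原论文实现):将 Key/Value 中每个 token 相对层级偏置向量的偏移按 (1+δ) 缩放,即可同时体现粗粒度(Key 通道经 softmax 非线性作用)与细粒度(Value 通道线性加权求和)的双通道引导。函数名、以均值估计偏置向量、以及缩放形式均为示意性假设。

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_channel_attention(Q, K, V, delta_k=0.0, delta_v=0.0):
    """Toy dual-channel guidance: scale each token's deviation from a
    layer-specific bias vector (estimated here as the token mean) by
    (1 + delta) in the Key and Value channels, then run attention."""
    bias_k = K.mean(axis=0, keepdims=True)
    bias_v = V.mean(axis=0, keepdims=True)
    K_mod = bias_k + (1.0 + delta_k) * (K - bias_k)  # coarse knob (via softmax)
    V_mod = bias_v + (1.0 + delta_v) * (V - bias_v)  # fine knob (linear mixing)
    d = Q.shape[-1]
    attn = softmax(Q @ K_mod.T / np.sqrt(d))
    return attn @ V_mod
```

当 δ_k = δ_v = 0 时退化为标准注意力,因此可作为编辑强度的连续调节旋钮。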

[CV-27] UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在执行机器人操作任务时,因缺乏对不确定性的有效感知而导致动作生成不够可靠的问题。现有方法通常依赖额外的观测线索(如深度图、点云)或辅助模块(如目标检测器、编码器)来提升性能,但这些方案往往需要昂贵的数据采集和额外训练成本。其解决方案的关键在于提出一种无需训练、可即插即用的不确定性感知观测重注入机制(Uncertainty-aware Observation Reinjection, UAOR):当语言模型层检测到高动作熵(Action Entropy)表示当前决策不确定性较高时,通过注意力检索将关键观测信息重新注入下一网络层的前馈网络(Feed-Forward Network, FFN),从而增强模型在推理阶段对观测信息的关注能力,实现更自信且忠实的动作生成。

链接: https://arxiv.org/abs/2602.18020
作者: Jiabing Yang,Yixiang Chen,Yuan Xu,Peiyan Li,Xiangnan Wu,Zichen Wen,Bowen Fang,Tao Yu,Zhengbo Zhang,Yingda Li,Kai Wang,Jing Liu,Nianfeng Liu,Yan Huang,Liang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as “key-value memory”, we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer’s Feed-Forward Network (FFN) through attention retrieval. This mechanism helps VLAs better attend to observations during inference, enabling more confident and faithful action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at this https URL.
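下面用一个极简 numpy 草图示意"动作熵门控的观测重注入"思路(非原论文实现,阈值、加性融合方式与 alpha 参数均为示意性假设):当某层动作分布的熵超过阈值(即决策不确定)时,将观测摘要注入送往下一层 FFN 的隐藏状态,否则原样传递。

```python
import numpy as np

def action_entropy(logits):
    """Shannon entropy of the action distribution implied by one layer's logits."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def maybe_reinject(layer_logits, hidden, obs_summary, threshold=1.0, alpha=0.5):
    """If the layer is uncertain (high action entropy), blend a retrieved
    observation summary back into the hidden state; else pass it through."""
    if action_entropy(layer_logits) > threshold:
        return hidden + alpha * obs_summary
    return hidden
```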

[CV-28] DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE

【速读】:该论文旨在解决安全导向视频理解(Security-oriented Video Understanding, SVU)领域中威胁因果分析能力不足的问题,即现有方法主要关注威胁事件(如枪击、抢劫)的检测与定位,而缺乏对威胁成因的有效生成与评估能力。为应对这一挑战,论文提出了一种新的深度安全导向视频理解任务(In-depth Security-oriented Video Understanding, DeepSVU),其核心目标是不仅识别和定位威胁,还能归因并评估威胁片段的成因。解决方案的关键在于提出统一物理世界正则化专家混合模型(Unified Physical-world Regularized MoE, UPRM),该模型包含两个核心组件:统一物理世界增强的专家混合(Unified Physical-world Enhanced MoE, UPE)模块用于建模从粗粒度到细粒度的物理世界信息(如人类行为、物体交互及背景上下文),以及物理世界权衡正则化器(Physical-world Trade-off Regularizer, PTR)用于自适应地平衡这些多尺度信息的影响。实验表明,UPRM在UCF-C和CUVA指令数据集上显著优于多种先进视频大语言模型(Video-LLMs)及非视觉语言模型方法,验证了物理世界信息建模与动态权衡机制在DeepSVU任务中的重要性与有效性。

链接: https://arxiv.org/abs/2602.18019
作者: Yujie Jin,Wenxin Zhang,Jingjing Wang,Guodong Zhou
机构: Soochow University (苏州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the literature, prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localizing threats (e.g., shootings, robberies) in videos, while largely lacking the effective capability to generate and evaluate the threat causes. Motivated by these gaps, this paper introduces a new chat-paradigm SVU task, i.e., In-depth Security-oriented Video Understanding (DeepSVU), which aims to not only identify and locate the threats but also attribute and evaluate the causes of threatening segments. Furthermore, this paper reveals two key challenges in the proposed task: 1) how to effectively model the coarse-to-fine physical-world information (e.g., human behavior, object interactions and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, this paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components: the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), to address the above two challenges, respectively. Extensive experiments conducted on our DeepSVU instruction datasets (i.e., UCF-C instructions and CUVA instructions) demonstrate that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. Such results justify the importance of the coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of our UPRM in capturing such information.

[CV-29] Towards LLM-centric Affective Visual Customization via Efficient and Precise Emotion Manipulating

【速读】:该论文旨在解决现有视觉定制方法中忽视主观情感内容的问题,尤其是缺乏能够通用处理情感导向图像编辑的基础模型。为应对这一挑战,作者提出了以大语言模型(Large Language Model, LLM)为核心的“情感化视觉定制”(Affective Visual Customization, L-AVC)任务,其核心目标是通过多模态LLM对图像的主观情绪进行可控修改。解决方案的关键在于提出一种高效且精确的情感操纵方法(Efficient and Precise Emotion Manipulating, EPEM),包含两个核心模块:一是高效跨情感语义转换模块(Efficient Inter-emotion Converting, EIC),用于确保编辑前后情感语义的一致性对齐;二是精确外情感语义保留模块(Precise Exter-emotion Retaining, PER),用于精准保留与情感无关的内容信息。实验证明,EPEM在所构建的L-AVC数据集上显著优于多个前沿基线方法,验证了情感信息在L-AVC中的重要性及EPEM的有效性。

链接: https://arxiv.org/abs/2602.18016
作者: Jiamin Luo,Xuqian Gu,Jingjing Wang,Jiahong Lu
机构: Soochow University (苏州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Previous studies on visual customization primarily rely on the objective alignment between various control signals (e.g., language, layout and canny) and the edited images, which largely ignore the subjective emotional contents, and more importantly lack general-purpose foundation models for affective visual customization. With this in mind, this paper proposes an LLM-centric Affective Visual Customization (L-AVC) task, which focuses on generating images within modifying their subjective emotions via Multimodal LLM. Further, this paper contends that how to make the model efficiently align emotion conversion in semantics (named inter-emotion semantic conversion) and how to precisely retain emotion-agnostic contents (named exter-emotion semantic retaining) are rather important and challenging in this L-AVC task. To this end, this paper proposes an Efficient and Precise Emotion Manipulating (EPEM) approach for editing subjective emotions in images. Specifically, an Efficient Inter-emotion Converting (EIC) module is tailored to make the LLM efficiently align emotion conversion in semantics before and after editing, followed by a Precise Exter-emotion Retaining (PER) module to precisely retain the emotion-agnostic contents. Comprehensive experimental evaluations on our constructed L-AVC dataset demonstrate the great advantage of the proposed EPEM approach to the L-AVC task over several state-of-the-art baselines. This justifies the importance of emotion information for L-AVC and the effectiveness of EPEM in efficiently and precisely manipulating such information.

[CV-30] MUOT_3M: A 3 Million Frame Multimodal Underwater Benchmark and the MUTrack Tracking Method

【速读】:该论文旨在解决水下目标跟踪(Underwater Object Tracking, UOT)领域因缺乏大规模、多模态且多样化的数据集而导致的模型鲁棒性不足问题,尤其是在严重颜色失真、浑浊度高和低可见度等复杂水下环境中。现有基准数据集规模小且仅包含RGB图像,难以支撑高性能模型训练与评估。为此,作者提出了MUOT_3M,首个伪多模态UOT基准数据集,包含300万帧来自3030段视频(总计27.8小时),并标注了32个跟踪属性、677个细粒度类别,以及同步的RGB、估计增强RGB、估计深度和语言模态,经海洋生物学家验证。在此基础上,进一步提出MUTrack,一种基于Segment Anything Model (SAM) 的多模态到单模态跟踪器,其核心创新在于视觉几何对齐、视觉-语言融合及四级知识蒸馏机制,将多模态知识有效迁移至轻量级单模态学生模型中,从而在保持高精度(AUC提升达8.40%,精度提升7.80%)的同时实现24 FPS的实时推理速度,为可扩展、多模态训练但实际可部署的水下跟踪任务建立了新范式。

链接: https://arxiv.org/abs/2602.18006
作者: Ahsan Baidar Bakht,Mohamad Alansari,Muhayy Ud Din,Muzammal Naseer,Sajid Javed,Irfan Hussain,Jiri Matas,Arif Mahmood
机构: Khalifa University (哈利法大学); Czech Technical University (捷克技术大学); Information Technology University (信息科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater Object Tracking (UOT) is crucial for efficient marine robotics, large-scale ecological monitoring, and ocean exploration; however, progress has been hindered by the scarcity of large, multimodal, and diverse datasets. Existing benchmarks remain small and RGB-only, limiting robustness under severe color distortion, turbidity, and low visibility conditions. We introduce MUOT_3M, the first pseudo-multimodal UOT benchmark comprising 3 million frames from 3,030 videos (27.8h) annotated with 32 tracking attributes, 677 fine-grained classes, and synchronized RGB, estimated enhanced RGB, estimated depth, and language modalities validated by a marine biologist. Building upon MUOT_3M, we propose MUTrack, a SAM-based multimodal-to-unimodal tracker featuring visual geometric alignment, vision-language fusion, and four-level knowledge distillation that transfers multimodal knowledge into a unimodal student model. Extensive evaluations across five UOT benchmarks demonstrate that MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than the strongest SOTA baselines while running at 24 FPS. MUOT_3M and MUTrack establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking.

[CV-31] Image Quality Assessment: Exploring Quality Awareness via Memory-driven Distortion Patterns Matching

【速读】:该论文旨在解决现有全参考图像质量评估(Full-Reference Image Quality Assessment, FR-IQA)方法对高质量参考图像依赖性强的问题,从而限制了其在真实场景中的应用。解决方案的关键在于提出了一种基于记忆驱动的质量感知框架(Memory-Driven Quality-Aware Framework, MQAF),通过构建一个存储失真模式的记忆库,并动态切换双模式质量评估策略:当参考图像可用时,结合参考信息与记忆库中的失真模式进行自适应加权比较以获得参考引导的质量评分;当无参考图像时,则仅依赖记忆库中的失真模式实现无参考图像质量评估(No-Reference IQA)。该设计显著降低了对理想参考源的依赖,同时在多个数据集上优于当前最先进方法。

链接: https://arxiv.org/abs/2602.18000
作者: Xuting Lan,Mingliang Zhou,Xuekai Wei,Jielu Yan,Yueting Huang,Huayan Pu,Jun Luo,Weijia Jia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing full-reference image quality assessment (FR-IQA) methods achieve high-precision evaluation by analysing feature differences between reference and distorted images. However, their performance is constrained by the quality of the reference image, which limits real-world applications where ideal reference sources are unavailable. Notably, the human visual system has the ability to accumulate visual memory, allowing image quality assessment on the basis of long-term memory storage. Inspired by this biological memory mechanism, we propose a memory-driven quality-aware framework (MQAF), which establishes a memory bank for storing distortion patterns and dynamically switches between dual-mode quality assessment strategies to reduce reliance on high-quality reference images. When reference images are available, MQAF obtains reference-guided quality scores by adaptively weighting reference information and comparing the distorted image with stored distortion patterns in the memory bank. When the reference image is absent, the framework relies on distortion patterns in the memory bank to infer image quality, enabling no-reference quality assessment (NR-IQA). The experimental results show that our method outperforms state-of-the-art approaches across multiple datasets while adapting to both no-reference and full-reference tasks.
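下面给出记忆库匹配的一个简化示意(非原论文实现):失真模式以归一化嵌入存储,无参考模式下通过软匹配对模式质量加权;给定参考图像时再与参考相似度加权融合。softmax 软匹配与 ref_weight 加权形式均为示意性假设。

```python
import numpy as np

class DistortionMemoryBank:
    """Stores prototype distortion-pattern embeddings with associated
    quality values, and scores a feature in NR or FR mode."""
    def __init__(self, patterns, qualities):
        self.patterns = patterns / np.linalg.norm(patterns, axis=1, keepdims=True)
        self.qualities = qualities

    def score(self, feat, ref_feat=None, ref_weight=0.5):
        f = feat / np.linalg.norm(feat)
        sims = self.patterns @ f
        w = np.exp(sims) / np.exp(sims).sum()      # soft match over the memory
        nr_score = float(w @ self.qualities)        # no-reference estimate
        if ref_feat is None:
            return nr_score                         # NR mode
        r = ref_feat / np.linalg.norm(ref_feat)
        fidelity = float(f @ r)                     # reference similarity
        return ref_weight * fidelity + (1 - ref_weight) * nr_score  # FR mode
```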

[CV-32] ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中缺乏三维(3D)空间理解能力的问题,尤其是在预训练阶段仅使用二维(2D)数据时,导致模型难以有效捕捉空间结构信息。现有方法通常仅在单一层级进行表示对齐,未能充分利用多层特征中的丰富信息;而直接的多层对齐则易引发梯度干扰,影响训练稳定性与性能。解决方案的关键在于提出一种残差导向的多层表示对齐框架ROCKET,其核心创新是将多层对齐建模为两个残差流之间的对齐,并采用共享投影器(shared projector)实现VLA骨干网络与强大3D视觉基础模型之间多层特征的层不变映射,从而显著减少梯度冲突。此外,ROCKET引入Matryoshka风格的稀疏激活机制以平衡多个对齐损失,在无需额外训练的情况下仅消耗约4%的计算预算即可达到LIBERO基准上98.5%的最先进成功率,验证了其高效性与泛化能力。

链接: https://arxiv.org/abs/2602.17951
作者: Guoheng Sun,Tingting Du,Kaixi Feng,Chenxiang Luo,Xingguo Ding,Zheyu Shen,Ziyao Wang,Yexiao He,Ang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving 98.5% state-of-the-art success rate on LIBERO. We further demonstrate the superior performance of ROCKET across LIBERO-Plus and RoboTwin, as well as multiple VLA models. The code and model weights can be found at this https URL.
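共享投影器的多层残差对齐可以用如下 numpy 草图说明(非原论文实现,余弦对齐损失的具体形式为示意性假设):所有被选中的 VLA 层经同一个层不变投影 W 映射后,与对应的教师层特征做余弦对齐,从而避免为每层各配一个投影器。

```python
import numpy as np

def residual_alignment_loss(vla_layers, teacher_layers, W):
    """Multi-layer alignment with ONE shared projector W: each selected
    VLA layer is mapped through the same layer-invariant projection and
    matched (cosine similarity) to the corresponding teacher layer."""
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return (a * b).sum(-1).mean()
    losses = [1.0 - cos(h @ W, t) for h, t in zip(vla_layers, teacher_layers)]
    return float(np.mean(losses))
```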

[CV-33] ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging

【速读】:该论文旨在解决传统视觉Transformer(Vision Transformer, ViT)在医学影像等场景中因固定位置嵌入(positional embeddings)和分类令牌([CLS] token)引入的先验空间结构限制,导致模型泛化能力受限的问题。其解决方案的关键在于提出ZACH-ViT(Zero-token Adaptive Compact Hierarchical Vision Transformer),通过移除位置嵌入和[CLS] token实现排列不变性(permutation invariance),利用全局平均池化对图像块表示进行聚合;同时引入自适应残差投影以保障小规模模型训练稳定性并严格控制参数量,从而在无预训练、极低参数量(0.25M)条件下仍保持良好性能,适用于资源受限的临床边缘部署环境。

链接: https://arxiv.org/abs/2602.17929
作者: Athanasios Angelakis
机构: BioML Lab(生物机器学习实验室); RI CODE(研究与创新中心); UniBW(慕尼黑应用技术大学); EDS(电子数据科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 15 pages, 12 figures, 7 tables. Code and models available at this https URL

点击查看摘要

Abstract:Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may hinder generalization when spatial layout is weakly informative or inconsistent, a frequent condition in medical imaging and edge-deployed clinical systems. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. The term “Zero-token” specifically refers to removing the dedicated [CLS] aggregation token and positional embeddings; patch tokens remain unchanged and are processed normally. Adaptive residual projections preserve training stability in compact configurations while maintaining a strict parameter budget. Evaluation is performed across seven MedMNIST datasets spanning binary and multi-class tasks under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five random seeds). The empirical analysis demonstrates regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST), consistent with the architectural hypothesis. These findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH-ViT achieves competitive performance while maintaining sub-second inference times, supporting deployment in resource-constrained clinical environments. Code and models are available at this https URL.
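去掉 [CLS] token 与位置嵌入后的分类头,等价于"对 patch token 做全局平均池化再接线性层",其排列不变性可直接验证。以下为示意性草图(非原实现,函数名为假设):

```python
import numpy as np

def zero_token_head(patch_tokens, W_cls):
    """Classification without a [CLS] token or positional embeddings:
    global average pooling over patch tokens, then a linear head.
    Permutation-invariant by construction, since the mean ignores order."""
    pooled = patch_tokens.mean(axis=0)
    return pooled @ W_cls
```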

[CV-34] A Single Image and Multimodality Is All You Need for Novel View Synthesis

【速读】:该论文旨在解决基于扩散模型(diffusion-based)的单图像新视角合成(novel view synthesis)中因依赖单目深度估计(monocular depth estimation)而导致的几何一致性差和视觉质量低的问题,尤其在纹理稀疏、恶劣天气及遮挡严重的现实场景下表现脆弱。解决方案的关键在于引入极稀疏的多模态距离测量数据(如汽车雷达或LiDAR),构建一种基于角度域局部高斯过程(localized Gaussian Process)建模的多模态深度重建框架,从而生成鲁棒且带有不确定性量化(uncertainty quantification)的稠密深度图,作为现有扩散渲染流水线中的几何条件输入,无需修改生成模型本身即可显著提升新视角视频生成的几何一致性和视觉质量。

链接: https://arxiv.org/abs/2602.17909
作者: Amirhosein Javadi,Chi-Shiang Gau,Konstantinos D. Polyzos,Tara Javidi
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference while explicitly quantifying uncertainty in regions with limited observations. The reconstructed depth and uncertainty are used as a drop-in replacement for monocular depth estimators in existing diffusion-based rendering pipelines, without modifying the generative model itself. Experiments on real-world multimodal driving scenes demonstrate that replacing vision-only depth with our sparse range-based reconstruction substantially improves both geometric consistency and visual quality in single-image novel-view video generation. These results highlight the importance of reliable geometric priors for diffusion-based view synthesis and demonstrate the practical benefits of multimodal sensing even at extreme levels of sparsity.
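摘要中的角度域高斯过程可以用标准一维 GP 回归在角度坐标上示意(非原论文的"局部化"实现,RBF 核与超参数均为示意性假设):稀疏距离测量被插值为稠密深度,同时给出预测方差作为逐角度的不确定性量化。

```python
import numpy as np

def gp_depth(train_ang, train_depth, query_ang, length=0.1, sigma_f=1.0, noise=1e-2):
    """1-D Gaussian Process regression over an angular coordinate:
    interpolates sparse range measurements into dense depth, returning a
    predictive variance that grows away from the measurements."""
    def k(a, b):
        d = a[:, None] - b[None, :]
        return sigma_f**2 * np.exp(-0.5 * (d / length) ** 2)
    K = k(train_ang, train_ang) + noise * np.eye(len(train_ang))
    Ks = k(query_ang, train_ang)
    mean = Ks @ np.linalg.solve(K, train_depth)
    var = sigma_f**2 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, var
```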

[CV-35] Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models

【速读】:该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在细粒度图像分类基准上表现落后的问题,尽管其在视觉推理、文档理解及多模态对话等任务中取得了显著进展。研究发现,这种性能差异可能源于模型架构和预训练策略对细粒度视觉知识建模的不足。解决方案的关键在于:首先,使用更强大的语言模型(LLM)可均衡提升所有基准得分;其次,更强的视觉编码器(vision encoder)能显著改善细粒度分类性能;此外,预训练阶段尤其重要,当语言模型权重在预训练期间未被冻结时,细粒度性能提升更为明显。这些发现为增强VLMs的细粒度视觉理解能力和以视觉为中心的任务表现提供了重要指导。

链接: https://arxiv.org/abs/2602.17871
作者: Dhruba Ghosh,Yuhui Zhang,Ludwig Schmidt
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. These improvements are evident in a wide range of VLMs built on a variety of base models, alignment architectures, and training data. However, recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge. We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we find that the pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining. These insights pave the way for enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.

[CV-36] Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models

【速读】:该论文旨在解决长视频(长达数十分钟)理解中因视频序列固有冗余性所带来的挑战,具体包括:在有限内存条件下高效引入更多帧数,以及从海量输入数据中提取具有判别性的信息。其解决方案的关键在于提出一种端到端的框架,包含基于信息密度的自适应视频采样器(Adaptive Video Sampler, AVS)和基于自编码器的时空视频压缩器(Spatiotemporal Video Compressor, SVC),二者均与多模态大语言模型(Multimodal Large Language Model, MLLM)集成。该架构能自适应地捕捉不同长度视频序列中的关键信息,并实现高压缩率的同时保留重要判别特征,从而显著提升长视频理解任务的性能表现。

链接: https://arxiv.org/abs/2602.17869
作者: Yuxiao Chen,Jue Wang,Zhikang Zhang,Jingru Yi,Xu Zhang,Yang Zou,Zhaowei Cai,Jianbo Yuan,Xinyu Li,Hao Yang,Davide Modolo
机构: Amazon AGI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With recent advancements in video backbone architectures, combined with the remarkable achievements of large language models (LLMs), the analysis of long-form videos spanning tens of minutes has become both feasible and increasingly prevalent. However, the inherently redundant nature of video sequences poses significant challenges for contemporary state-of-the-art models. These challenges stem from two primary aspects: 1) efficiently incorporating a larger number of frames within memory constraints, and 2) extracting discriminative information from the vast volume of input data. In this paper, we introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two major advantages: it adaptively and effectively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information. The proposed framework demonstrates promising performance across various benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks. These results underscore the versatility and efficacy of our approach, particularly in managing the complexities of prolonged video sequences.
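以帧间差分幅度作为信息密度的代理,按其累积分布等步长取帧,即可得到摘要所述"基于信息密度的自适应采样"的一个最简版本(非原论文的 AVS 实现,密度定义与取帧规则均为示意性假设):静止片段被大幅下采样,变化剧烈的片段被密集采样。

```python
import numpy as np

def adaptive_sample(frames, budget):
    """Pick up to `budget` frame indices at equal steps of the cumulative
    inter-frame difference magnitude (a proxy for information density)."""
    diffs = np.abs(np.diff(frames, axis=0)).reshape(len(frames) - 1, -1).mean(axis=1)
    density = np.concatenate([[0.0], diffs]) + 1e-6   # tiny floor avoids zero mass
    cdf = np.cumsum(density) / density.sum()
    targets = (np.arange(budget) + 0.5) / budget
    idx = np.searchsorted(cdf, targets)
    return np.unique(np.clip(idx, 0, len(frames) - 1))
```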

[CV-37] On the Evaluation Protocol of Gesture Recognition for UAV-based Rescue Operation based on Deep Learning: A Subject-Independence Perspective

【速读】:该论文旨在解决手势识别(Gesture Recognition)研究中评估协议的有效性问题,特别是针对Liu和Szirányi提出的方法中存在的数据泄露(Data Leakage)风险。其核心发现是:原研究采用的帧级随机训练-测试划分方式导致同一被试样本被混入训练集和测试集,从而人为地提高了准确率指标,使得模型性能无法真实反映对未见过个体的泛化能力。解决方案的关键在于实施受试者独立的数据划分策略(Subject-Independent Data Partitioning),以确保评估结果能够可靠地衡量模型在面对新个体时的手势识别能力,这对于无人机-人类交互(UAV-Human Interaction)等实际应用场景至关重要。

链接: https://arxiv.org/abs/2602.17854
作者: Domonkos Varga
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a methodological analysis of the gesture-recognition approach proposed by Liu and Szirányi, with a particular focus on the validity of their evaluation protocol. We show that the reported near-perfect accuracy metrics result from a frame-level random train-test split that inevitably mixes samples from the same subjects across both sets, causing severe data leakage. By examining the published confusion matrix, learning curves, and dataset construction, we demonstrate that the evaluation does not measure generalization to unseen individuals. Our findings underscore the importance of subject-independent data partitioning in vision-based gesture-recognition research, especially for applications - such as UAV-human interaction - that require reliable recognition of gestures performed by previously unseen people.
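按受试者划分训练/测试集即可避免文中指出的帧级随机划分造成的数据泄露。下面是一个简化实现(函数名与参数为示意),保证同一受试者的帧只出现在一侧:

```python
import numpy as np

def subject_independent_split(subject_ids, test_fraction=0.3, seed=0):
    """Split frame indices so that no subject appears in both train and
    test, avoiding the leakage caused by frame-level random splits."""
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)
    n_test = max(1, int(round(test_fraction * len(subjects))))
    test_subjects = set(subjects[:n_test].tolist())
    mask = np.array([s in test_subjects for s in subject_ids])
    return np.where(~mask)[0], np.where(mask)[0]
```

实践中也可直接使用 scikit-learn 的 GroupShuffleSplit / GroupKFold 等按组划分工具。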

[CV-38] Neural Prior Estimation: Learning Class Priors from Latent Representations

【速读】:该论文旨在解决深度神经网络在类别不平衡数据集上产生的系统性偏差问题,这种偏差源于有效类先验的偏斜分布。解决方案的关键在于提出神经先验估计器(Neural Prior Estimator, NPE),其通过从潜在表示中学习特征条件下的对数先验估计值来实现自适应校正;NPE采用一个或多个先验估计模块,并与主干网络通过单向逻辑损失联合训练,在神经坍缩(Neural Collapse)条件下可理论证明恢复出类对数先验(up to an additive constant),无需显式类计数或特定分布的超参数,最终将学习到的先验用于logit调整(NPE-LA),形成一种有理论依据的、面向偏差感知的预测机制。

链接: https://arxiv.org/abs/2602.17853
作者: Masoud Yavari,Payman Moallem
机构: University of Isfahan (伊斯法罕大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Class imbalance induces systematic bias in deep neural networks by imposing a skewed effective class prior. This work introduces the Neural Prior Estimator (NPE), a framework that learns feature-conditioned log-prior estimates from latent representations. NPE employs one or more Prior Estimation Modules trained jointly with the backbone via a one-way logistic loss. Under the Neural Collapse regime, NPE is analytically shown to recover the class log-prior up to an additive constant, providing a theoretically grounded adaptive signal without requiring explicit class counts or distribution-specific hyperparameters. The learned estimate is incorporated into logit adjustment, forming NPE-LA, a principled mechanism for bias-aware prediction. Experiments on long-tailed CIFAR and imbalanced semantic segmentation benchmarks (STARE, ADE20K) demonstrate consistent improvements, particularly for underrepresented classes. NPE thus offers a lightweight and theoretically justified approach to learned prior estimation and imbalance-aware prediction.
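NPE-LA 所依赖的 logit 调整本身是经典技术:从 logits 中减去(估计的)类对数先验即可;由于 argmax 对加性常数不变,摘要中"up to an additive constant"的恢复已经足够。以下为示意(非原论文实现,NPE 中 log_prior 由潜在特征逐样本预测而非由类计数给出):

```python
import numpy as np

def logit_adjusted_prediction(logits, log_prior):
    """Logit adjustment: subtract the estimated class log-prior so the
    decision rule targets the balanced posterior; adding any constant to
    log_prior leaves the argmax unchanged."""
    return np.argmax(logits - log_prior, axis=-1)
```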

[CV-39] VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

【速读】:该论文旨在解决现有在线视频分割模型中因依赖复杂专用跟踪模块而导致的架构冗余与计算开销过大的问题。其解决方案的关键在于提出一种纯编码器结构的视频分割模型——Video Encoder-only Mask Transformer (VidEoMT),通过引入轻量级查询传播机制(query propagation mechanism)实现帧间时序建模,同时采用查询融合策略(query fusion strategy)将历史传播查询与独立学习的时序无关查询相结合,从而在不增加额外跟踪模块的前提下,兼顾时序一致性与对新内容的适应性,最终在保持高精度的同时显著提升推理速度(最高达160 FPS)。

链接: https://arxiv.org/abs/2602.17807
作者: Narges Norouzi,Idil Esen Zulfikar,Niccol`o Cavagnero,Tommie Kerssies,Bastian Leibe,Gijs Dubbelman,Daan de Geus
机构: Eindhoven University of Technology (埃因霍温理工大学); RWTH Aachen University (亚琛工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x–10x faster, running at up to 160 FPS with a ViT-L backbone. Code: this https URL
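查询传播与融合机制可以用如下玩具代码示意(非原论文实现,其中的"refinement"一步仅为占位,门控系数 gate 为示意性假设):每帧查询是上一帧传播查询与固定的时序无关可学习查询的门控混合。

```python
import numpy as np

def track_queries(per_frame_features, learned_queries, gate=0.5):
    """Query propagation sketch: each frame fuses the previous frame's
    (propagated) queries with fixed learned queries, then refines the
    result with the frame features (stand-in for the ViT encoder update)."""
    queries = learned_queries.copy()
    outputs = []
    for feat in per_frame_features:
        fused = gate * queries + (1.0 - gate) * learned_queries
        queries = fused + fused @ feat      # toy per-frame refinement
        outputs.append(queries)
    return outputs
```

gate 越大越偏向时序一致性,越小越偏向对新内容的适应。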

[CV-40] Enabling Training-Free Text-Based Remote Sensing Segmentation

【速读】:该论文旨在解决遥感图像中文本引导分割(text-guided segmentation)任务的泛化能力不足问题,现有方法通常依赖额外的可训练模块,限制了其在实际场景中的适用性。解决方案的关键在于提出一种无需训练或仅需轻量级LoRA微调的框架,通过整合对比式和生成式视觉语言模型(Vision Language Models, VLMs)与Segment Anything Model (SAM),实现完全零样本(zero-shot)或轻量级微调下的开放词汇语义分割(open-vocabulary semantic segmentation, OVSS)和指代表达分割(referring segmentation)。其中,对比方法利用CLIP作为掩码选择器对SAM的网格提议进行筛选,而生成方法则借助GPT-5或LoRA微调的Qwen-VL生成点击提示(click prompts)以驱动SAM完成推理和指代分割,实验表明后者性能最优。

链接: https://arxiv.org/abs/2602.17799
作者: Jose Sosa,Danila Rukhovich,Anis Kacem,Djamila Aouada
机构: SnT, University of Luxembourg (卢森堡大学 SnT 研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have opened new opportunities for zero-shot text-guided segmentation of remote sensing imagery. However, most existing approaches still rely on additional trainable components, limiting their generalisation and practical applicability. In this work, we investigate to what extent text-based remote sensing segmentation can be achieved without additional training, by relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM), enabling a fully training-free or lightweight LoRA-tuned pipeline. Our contrastive approach employs CLIP as mask selector for SAM’s grid-based proposals, achieving state-of-the-art open-vocabulary semantic segmentation (OVSS) in a completely zero-shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM using GPT-5 in a zero-shot setting and a LoRA-tuned Qwen-VL model, with the latter yielding the best results. Extensive experiments across 19 remote sensing benchmarks, including open-vocabulary, referring, and reasoning-based tasks, demonstrate the strong capabilities of our approach. Code will be released at this https URL.
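"CLIP 作为掩码选择器"的核心就是按文本-图像嵌入余弦相似度对 SAM 候选掩码排序。以下为示意性草图(假设各掩码区域与类别文本的嵌入均已由 CLIP 的图像/文本编码器得到,函数名为假设):

```python
import numpy as np

def select_masks(mask_embeds, text_embed, top_k=1):
    """Rank mask proposals (e.g. SAM's grid outputs) by cosine similarity
    between each masked crop's image embedding and the class text embedding."""
    m = mask_embeds / np.linalg.norm(mask_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    sims = m @ t
    order = np.argsort(-sims)
    return order[:top_k], sims
```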

[CV-41] LGD-Net: Latent-Guided Dual-Stream Network for HER2 Scoring with Task-Specific Domain Knowledge

【速读】:该论文旨在解决乳腺癌中HER2表达水平准确评估的问题,传统免疫组织化学(IHC)染色方法存在资源消耗大、成本高和耗时长等局限性,尤其在医疗资源匮乏地区难以普及。为替代IHC,研究提出直接从苏木精-伊红(HE)切片预测HER2表达水平的方案,其核心挑战在于如何避免像素级虚拟染色带来的计算复杂性和重建伪影问题。解决方案的关键在于提出Latent-Guided Dual-Stream Network(LGD-Net),通过跨模态特征幻觉(cross-modal feature hallucination)而非显式的图像生成方式,将HE图像中的形态学特征映射至分子潜在空间(molecular latent space),并利用教师IHC编码器进行训练引导;同时引入轻量级辅助正则化任务,基于细胞核分布与膜染色强度等临床相关先验知识对模型进行约束,从而提升诊断准确性并实现高效单模态HE输入推理。

链接: https://arxiv.org/abs/2602.17793
作者: Peide Zhu,Linbin Lu,Zhiqin Chen,Xiong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:It is a critical task to evaluate HER2 expression levels accurately for breast cancer evaluation and targeted treatment therapy selection. However, the standard multi-step Immunohistochemistry (IHC) staining is resource-intensive, expensive, and time-consuming, and is often unavailable in many areas. Consequently, predicting HER2 levels directly from HE slides has emerged as a potential alternative solution. It has been shown to be effective to use virtual IHC images from HE images for automatic HER2 scoring. However, the pixel-level virtual staining methods are computationally expensive and prone to reconstruction artifacts that can propagate diagnostic errors. To address these limitations, we propose the Latent-Guided Dual-Stream Network (LGD-Net), a novel framework that employs cross-modal feature hallucination instead of explicit pixel-level image generation. LGD-Net learns to map morphological HE features directly to the molecular latent space, guided by a teacher IHC encoder during training. To ensure the hallucinated features capture clinically relevant phenotypes, we explicitly regularize the model training with task-specific domain knowledge, specifically nuclei distribution and membrane staining intensity, via lightweight auxiliary regularization tasks. Extensive experiments on the public BCI dataset demonstrate that LGD-Net achieves state-of-the-art performance, significantly outperforming baseline methods while enabling efficient inference using single-modality HE inputs.


[CV-42] Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision

【速读】:该论文旨在解决结肠镜辅助导航中单目深度估计与位姿估计的挑战问题,尤其针对纹理缺失表面、复杂光照变化、图像形变以及缺乏真实体内标注数据等问题。其核心解决方案是提出一种名为PRISM(Pose-Refinement with Intrinsic Shading and edge Maps)的自监督学习框架,关键创新在于融合解剖结构先验与光照先验以引导几何学习:一方面利用基于学习的边缘检测器(如DexiNed或HED)提取高频率边界信息作为结构指导;另一方面通过内在分解模块实现亮度解耦(luminance decoupling),将阴影与反射分量分离,从而有效利用阴影线索提升深度估计精度。实验表明该方法在多个真实与合成数据集上均达到当前最优性能,并揭示了域真实性优于标注可用性及视频帧率对训练数据质量的关键影响。

链接: https://arxiv.org/abs/2602.17785
作者: Xinwei Ju,Rema Daher,Danail Stoyanov,Sophia Bano,Francisco Vasconcelos
机构: University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures; early accepted by IPCAI2026

点击查看摘要

Abstract:Monocular depth and pose estimation play an important role in the development of colonoscopy-assisted navigation, as they enable improved screening by reducing blind spots, minimizing the risk of missed or recurrent lesions, and lowering the likelihood of incomplete examinations. However, this task remains challenging due to the presence of texture-less surfaces, complex illumination patterns, deformation, and a lack of in-vivo datasets with reliable ground truth. In this paper, we propose PRISM (Pose-Refinement with Intrinsic Shading and edge Maps), a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. Our approach uniquely incorporates edge detection and luminance decoupling for structural guidance. Specifically, edge maps are derived using a learning-based edge detector (e.g., DexiNed or HED) trained to capture thin and high-frequency boundaries, while luminance decoupling is obtained through an intrinsic decomposition module that separates shading and reflectance, enabling the model to exploit shading cues for depth estimation. Experimental results on multiple real and synthetic datasets demonstrate state-of-the-art performance. We further conduct a thorough ablation study on training data selection to establish best practices for pose and depth estimation in colonoscopy. This analysis yields two practical insights: (1) self-supervised training on real-world data outperforms supervised training on realistic phantom data, underscoring the superiority of domain realism over ground truth availability; and (2) video frame rate is an extremely important factor for model performance, where dataset-specific video frame sampling is necessary for generating high quality training data.

[CV-43] CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild ICLR2026

【速读】:该论文旨在解决自然手部动作建模中存在的一系列挑战,尤其是现有方法在文本到手部动作生成(text-to-hand-motion generation)和手部动画描述(hand animation captioning)任务中依赖于昂贵且受限的实验室采集数据集,难以扩展至真实世界(in-the-wild)场景的问题。此外,当前模型在保持动画保真度与文本-动作对齐方面表现不足。解决方案的关键在于:(1) 构建首个大规模、多样化的“3D手部在野外”(3D Hands in the Wild, 3D-HIW)数据集,包含32K个3D手部动作序列及其对齐文本;(2) 提出基于大语言模型(LLM)的手部动画系统CLUTCH,其核心创新包括:a) SHIFT——一种分部件模态分解的向量量化变分自编码器(VQ-VAE),用于高效表征手部运动;b) 几何精修阶段,通过直接作用于解码后手部运动参数的重建损失对LLM进行微调,从而显著提升动画质量与文本-动作一致性。

链接: https://arxiv.org/abs/2602.17770
作者: Balamurugan Thambiraja,Omid Taheri,Radek Danecek,Giorgio Becherini,Gerard Pons-Moll,Justus Thies
机构: Technical University of Darmstadt (达姆施塔特工业大学); Max-Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); University of Tuebingen (图宾根大学); Tubingen AI Center (图宾根人工智能中心); Max Planck Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICLR2026; Project page: this https URL

点击查看摘要

Abstract:Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to “in-the-wild” settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce ‘3D Hands in the Wild’ (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in-the-wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.

[CV-44] KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding

【速读】:该论文旨在解决视频字幕生成模型在描述精细运动细节时的准确性不足以及严重幻觉(hallucination)问题,尤其是在以动作为中心的视频中,模型常忽略肢体动态等关键运动信息。其解决方案的关键在于提出了一种基于运动学计算与语言解析相结合的自动化标注流程,从而构建了KPM-Bench数据集,并设计了语言基础的运动解析与提取算法(MoPE),该算法可从文本字幕中精准提取运动属性,进而实现无需依赖大规模视觉-语言或纯语言模型的幻觉评估与抑制机制。通过将MoPE集成至GRPO后训练框架,显著提升了动作导向型视频字幕模型的可靠性。

链接: https://arxiv.org/abs/2602.17768
作者: Boda Lin,Yongjie Zhu,Xiaocheng Gong,Wenyu Qin,Meng Wang
机构: Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages

点击查看摘要

Abstract:Despite recent advancements, video captioning models still face significant limitations in accurately describing fine-grained motion details and suffer from severe hallucination issues. These challenges become particularly prominent when generating captions for motion-centric videos, where precise depiction of intricate movements and limb dynamics is crucial yet often neglected. To alleviate this gap, we introduce an automated annotation pipeline that integrates kinematic-based motion computation with linguistic parsing, enabling detailed decomposition and description of complex human motions. Based on this pipeline, we construct and release the Kinematic Parsing Motion Benchmark (KPM-Bench), a novel open-source dataset designed to facilitate fine-grained motion understanding. KPM-Bench consists of (i) fine-grained video-caption pairs that comprehensively illustrate limb-level dynamics in complex actions, (ii) diverse and challenging question-answer pairs focusing specifically on motion understanding, and (iii) a meticulously curated evaluation set specifically designed to assess hallucination phenomena associated with motion descriptions. Furthermore, to address hallucination issues systematically, we propose the linguistically grounded Motion Parsing and Extraction (MoPE) algorithm, capable of accurately extracting motion-specific attributes directly from textual captions. Leveraging MoPE, we introduce a precise hallucination evaluation metric that functions independently of large-scale vision-language or language-only models. By integrating MoPE into the GRPO post-training framework, we effectively mitigate hallucination problems, significantly improving the reliability of motion-centric video captioning models.

[CV-45] DesignAsCode: Bridging Structural Editability and Visual Fidelity in Graphic Design Generation

【速读】:该论文旨在解决图形设计生成中视觉保真度与结构可编辑性难以兼顾的问题。现有方法通常分为两类:不可编辑的位图图像合成或缺乏视觉内容的抽象版面生成,而两者结合的方法常因表达能力有限和开环特性导致构图僵化及视觉冲突(如文本与背景不协调)。解决方案的关键在于提出DesignAsCode框架,将图形设计重构为基于HTML/CSS的程序化合成任务,其核心是引入Plan-Implement-Reflect(规划-实现-反思)流程:通过语义规划器构建动态、多层嵌套的元素层次结构,并利用视觉感知的反思机制迭代优化代码以修正渲染缺陷,从而在保持高结构有效性的同时提升美学质量,并支持自动布局重定向、复杂文档生成及CSS动画等高级功能。

链接: https://arxiv.org/abs/2602.17690
作者: Ziyuan Liu,Shizhao Sun,Danqing Huang,Yingdong Shi,Meisheng Zhang,Ji Li,Jingsong Yu,Jiang Bian
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Graphic design generation demands a delicate balance between high visual fidelity and fine-grained structural editability. However, existing approaches typically bifurcate into either non-editable raster image synthesis or abstract layout generation devoid of visual content. Recent combinations of these two approaches attempt to bridge this gap but often suffer from rigid composition schemas and unresolvable visual dissonances (e.g., text-background conflicts) due to their inexpressive representation and open-loop nature. To address these challenges, we propose DesignAsCode, a novel framework that reimagines graphic design as a programmatic synthesis task using HTML/CSS. Specifically, we introduce a Plan-Implement-Reflect pipeline, incorporating a Semantic Planner to construct dynamic, variable-depth element hierarchies and a Visual-Aware Reflection mechanism that iteratively optimizes the code to rectify rendering artifacts. Extensive experiments demonstrate that DesignAsCode significantly outperforms state-of-the-art baselines in both structural validity and aesthetic quality. Furthermore, our code-native representation unlocks advanced capabilities, including automatic layout retargeting, complex document generation (e.g., resumes), and CSS-based animation.

[CV-46] Probabilistic NDVI Forecasting from Sparse Satellite Time Series and Weather Covariates

【速读】:该论文旨在解决在云层遮挡导致遥感数据稀疏且不规则采样的条件下,如何实现高精度的作物植被动态短期预测问题(即基于卫星观测的归一化植被指数NDVI预测)。其核心挑战在于复杂多变的气候条件与数据获取受限之间的矛盾。解决方案的关键在于提出一种概率性预测框架,采用基于Transformer的架构显式分离历史植被动态建模与未来外生信息(如气象变量)的处理机制,并引入时序距离加权分位数损失函数以适配不同预测时间窗口的不确定性分布;同时通过累积天气特征和极端天气特征工程增强对延迟气象效应的捕捉能力,从而提升模型在真实农业场景下的鲁棒性和泛化性能。

链接: https://arxiv.org/abs/2602.17683
作者: Irene Iele,Giulia Romoli,Daniele Molino,Elena Mulero Ayllón,Filippo Ruffini,Paolo Soda,Matteo Tortora
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Accurate short-term forecasting of vegetation dynamics is a key enabler for data-driven decision support in precision agriculture. Normalized Difference Vegetation Index (NDVI) forecasting from satellite observations, however, remains challenging due to sparse and irregular sampling caused by cloud coverage, as well as the heterogeneous climatic conditions under which crops evolve. In this work, we propose a probabilistic forecasting framework specifically designed for field-level NDVI prediction under clear-sky acquisition constraints. The method leverages a transformer-based architecture that explicitly separates the modeling of historical vegetation dynamics from future exogenous information, integrating historical NDVI observations with both historical and future meteorological covariates. To address irregular revisit patterns and horizon-dependent uncertainty, we introduce a temporal-distance weighted quantile loss that aligns the training objective with the effective forecasting horizon. In addition, we incorporate cumulative and extreme-weather feature engineering to better capture delayed meteorological effects relevant to vegetation response. Extensive experiments on European satellite data demonstrate that the proposed approach consistently outperforms a diverse set of statistical, deep learning, and recent time series baselines across both point-wise and probabilistic evaluation metrics. Ablation studies further highlight the central role of target history, while showing that meteorological covariates provide complementary gains when jointly exploited. The code is available at this https URL.
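摘要中的"时序距离加权分位数损失"本质上是带逐步权重的 pinball(分位数)损失。以下 numpy 草图演示其计算方式;论文中的权重依赖各预测步的有效时间距离,此处用均匀权重仅作示意:

```python
import numpy as np

def weighted_quantile_loss(y_true, y_pred, q, weights):
    """Pinball (quantile) loss with per-horizon-step weights.
    y_true, y_pred, weights: (T,) arrays over the forecast horizon.
    In the paper the weights depend on temporal distance; the values
    below are illustrative only.
    """
    err = y_true - y_pred
    pinball = np.maximum(q * err, (q - 1.0) * err)
    return float(np.sum(weights * pinball) / np.sum(weights))

y_true = np.array([0.5, 0.6, 0.7])   # observed NDVI
y_pred = np.array([0.4, 0.6, 0.9])   # forecast NDVI
w = np.array([1.0, 1.0, 1.0])        # uniform weights for illustration
loss = weighted_quantile_loss(y_true, y_pred, q=0.5, weights=w)
print(loss)  # 0.05: at q=0.5 this is half the mean absolute error
```

当 q=0.5 时该损失退化为(加权)绝对误差的一半,对应中位数预测;训练多个分位数即可得到概率性预测区间。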

[CV-47] Spatio-Spectroscopic Representation Learning using Unsupervised Convolutional Long-Short Term Memory Networks ICML

【速读】:该论文旨在解决如何从积分场光谱(Integral Field Spectroscopy, IFS)数据中自动提取跨空间和光谱维度的通用特征表示,以揭示星系演化中的潜在规律。其关键解决方案是提出一种基于卷积长短期记忆网络自编码器(Convolutional Long-Short Term Memory Network Autoencoders)的无监督深度学习框架,能够同时编码来自MaNGA IFS巡天样本中约9000个星系的19条光学发射线(波长范围3800Å–8000Å)的空间与光谱信息,从而实现对复杂天文数据的多维特征学习,并在290个活动星系核(Active Galactic Nuclei, AGN)样本上验证了模型对异常AGN的识别能力。

链接: https://arxiv.org/abs/2602.18426
作者: Kameswara Bharadwaj Mantha,Lucy Fortson,Ramanakumar Sankar,Claudia Scarlata,Chris Lintott,Sandor Kruk,Mike Walmsley,Hugh Dickinson,Karen Masters,Brooke Simmons,Rebecca Smethurst
机构: 未知
类目: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
备注: This manuscript was previously submitted to ICML for peer review. Reviewers noted that while the underlying VAE-based architecture builds on established methods, its application to spatially-resolved IFS data is promising for unsupervised representation learning in astronomy. This version is released for community visibility. Reviewer decisions: Weak accept and Weak reject (Final: Reject)

点击查看摘要

Abstract:Integral Field Spectroscopy (IFS) surveys offer a unique new landscape in which to learn in both spatial and spectroscopic dimensions and could help uncover previously unknown insights into galaxy evolution. In this work, we demonstrate a new unsupervised deep learning framework using Convolutional Long-Short Term Memory Network Autoencoders to encode generalized feature representations across both spatial and spectroscopic dimensions spanning 19 optical emission lines (3800 Å < λ < 8000 Å) among a sample of ~9000 galaxies from the MaNGA IFS survey. As a demonstrative exercise, we assess our model on a sample of 290 Active Galactic Nuclei (AGN) and highlight scientifically interesting characteristics of some highly anomalous AGN.

[CV-48] Exploiting Completeness Perception with Diffusion Transformer for Unified 3D MRI Synthesis

【速读】:该论文旨在解决多模态脑部磁共振成像(MRI)中缺失模态以及心脏MRI中缺失切片等缺失数据问题,这些问题在临床实践中严重影响图像质量与诊断准确性。传统方法依赖外部手动标注的掩码作为生成模型的指导信号,但在真实临床环境中,此类标注往往不可靠或难以获取,且提供的信息不足以提升语义一致性。解决方案的关键在于提出一种具备自感知完整性能力的通用潜在扩散模型 CoPeDiT,其核心创新包括:1)在编码器-解码器架构(CoPeVAE)中嵌入特定预训练任务,以学习具备完整性感知的判别性提示;2)设计专用于3D MRI合成的扩散Transformer架构 MDiT3D,利用上述提示引导生成过程,从而增强三维空间内的语义一致性。实验表明,该方法在三个大规模MRI数据集上显著优于现有最优方法,在鲁棒性、泛化性和灵活性方面均表现出色。

链接: https://arxiv.org/abs/2602.18400
作者: Junkai Liu,Nay Aung,Theodoros N. Arvanitis,Joao A. C. Lima,Steffen E. Petersen,Daniel C. Alexander,Le Zhang
机构: University of Birmingham (伯明翰大学); Queen Mary University London (伦敦玛丽女王大学); Barts Heart Centre (巴茨心脏中心); Johns Hopkins University School of Medicine (约翰霍普金斯大学医学院); University College London (伦敦大学学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Missing data problems, such as missing modalities in multi-modal brain MRI and missing slices in cardiac MRI, pose significant challenges in clinical practice. Existing methods rely on external guidance to supply detailed missing state for instructing generative models to synthesize missing MRIs. However, manual indicators are not always available or reliable in real-world scenarios due to the unpredictable nature of clinical environments. Moreover, these explicit masks are not informative enough to provide guidance for improving semantic consistency. In this work, we argue that generative models should infer and recognize missing states in a self-perceptive manner, enabling them to better capture subtle anatomical and pathological variations. Towards this goal, we propose CoPeDiT, a general-purpose latent diffusion model equipped with completeness perception for unified synthesis of 3D MRIs. Specifically, we incorporate dedicated pretext tasks into our tokenizer, CoPeVAE, empowering it to learn completeness-aware discriminative prompts, and design MDiT3D, a specialized diffusion transformer architecture for 3D MRI synthesis, that effectively uses the learned prompts as guidance to enhance semantic consistency in 3D space. Comprehensive evaluations on three large-scale MRI datasets demonstrate that CoPeDiT significantly outperforms state-of-the-art methods, achieving superior robustness, generalizability, and flexibility. The code is available at this https URL .

[CV-49] Quantum-enhanced satellite image classification

【速读】:该论文旨在解决多类图像分类任务中精度提升的难题,尤其是在空间应用(如卫星成像和遥感)等高风险、数据驱动场景下,传统经典机器学习方法面临性能瓶颈的问题。解决方案的关键在于提出一种量子特征提取方法,利用多体自旋哈密顿量的动力学特性生成具有表达力的量子特征,并将其与经典处理相结合,构建量子-经典混合架构。实验表明,该方法在ResNet50基线模型基础上,将分类准确率从83%提升至87%,相较于经典迁移学习方法(84%)实现了2–3%的绝对精度提升,且在多个IBM量子处理器上表现出一致的性能增益,验证了当前及近中期量子处理器在实际机器学习任务中的实用潜力。

链接: https://arxiv.org/abs/2602.18350
作者: Qi Zhang,Anton Simen,Carlos Flores-Garrigós,Gabriel Alvarado Barrios,Paolo A. Erdman,Enrique Solano,Aaron C. Kemp,Vincent Beltrani,Vedangi Pathak,Hamed Mohammadbagherpoor
机构: Kipu Quantum(基普量子); University of the Basque Country UPV/EHU(巴斯克大学); University of Valencia(瓦伦西亚大学); KPMG LLP(毕马威); IBM T. J. Watson Research Center(IBM托马斯·沃森研究中心)
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We demonstrate the application of a quantum feature extraction method to enhance multi-class image classification for space applications. By harnessing the dynamics of many-body spin Hamiltonians, the method generates expressive quantum features that, when combined with classical processing, lead to quantum-enhanced classification accuracy. Using a strong and well-established ResNet50 baseline, we achieved a maximum classical accuracy of 83%, which can be improved to 84% with a transfer learning approach. In contrast, applying our quantum-classical method the performance is increased to 87% accuracy, demonstrating a clear and reproducible improvement over robust classical approaches. Implemented on several of IBM’s quantum processors, our hybrid quantum-classical approach delivers consistent gains of 2-3% in absolute accuracy. These results highlight the practical potential of current and near-term quantum processors in high-stakes, data-driven domains such as satellite imaging and remote sensing, while suggesting broader applicability in real-world machine learning tasks.

[CV-50] RamanSeg: Interpretability-driven Deep Learning on Raman Spectra for Cancer Diagnosis

【速读】:该论文旨在解决传统病理切片诊断中依赖化学染色和人工判读的低效问题,提出基于拉曼光谱(Raman spectroscopy)的无染色、自动化组织分割方法。其关键解决方案是构建了一个新颖的空间拉曼光谱数据集,并采用nnU-Net训练分割模型,同时创新性地提出了可解释的原型驱动架构RamanSeg,该架构通过学习训练集中典型区域(prototype)来分类像素,实现高精度分割;其中无投影版本的RamanSeg在保持良好可解释性的同时,相较U-Net基线模型实现了更高的平均前景Dice分数(67.3%),显著优于黑箱式深度学习方法。

链接: https://arxiv.org/abs/2602.18119
作者: Chris Tomy,Mo Vali,David Pertzborn,Tammam Alamatouri,Anna Mühlig,Orlando Guntinas-Lichius,Anna Xylander,Eric Michele Fantuzzi,Matteo Negro,Francesco Crisafi,Pietro Lio,Tiago Azevedo
机构: University of Cambridge (剑桥大学); Jena University Hospital (耶拿大学医院); Cambridge Raman Imaging Srl (剑桥拉曼成像有限公司); University of Cambridge (剑桥大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:Histopathology, the current gold standard for cancer diagnosis, involves the manual examination of tissue samples after chemical staining, a time-consuming process requiring expert analysis. Raman spectroscopy is an alternative, stain-free method of extracting information from samples. Using nnU-Net, we trained a segmentation model on a novel dataset of spatial Raman spectra aligned with tumour annotations, achieving a mean foreground Dice score of 80.9%, surpassing previous work. Furthermore, we propose a novel, interpretable, prototype-based architecture called RamanSeg. RamanSeg classifies pixels based on discovered regions of the training set, generating a segmentation mask. Two variants of RamanSeg allow a trade-off between interpretability and performance: one with prototype projection and another projection-free version. The projection-free RamanSeg outperformed a U-Net baseline with a mean foreground Dice score of 67.3%, offering a meaningful improvement over a black-box training approach.
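RamanSeg 的原型机制可以用"最近原型分类"粗略示意:每个像素的光谱特征被归入距其最近的原型对应的类别,从而得到分割掩码。以下 numpy 草图中的原型与像素特征均为玩具数据,真实模型的原型是在训练集中学习(或投影)得到的:

```python
import numpy as np

def prototype_segment(pixel_features, prototypes, proto_labels):
    """Assign each pixel the label of its nearest prototype.
    pixel_features: (P, D) per-pixel spectral features
    prototypes:     (K, D) learned prototype vectors (toy values here)
    proto_labels:   (K,) class label of each prototype
    """
    # squared Euclidean distance from every pixel to every prototype
    d2 = ((pixel_features[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return proto_labels[np.argmin(d2, axis=1)]

protos = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1])            # e.g. 0 = healthy, 1 = tumour (toy)
pixels = np.array([[0.1, 0.0], [0.9, 1.1]])
seg = prototype_segment(pixels, protos, labels)
print(seg)  # [0 1]
```

这种按原型归类的预测天然可解释:每个像素的判定都能追溯到一个具体的参考原型。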

[CV-51] From Global Radiomics to Parametric Maps: A Unified Workflow Fusing Radiomics and Deep Learning for PDAC Detection

【速读】:该论文旨在解决现有医学影像分析方法中融合策略的局限性问题,即大多数放射组学(Radiomics)与深度学习结合的方法仅利用全局放射组学特征,忽略了空间分辨的放射组学参数图(parametric maps)所蕴含的互补信息。其解决方案的关键在于提出一个统一框架:首先筛选具有判别力的放射组学特征,随后将这些特征以多尺度方式注入到改进的nnUNet模型中,实现全局层面和体素层面(voxel level)的双重增强,从而有效提升胰腺导管腺癌(Pancreatic Ductal Adenocarcinoma, PDAC)检测性能。实验表明,该方法在多个数据集上均显著优于基线模型,并在PANORAMA Grand Challenge中取得第二名,验证了多层级特征融合的有效性。

链接: https://arxiv.org/abs/2602.17986
作者: Zengtian Deng,Yimeng He,Yu Shi,Lixia Wang,Touseef Ahmad Qureshi,Xiuzhen Huang,Debiao Li
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Radiomics and deep learning both offer powerful tools for quantitative medical imaging, but most existing fusion approaches only leverage global radiomic features and overlook the complementary value of spatially resolved radiomic parametric maps. We propose a unified framework that first selects discriminative radiomic features and then injects them into a radiomics-enhanced nnUNet at both the global and voxel levels for pancreatic ductal adenocarcinoma (PDAC) detection. On the PANORAMA dataset, our method achieved AUC = 0.96 and AP = 0.84 in cross-validation. On an external in-house cohort, it achieved AUC = 0.95 and AP = 0.78, outperforming the baseline nnUNet; it also ranked second in the PANORAMA Grand Challenge. This demonstrates that handcrafted radiomics, when injected at both global and voxel levels, provide complementary signals to deep learning models for PDAC detection. Our code can be found at this https URL
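论文在全局与体素两个层面注入放射组学特征;一种常见的实现思路是:把筛选后的参数图作为额外输入通道,并把全局特征广播为常数通道后与影像拼接。原文的具体注入机制以论文为准,以下 numpy 草图仅为该思路的假设性示意:

```python
import numpy as np

def inject_radiomics(image, parametric_maps, global_feats):
    """Concatenate voxel-level parametric maps and broadcast global
    radiomic features as constant channels (hypothetical scheme).
    image:           (1, D, H, W) CT volume
    parametric_maps: (M, D, H, W) selected radiomic parametric maps
    global_feats:    (G,) selected global radiomic features
    """
    d, h, w = image.shape[1:]
    global_ch = np.broadcast_to(
        global_feats[:, None, None, None], (len(global_feats), d, h, w))
    return np.concatenate([image, parametric_maps, global_ch], axis=0)

img = np.zeros((1, 2, 4, 4))
maps = np.ones((3, 2, 4, 4))
x = inject_radiomics(img, maps, np.array([0.7, 1.3]))
print(x.shape)  # (6, 2, 4, 4): 1 image + 3 map + 2 global channels
```

拼接后的多通道体数据即可作为 nnUNet 类分割网络的输入,使网络同时"看到"原始影像与放射组学先验。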

[CV-52] MeDUET: Disentangled Unified Pretraining for 3D Medical Image Synthesis and Analysis

【速读】:该论文旨在解决3D医学图像中合成与分析任务分离的问题,尤其是在多中心数据集存在显著风格偏移(style shift)的情况下,如何实现统一的预训练框架。关键挑战在于:不同采集站点的图像风格与解剖结构在切片层面存在共变关系,导致潜在因子难以可靠区分。解决方案的核心是提出MeDUET框架,通过在变分自编码器(VAE)隐空间中显式地将域不变内容(domain-invariant content)与域特定风格(domain-specific style)解耦,并引入token demixing机制使解耦成为可实证识别的属性;同时设计了两种新颖的代理任务——混合因子token蒸馏(Mixed-Factor Token Distillation, MFTD)和交换不变四元组对比(Swap-invariance Quadruplet Contrast, SiQC),协同增强解耦效果。这一方法成功将多源异质性转化为学习信号,实现了3D医学图像合成与分析的统一预训练。

链接: https://arxiv.org/abs/2602.17901
作者: Junkai Liu,Ling Shao,Le Zhang
机构: University of Birmingham (伯明翰大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) and diffusion models have advanced representation learning and image synthesis. However, in 3D medical imaging, they remain separate: diffusion for synthesis, SSL for analysis. Unifying 3D medical image synthesis and analysis is intuitive yet challenging, as multi-center datasets exhibit dominant style shifts, while downstream tasks rely on anatomy, and site-specific style co-varies with anatomy across slices, making factors unreliable without explicit constraints. In this paper, we propose MeDUET, a 3D Medical image Disentangled UnifiEd PreTraining framework that performs SSL in the Variational Autoencoder (VAE) latent space which explicitly disentangles domain-invariant content from domain-specific style. The token demixing mechanism serves to turn disentanglement from a modeling assumption into an empirically identifiable property. Two novel proxy tasks, Mixed-Factor Token Distillation (MFTD) and Swap-invariance Quadruplet Contrast (SiQC), are devised to synergistically enhance disentanglement. Once pretrained, MeDUET is capable of (i) delivering higher fidelity, faster convergence, and improved controllability for synthesis, and (ii) demonstrating strong domain generalization and notable label efficiency for analysis across diverse medical benchmarks. In summary, MeDUET converts multi-source heterogeneity from an obstacle into a learning signal, enabling unified pretraining for 3D medical image synthesis and analysis. The code is available at this https URL .

[CV-53] opoGate: Quality-Aware Topology-Stabilized Gated Fusion for Longitudinal Low-Dose CT New-Lesion Prediction

【速读】:该论文旨在解决纵向低剂量CT(Low-Dose CT, LDCT)随访中因噪声、重建核函数及配准质量差异导致减影图像不稳定、进而引发假新病灶报警的问题。解决方案的关键在于提出一种轻量级模型TopoGate,该模型融合外观视图(appearance view)与减影视图(subtraction view),并通过一个由三个病例特异性信号驱动的可学习质量感知门控机制(quality-aware gate)动态调节二者影响权重:即CT外观质量、配准一致性以及基于拓扑度量(topological metrics)的解剖结构稳定性。该门控机制能够根据图像质量自动调整关注重点,在噪声增加时更依赖外观信息,从而模拟放射科医生的判读习惯,提升新病灶识别的准确性和校准性能。

链接: https://arxiv.org/abs/2602.17855
作者: Seungik Cho
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Longitudinal low-dose CT follow-ups vary in noise, reconstruction kernels, and registration quality. These differences destabilize subtraction images and can trigger false new lesion alarms. We present TopoGate, a lightweight model that combines the follow-up appearance view with the subtraction view and controls their influence through a learned, quality-aware gate. The gate is driven by three case-specific signals: CT appearance quality, registration consistency, and stability of anatomical topology measured with topological metrics. On the NLST–New-Lesion–LongCT cohort comprising 152 pairs from 122 patients, TopoGate improves discrimination and calibration over single-view baselines, achieving an area under the ROC curve of 0.65 with a standard deviation of 0.05 and a Brier score of 0.14. Removing corrupted or low-quality pairs, identified by the quality scores, further increases the area under the ROC curve from 0.62 to 0.68 and reduces the Brier score from 0.14 to 0.12. The gate responds predictably to degradation, placing more weight on appearance when noise grows, which mirrors radiologist practice. The approach is simple, interpretable, and practical for reliable longitudinal LDCT triage.
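TopoGate 的门控可示意为:三个病例级质量信号经线性层加 sigmoid 得到标量门 g,再对外观视图与减影视图按 g 加权融合。以下 numpy 草图中的权重 w、b 为假设数值,仅演示机制(真实门控参数由训练学得):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(appearance, subtraction, quality_signals, w, b):
    """Blend the two views with a scalar quality-aware gate.
    quality_signals: (3,) = [appearance quality, registration
    consistency, topological stability]; w, b: gate parameters
    (hypothetical values below). Larger g -> trust appearance more.
    """
    g = sigmoid(np.dot(w, quality_signals) + b)
    return g * appearance + (1.0 - g) * subtraction, g

app = np.array([0.2, 0.8])              # appearance-view features
sub = np.array([0.6, 0.4])              # subtraction-view features
w = np.array([-2.0, -1.0, -1.0])        # worse quality -> larger gate
fused, g = gated_fusion(app, sub, np.array([0.0, 0.0, 0.0]), w, b=0.0)
print(g)  # 0.5: with neutral signals the two views blend evenly
```

当减影图像因噪声或配准退化而不可靠时,质量信号会把 g 推向 1,使模型更多依赖外观视图,这正是摘要中"mirror radiologist practice"所指的行为。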

[CV-54] Promptable segmentation with region exploration enables minimal-effort expert-level prostate cancer delineation

【速读】:该论文旨在解决前列腺癌在磁共振(MR)图像中分割不准确的问题,这一问题限制了靶向活检、冷冻消融和放疗等图像引导干预措施的疗效。传统自动化方法依赖大量专家标注数据,但标注一致性差且耗时;而人工勾画则劳动强度大、效率低。解决方案的关键在于提出一种基于用户点提示(point prompt)的交互式分割框架,融合强化学习(Reinforcement Learning, RL)与区域生长(region-growing)策略:从初始点出发生成初步分割,随后由RL代理迭代预测新点以优化掩码,奖励函数平衡分割精度与体素级不确定性,从而在保持高精度的同时显著减少用户标注工作量——实验表明其性能优于现有全自动方法,并达到放射科医生手动分割水平,同时将标注时间降低至原来的十分之一。

链接: https://arxiv.org/abs/2602.17813
作者: Junqing Yang,Natasha Thorley,Ahmed Nadeem Abbasi,Shonit Punwani,Zion Tse,Yipeng Hu,Shaheer U. Saeed
机构: Queen Mary University of London (伦敦玛丽女王大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IPCAI 2026 (IJCARS - IPCAI 2026 Special Issue)

点击查看摘要

Abstract:Purpose: Accurate segmentation of prostate cancer on magnetic resonance (MR) images is crucial for planning image-guided interventions such as targeted biopsies, cryoablation, and radiotherapy. However, subtle and variable tumour appearances, differences in imaging protocols, and limited expert availability make consistent interpretation difficult. While automated methods aim to address this, they rely on large expertly-annotated datasets that are often inconsistent, whereas manual delineation remains labour-intensive. This work aims to bridge the gap between automated and manual segmentation through a framework driven by user-provided point prompts, enabling accurate segmentation with minimal annotation effort. Methods: The framework combines reinforcement learning (RL) with a region-growing segmentation process guided by user prompts. Starting from an initial point prompt, region-growing generates a preliminary segmentation, which is iteratively refined through RL. At each step, the RL agent observes the image and current segmentation to predict a new point, from which region growing updates the mask. A reward, balancing segmentation accuracy and voxel-wise uncertainty, encourages exploration of ambiguous regions, allowing the agent to escape local optima and perform sample-specific optimisation. Despite requiring fully supervised training, the framework bridges manual and fully automated segmentation at inference by substantially reducing user effort while outperforming current fully automated methods. Results: The framework was evaluated on two public prostate MR datasets (PROMIS and PICAI, with 566 and 1090 cases). It outperformed the previous best automated methods by 9.9% and 8.9%, respectively, with performance comparable to manual radiologist segmentation, reducing annotation time tenfold. 
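框架中由点提示驱动的区域生长可以用经典的 4-邻域强度阈值生长示意:从种子点出发,把强度与种子值相差不超过阈值的相邻像素并入掩码;RL 代理随后迭代给出新的点提示以细化结果。以下 numpy 草图是该步骤的简化二维版本,阈值规则为假设:

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tol):
    """Grow a binary mask from a seed pixel, adding 4-connected
    neighbours whose intensity is within `tol` of the seed value.
    A simplified stand-in for the paper's region-growing step.
    """
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = image[seed]
    q = deque([seed])
    mask[seed] = True
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc] \
                    and abs(image[nr, nc] - seed_val) <= tol:
                mask[nr, nc] = True
                q.append((nr, nc))
    return mask

img = np.array([[1.0, 1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
m = region_grow(img, seed=(0, 0), tol=0.1)
print(m.sum())  # 3: the bright connected region around the seed
```

真实框架作用于三维 MR 体数据,且每个新点由 RL 代理基于当前分割与不确定性奖励选出,而非人工指定。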

[CV-55] Deep Learning for Dermatology: An Innovative Framework for Approaching Precise Skin Cancer Detection

【速读】:该论文旨在解决皮肤癌(Skin Cancer)早期诊断中良恶性皮肤病变分类的难题,以提升 dermatological diagnostics 的准确性与效率。其解决方案的关键在于应用两种主流深度学习模型——VGG16 和 DenseNet201——对包含 3297 张图像的二分类数据集进行训练与评估,通过优化卷积神经网络(Convolutional Neural Networks, CNN)架构实现对皮肤病变的自动判别,最终 DenseNet201 模型达到 93.79% 的最高准确率,验证了深度学习在皮肤癌辅助诊断中的有效性与潜力。

链接: https://arxiv.org/abs/2602.17797
作者: Mohammad Tahmid Noor,B. M. Shahria Alam,Tasmiah Rahman Orpa,Shaila Afroz Anika,Mahjabin Tasnim Samiha,Fahad Ahammed
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages, 9 figures, this is the author’s accepted manuscript of a paper accepted for publication in the Proceedings of the 16th International IEEE Conference on Computing, Communication and Networking Technologies (ICCCNT 2025). The final published version will be available via IEEE Xplore

点击查看摘要

Abstract:Skin cancer can be life-threatening if not diagnosed early, a prevalent yet preventable disease. Globally, skin cancer is perceived among the finest prevailing cancers and millions of people are diagnosed each year. For the classification of benign and malignant skin spots, an area of critical importance in dermatological diagnostics, this paper investigates the application of two prominent deep learning models, VGG16 and DenseNet201. We evaluate these CNN architectures for their efficacy in differentiating benign from malignant skin lesions, leveraging enhancements in deep learning applied to skin cancer detection. Our objective is to assess model accuracy and computational efficiency, offering insights into how these models could assist in early detection, diagnosis, and streamlined workflows in dermatology. We used two deep learning models, DenseNet201 and VGG16, on a binary-class dataset containing 3297 images. The best result, with an accuracy of 93.79%, was achieved by DenseNet201. All images were resized to 224x224 by rescaling. Although both models provide excellent accuracy, there is still some room for improvement. In future work, using new datasets, we aim to improve these results further.

[CV-56] Detection and Classification of Cetacean Echolocation Clicks using Image-based Object Detection Methods applied to Advanced Wavelet-based Transformations

【速读】:该论文旨在解决海洋生物声学分析中动物信号(如叫声、哨声和咔嗒声)自动检测的难题,以支持行为学研究。传统手动标注方法耗时过长,难以处理足够数据以获得可靠结果;而基础数学模型在复杂环境中(如低信噪比或区分咔嗒声与回声)表现不佳。解决方案的关键在于采用基于深度学习神经网络(DNN)的方法,特别是CLICK-SPOT模型,其将音频信号转化为图像表示(如小波变换生成的时频图),利用小波变换优于短时傅里叶变换(STFT)的时频分辨率特性,从而提升在复杂生物声学环境中的特征提取能力。

链接: https://arxiv.org/abs/2602.17749
作者: Christopher Hauer
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: My Master thesis CLICK-SPOT from 2025

点击查看摘要

Abstract:A challenge in marine bioacoustic analysis is the detection of animal signals, like calls, whistles and clicks, for behavioral studies. Manual labeling is too time-consuming to process sufficient data to get reasonable results. Thus, an automatic solution to overcome the time-consuming data analysis is necessary. Basic mathematical models can detect events in simple environments, but they struggle with complex scenarios, like differentiating signals with a low signal-to-noise ratio or distinguishing clicks from echoes. Deep Learning Neural Networks, such as ANIMAL-SPOT, are better suited for such tasks. DNNs process audio signals as image representations, often using spectrograms created by Short-Time Fourier Transform. However, spectrograms have limitations due to the uncertainty principle, which creates a tradeoff between time and frequency resolution. Alternatives like the wavelet, which provides better time resolution for high frequencies and improved frequency resolution for low frequencies, may offer advantages for feature extraction in complex bioacoustic environments. This thesis shows the efficacy of CLICK-SPOT on Norwegian Killer whale underwater recordings provided by the cetacean biologist Dr. Vester. Keywords: Bioacoustics, Deep Learning, Wavelet Transformation
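摘要强调小波变换相对 STFT 的优势:高频处时间分辨率高、低频处频率分辨率高。下面给出一个与论文实现无关的极简 Morlet 连续小波变换示意(信号、尺度与中心频率均为笔者假设的玩具设定),用于说明短促"click"类信号在匹配尺度上的能量定位:

```python
import numpy as np

def morlet_cwt(signal, scales, w0=6.0, fs=1.0):
    """极简 Morlet 连续小波"变换":对每个尺度与信号做一次卷积并取幅值。
    仅用于演示尺度-频率匹配的思想,并非论文(CLICK-SPOT)的实现。"""
    out = np.empty((len(scales), len(signal)), dtype=complex)
    t = np.arange(-len(signal) // 2, len(signal) // 2) / fs
    for i, s in enumerate(scales):
        # Morlet 母小波:复指数乘高斯包络,按 1/sqrt(s) 归一化
        wavelet = np.exp(1j * w0 * t / s) * np.exp(-(t / s) ** 2 / 2) / np.sqrt(s)
        out[i] = np.convolve(signal, wavelet, mode="same")
    return np.abs(out)

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
clicks = np.sin(2 * np.pi * 80 * t) * (np.abs(t - 0.5) < 0.01)  # 0.5s 处一个短促的 80 Hz "click"
freqs = np.array([20.0, 80.0, 200.0])
scales = 6.0 / (2 * np.pi * freqs)          # Morlet 中心频率 f ≈ w0 / (2*pi*s)
scalo = morlet_cwt(clicks, scales, fs=fs)
print(scalo.argmax(axis=1))                  # 匹配尺度(80 Hz)的能量峰应落在 click 附近
```

其中 80 Hz 对应的尺度行在 click 位置(约第 500 个样本)出现明显能量峰,而失配尺度(20/200 Hz)的响应显著更弱,这正是时频分析中"用对尺度"的直观体现。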

人工智能

[AI-0] Unifying approach to uniform expressivity of graph neural networks

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)表达能力受限的问题,特别是标准GNNs仅能对局部邻域或全局读出进行聚合,难以捕捉更复杂的子结构信息。其解决方案的关键在于提出了一种通用框架——模板图神经网络(Template GNNs, T-GNNs),通过在指定的图模板集合中聚合有效模板嵌入来更新节点特征,从而增强模型对子结构信息的建模能力;同时引入了广义的梯度模板模态逻辑(Graded Template Modal Logic, GML(T))和基于模板的双模拟关系与Weisfeiler-Leman算法,建立了T-GNNs与GML(T)之间的等价性,并统一分析了包括AC-GNN及其变体在内的多种GNN架构作为T-GNNs的具体实例。

链接: https://arxiv.org/abs/2602.18409
作者: Huan Luo,Jonni Virtema
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:The expressive power of Graph Neural Networks (GNNs) is often analysed via correspondence to the Weisfeiler-Leman (WL) algorithm and fragments of first-order logic. Standard GNNs are limited to performing aggregation over immediate neighbourhoods or over global read-outs. To increase their expressivity, recent attempts have been made to incorporate substructural information (e.g. cycle counts and subgraph properties). In this paper, we formalize this architectural trend by introducing Template GNNs (T-GNNs), a generalized framework where node features are updated by aggregating over valid template embeddings from a specified set of graph templates. We propose a corresponding logic, Graded template modal logic (GML(T)), and generalized notions of template-based bisimulation and WL algorithm. We establish an equivalence between the expressive power of T-GNNs and GML(T), and provide a unifying approach for analysing GNN expressivity: we show how standard AC-GNNs and its recent variants can be interpreted as instantiations of T-GNNs.

[AI-1] Leakage and Second-Order Dynamics Improve Hippocampal RNN Replay

【速读】:该论文旨在解决当前基于噪声递归神经网络(RNN)的内部重放(replay)机制在模拟生物神经网络(如海马体)中的轨迹重放行为时存在的局限性,特别是其采样效率低、探索能力不足以及难以实现时间压缩重放的问题。解决方案的关键在于:首先,通过理论分析证明梯度引导的重放活动具有时变特性,从而提出利用隐藏状态泄漏(hidden state leakage)来优化重放;其次,引入隐藏状态适应(hidden state adaptation,即负反馈机制)以增强探索性,但发现其会导致非马尔可夫采样并减慢重放速度;最后,提出一种基于隐藏状态动量(hidden state momentum)的时序压缩重放模型,将其与欠阻尼朗之万采样(underdamped Langevin sampling)相联系,结合适应机制可在保持探索性的前提下显著提升重放速度。

链接: https://arxiv.org/abs/2602.18401
作者: Josue Casco-Rodriguez,Nanda H. Krishna,Richard G. Baraniuk
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Biological neural networks (like the hippocampus) can internally generate “replay” resembling stimulus-driven activity. Recent computational models of replay use noisy recurrent neural networks (RNNs) trained to path-integrate. Replay in these networks has been described as Langevin sampling, but new modifiers of noisy RNN replay have surpassed this description. We re-examine noisy RNN replay as sampling to understand or improve it in three ways: (1) Under simple assumptions, we prove that the gradients replay activity should follow are time-varying and difficult to estimate, but readily motivate the use of hidden state leakage in RNNs for replay. (2) We confirm that hidden state adaptation (negative feedback) encourages exploration in replay, but show that it incurs non-Markov sampling that also slows replay. (3) We propose the first model of temporally compressed replay in noisy path-integrating RNNs through hidden state momentum, connect it to underdamped Langevin sampling, and show that, together with adaptation, it counters slowness while maintaining exploration. We verify our findings via path-integration of 2D triangular and T-maze paths and of high-dimensional paths of synthetic rat place cell activity.
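论文将"隐藏状态动量"与欠阻尼朗之万采样联系起来。下面是一个与论文 RNN 设定无关的欠阻尼朗之万采样玩具示意(目标分布取标准高斯,步长、摩擦系数等超参数均为笔者假设),用于说明"动量 + 注入噪声"如何在保持探索的同时推进采样:

```python
import numpy as np

def underdamped_langevin(grad_log_p, x0, steps=5000, dt=0.05, gamma=1.0, seed=0):
    """欠阻尼朗之万采样示意:位置 x 携带速度(动量)v,
    gamma 为摩擦系数,注入噪声维持链的探索性。非论文代码。"""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    samples = []
    for _ in range(steps):
        v += dt * (grad_log_p(x) - gamma * v) \
             + np.sqrt(2.0 * gamma * dt) * rng.standard_normal(x.shape)
        x += dt * v
        samples.append(x.copy())
    return np.array(samples)

# 目标:标准高斯,grad log p(x) = -x
traj = underdamped_langevin(lambda x: -x, np.zeros(2))
print(traj[2000:].mean(), traj[2000:].std())  # 烧入后应分别接近 0 与 1
```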

[AI-2] Learning to Tune Pure Pursuit in Autonomous Racing: Joint Lookahead and Steering-Gain Control with PPO

【速读】:该论文旨在解决纯追踪(Pure Pursuit, PP)算法在自动驾驶赛车路径跟踪中因关键参数(前瞻距离 L_d 和转向增益 g)选择不当而导致的性能敏感、难以跨赛道迁移的问题。传统基于速度的参数调度方法仅能近似调整这些参数,且无法适应不同赛道或速度剖面。解决方案的关键在于引入强化学习(Reinforcement Learning, RL),采用近端策略优化(Proximal Policy Optimization, PPO)训练一个在线策略网络,该策略以紧凑的状态特征(如速度和曲率采样值)为输入,实时输出最优的 (L_d, g) 组合,从而实现对PP控制器的动态参数调优。此方法无需针对每张地图重新调参,在仿真与实车测试中均显著优于固定前瞻距离PP、速度调度自适应PP及仅优化前瞻距离的RL变体,并在圈速、路径跟踪精度和转向平滑性上超越了运动学模型预测控制(Kinematic MPC)追踪器。

链接: https://arxiv.org/abs/2602.18386
作者: Mohamed Elgouhary,Amr S. El-Wakeel
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Pure Pursuit (PP) is widely used in autonomous racing for real-time path tracking due to its efficiency and geometric clarity, yet performance is highly sensitive to how key parameters-lookahead distance and steering gain-are chosen. Standard velocity-based schedules adjust these only approximately and often fail to transfer across tracks and speed profiles. We propose a reinforcement-learning (RL) approach that jointly chooses the lookahead Ld and a steering gain g online using Proximal Policy Optimization (PPO). The policy observes compact state features (speed and curvature taps) and outputs (Ld, g) at each control step. Trained in F1TENTH Gym and deployed in a ROS 2 stack, the policy drives PP directly (with light smoothing) and requires no per-map retuning. Across simulation and real-car tests, the proposed RL-PP controller that jointly selects (Ld, g) consistently outperforms fixed-lookahead PP, velocity-scheduled adaptive PP, and an RL lookahead-only variant, and it also exceeds a kinematic MPC raceline tracker under our evaluated settings in lap time, path-tracking accuracy, and steering smoothness, demonstrating that policy-guided parameter tuning can reliably improve classical geometry-based control.
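作为背景,纯追踪的经典几何转向律只需几行代码即可表达。论文中 (L_d, g) 由 PPO 策略在线给出;下面的示意把它们当作普通输入,轴距等数值为笔者假设,增益 g 对转向指令的作用方式也是一种常见但非论文确认的简化:

```python
import math

def pure_pursuit_steering(alpha, Ld, wheelbase=0.33, gain=1.0):
    """经典纯追踪转向律:alpha 为车辆朝向与前瞻点方向的夹角,
    Ld 为前瞻距离,gain 示意性地缩放转向指令(非论文原定义)。"""
    return gain * math.atan2(2.0 * wheelbase * math.sin(alpha), Ld)

# 同样的航向误差下,前瞻距离越长,转向越平缓
print(pure_pursuit_steering(0.2, Ld=1.0))
print(pure_pursuit_steering(0.2, Ld=3.0))
```

这也直观解释了为什么 L_d 的选取如此敏感:短前瞻跟踪精确但转向激烈,长前瞻平滑但容易切弯。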

[AI-3] FedZMG: Efficient Client-Side Optimization in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端数据非独立同分布(non-IID)导致的客户端漂移(client-drift)问题,该问题会显著降低模型收敛速度和最终性能。解决方案的关键在于提出一种名为FedZMG(Federated Zero Mean Gradients)的新颖、无需参数调整的客户端优化算法,其核心思想是通过将本地梯度投影到零均值超平面上,结构化地正则化优化空间,从而消除异构数据分布带来的梯度“强度”或“偏置”变化,且不引入额外通信开销或超参数调优。理论分析表明,FedZMG可降低有效梯度方差并提供更紧的收敛边界,实验验证其在EMNIST、CIFAR100和Shakespeare等数据集上优于标准FedAvg和自适应优化器FedAdam,尤其在高度non-IID场景下表现突出。

链接: https://arxiv.org/abs/2602.18384
作者: Fotios Zantalis,Evangelos Zervas,Grigorios Koulouras
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) enables distributed model training on edge devices while preserving data privacy. However, clients tend to have non-Independent and Identically Distributed (non-IID) data, which often leads to client-drift, and therefore diminishing convergence speed and model performance. While adaptive optimizers have been proposed to mitigate these effects, they frequently introduce computational complexity or communication overhead unsuitable for resource-constrained IoT environments. This paper introduces Federated Zero Mean Gradients (FedZMG), a novel, parameter-free, client-side optimization algorithm designed to tackle client-drift by structurally regularizing the optimization space. Advancing the idea of Gradient Centralization, FedZMG projects local gradients onto a zero-mean hyperplane, effectively neutralizing the “intensity” or “bias” shifts inherent in heterogeneous data distributions without requiring additional communication or hyperparameter tuning. A theoretical analysis is provided, proving that FedZMG reduces the effective gradient variance and guarantees tighter convergence bounds compared to standard FedAvg. Extensive empirical evaluations on EMNIST, CIFAR100, and Shakespeare datasets demonstrate that FedZMG achieves better convergence speed and final validation accuracy compared to the baseline FedAvg and the adaptive optimizer FedAdam, particularly in highly non-IID settings.
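摘要所述的零均值投影操作本身非常简单:将梯度减去其均值,使其落在零均值超平面上。下面是按摘要思想写的示意(对哪一维取零均值是笔者的示意性选择,并非作者代码):

```python
import numpy as np

def zero_mean_project(grad):
    """把梯度投影到零均值超平面:逐列减去均值(维度选择为示意)。
    投影是幂等的:再次投影不改变结果。"""
    return grad - grad.mean(axis=0, keepdims=True)

g = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 9.0]])
g_c = zero_mean_project(g)
print(g_c.sum(axis=0))  # 每列分量之和为 0
```

由于该操作无额外超参数、无通信开销,它与摘要中"parameter-free、client-side"的定位一致。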

[AI-4] Zero-shot Interactive Perception

【速读】:该论文旨在解决机器人在复杂、部分可观测场景中因遮挡和语义模糊导致的感知与操作难题,特别是如何通过物理交互(如推、抓)来提取隐藏信息并执行精准操作。解决方案的关键在于提出零样本交互感知(Zero-Shot IP, ZS-IP)框架,其核心创新包括:(1) 增强观测(Enhanced Observation, EO)模块,引入专为推动作设计的二维视觉增强——推线(pushlines),结合传统关键点提升视觉感知能力;(2) 基于记忆引导的动作模块,利用上下文检索强化语义推理;(3) 由视觉语言模型(VLM)输出驱动的机器人控制器,可执行推、拉或抓等多策略操作。该方法显著优于基于网格的增强方式,在推操作任务中表现更优,同时保持非目标物体完整性。

链接: https://arxiv.org/abs/2602.18374
作者: Venkatesh Sripada,Frank Guerin,Amir Ghalamzan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Original manuscript submitted on April 24, 2025. Timestamped and publicly available on OpenReview: this https URL

点击查看摘要

Abstract:Interactive perception (IP) enables robots to extract hidden information in their workspace and execute manipulation plans by physically interacting with objects and altering the state of the environment – crucial for resolving occlusions and ambiguity in complex, partially observable scenarios. We present Zero-Shot IP (ZS-IP), a novel framework that couples multi-strategy manipulation (pushing and grasping) with a memory-driven Vision Language Model (VLM) to guide robotic interactions and resolve semantic queries. ZS-IP integrates three key components: (1) an Enhanced Observation (EO) module that augments the VLM’s visual perception with both conventional keypoints and our proposed pushlines – a novel 2D visual augmentation tailored to pushing actions, (2) a memory-guided action module that reinforces semantic reasoning through context lookup, and (3) a robotic controller that executes pushing, pulling, or grasping based on VLM output. Unlike grid-based augmentations optimized for pick-and-place, pushlines capture affordances for contact-rich actions, substantially improving pushing performance. We evaluate ZS-IP on a 7-DOF Franka Panda arm across diverse scenes with varying occlusions and task complexities. Our experiments demonstrate that ZS-IP outperforms passive and viewpoint-based perception techniques such as Mark-Based Visual Prompting (MOKA), particularly in pushing tasks, while preserving the integrity of non-target elements.

[AI-5] JPmHC Dynamical Isometry via Orthogonal Hyper-Connections

【速读】:该论文旨在解决超连接(Hyper-Connections, HC)结构中因扩展残差流和多样化连接模式而导致的梯度路径异常、训练不稳定、可扩展性受限及内存开销增加的问题。其核心解决方案是提出JPmHC(Jacobian-spectrum Preserving manifold-constrained Hyper-Connections),通过引入一个在约束流形(如双随机、Stiefel、Grassmann流形)上可学习的线性混合器(linear mixer)替代传统的恒等映射跳跃连接,从而显式控制梯度条件并提升训练稳定性。关键创新包括:基于自由概率论的Jacobian谱分析以指导混合器选择;基于固定点投影的隐式微分机制降低激活内存与同步开销;以及利用Cayley变换实现Stiefel约束下的正交混合器,避免事后归一化。实验证明,JPmHC在ARC-AGI任务上实现了更快收敛、更高精度与更低计算成本。

链接: https://arxiv.org/abs/2602.18308
作者: Biswa Sengupta,Jinhua Wang,Leo Brunswic
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in deep learning, exemplified by Hyper-Connections (HC), have expanded the residual connection paradigm by introducing wider residual streams and diverse connectivity patterns. While these innovations yield significant performance gains, they compromise the identity mapping property of residual connections, leading to training instability, limited scalability, and increased memory overhead. To address these challenges, we propose JPmHC (Jacobian-spectrum Preserving manifold-constrained Hyper-Connections), a framework that replaces identity skips with a trainable linear mixer acting on n parallel streams while explicitly controlling gradient conditioning. By constraining the mixer M on operator-norm-bounded manifolds (e.g., bistochastic, Stiefel, Grassmann), JPmHC prevents gradient pathologies and enhances stability. JPmHC introduces three key contributions: (i) a free-probability analysis that predicts Jacobian spectra for structured skips, providing actionable design rules for mixer selection; (ii) memory-efficient implicit differentiation for fixed-point projections, reducing activation memory and synchronization overhead; and (iii) a Stiefel-constrained mixer via Cayley transforms, ensuring orthogonality without post-hoc normalization. Empirical evaluations on ARC-AGI demonstrate that JPmHC achieves faster convergence, higher accuracy, and lower computational cost compared to bistochastic baselines. As a flexible and scalable extension of HC, JPmHC advances spectrum-aware, stable, and efficient deep learning, offering insights into topological architecture design and foundational model evolution.
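摘要提到通过 Cayley 变换从斜对称矩阵构造正交混合器,从而无需事后归一化。下面用一个独立的小例子验证该性质(与 JPmHC 的具体实现无关,仅演示数学事实):

```python
import numpy as np

def cayley_orthogonal(A):
    """Cayley 变换:斜对称矩阵 A 映射为正交矩阵 M = (I - A)(I + A)^{-1}。
    A 斜对称 => I + A 必可逆,且 M 的谱范数为 1(有利于梯度条件)。"""
    n = A.shape[0]
    I = np.eye(n)
    return (I - A) @ np.linalg.inv(I + A)

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B - B.T                      # 构造斜对称矩阵
M = cayley_orthogonal(A)
print(np.allclose(M.T @ M, np.eye(4)))  # True:M 正交
```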

[AI-6] Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

【速读】:该论文试图解决语言模型解码(decoding)过程缺乏理论统一性的问题,即当前解码方法(如贪婪搜索、采样、Top-K、Top-P等)多被视为启发式调参操作,缺乏统一的优化框架。其解决方案的关键在于将解码建模为一个在概率单纯形上进行正则化优化的原理性层:在每个词元生成时,通过平衡模型得分与结构偏好及约束,求解一个带正则项的最优化问题。这一框架不仅统一解释了现有解码策略的共性结构(由最优性条件决定),还允许系统性地设计新型解码器(如Best-of-K),从而提升多样本流水线(如自洽性推理、重排序、验证器选择)中的覆盖率和性能表现。

链接: https://arxiv.org/abs/2602.18292
作者: Xiaotong Ji,Rasul Tutunov,Matthieu Zimmer,Haitham Bou-Ammar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decoding sits between a language model and everything we do with it, yet it is still treated as a heuristic knob-tuning exercise. We argue decoding should be understood as a principled optimisation layer: at each token, we solve a regularised problem over the probability simplex that trades off model score against structural preferences and constraints. This single template recovers greedy decoding, Softmax sampling, Top-K, Top-P, and Sparsemax-style sparsity as special cases, and explains their common structure through optimality conditions. More importantly, the framework makes it easy to invent new decoders without folklore. We demonstrate this by designing Best-of-K (BoK), a KL-anchored coverage objective aimed at multi-sample pipelines (self-consistency, reranking, verifier selection). BoK targets the probability of covering good alternatives within a fixed K-sample budget and improves empirical performance. We show that such samples can improve accuracy by, for example, +18.6% for Qwen2.5-Math-7B on MATH500 at high sampling temperatures.
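按照摘要的视角,Top-P(nucleus)截断可以看作把分布投影到概率单纯形的一个面上:保留累计质量达到 p 的最小词元集合并重新归一化、其余置零。下面是该操作的一个常见实现示意(非论文代码):

```python
import numpy as np

def top_p_filter(logits, p=0.9):
    """Top-P 过滤:softmax 后按概率降序累加,保留质量首次达到 p 的
    最小"核",其余词元概率归零并重新归一化。"""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    csum = np.cumsum(probs[order])
    k = int(np.searchsorted(csum, p)) + 1   # 核的最小大小
    keep = order[:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep] / probs[keep].sum()
    return out

q = top_p_filter(np.array([2.0, 1.0, 0.1, -1.0]), p=0.8)
print(q)  # 稀疏且重新归一化的分布
```

在这一框架下,Greedy、Top-K、Sparsemax 等只是同一正则化目标下不同正则项的解,这正是摘要所说的统一结构。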

[AI-7] Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

【速读】:该论文旨在解决在线多智能体强化学习(Online Multi-Agent Reinforcement Learning, MARL)中策略表达能力不足的问题,尤其是在探索与协作效率方面的瓶颈。传统方法难以在不依赖可计算似然函数的前提下实现有效的熵驱动探索,而扩散模型虽具备强大的表达能力和多模态表示潜力,却因似然不可计算限制了其在在线MARL中的应用。解决方案的关键在于提出首个基于扩散策略的在线离策略MARL框架(OMAD),其核心创新是设计了一个松弛化的策略目标函数——最大化缩放后的联合熵,从而无需依赖可计算的似然即可实现高效探索;同时,在集中训练、分散执行(CTDE)范式下引入联合分布值函数,利用可计算的熵增强目标指导扩散策略的同步更新,保障了策略间稳定协调。

链接: https://arxiv.org/abs/2602.18291
作者: Zhuoran Li,Hai Zhong,Xun Wang,Qingxin Xia,Lihua Zhang,Longbo Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose among the first Online off-policy MARL frameworks using Diffusion policies (OMAD) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across 10 diverse tasks, demonstrating a remarkable 2.5x to 5x improvement in sample efficiency.

[AI-8] PRISM: Parallel Reward Integration with Symmetry for MORL

【速读】:该论文针对异质多目标强化学习(heterogeneous Multi-Objective Reinforcement Learning, MORL)中因目标时间频率差异显著而导致的信用分配不均问题展开研究,即密集奖励会主导学习过程,而稀疏的长期奖励则难以获得有效反馈,从而导致样本效率低下。解决方案的关键在于提出一种基于反射对称性诱导偏置的并行奖励融合算法(Parallel Reward Integration with Symmetry, PRISM),其核心包括两个组成部分:一是ReSymNet,一个理论驱动的神经网络结构,通过残差块学习缩放后的机会价值(opportunity value),以缓解不同目标间的时间频率失配,加速探索同时保持最优策略;二是SymReg,一种反射等变正则化项,强制代理镜像行为并约束策略搜索于反射等变子空间,从而降低假设复杂度并提升泛化能力。该方法在MuJoCo基准测试中显著优于稀疏奖励基线和全密集奖励的“理想”模型,在帕累托前沿覆盖与分布平衡方面均有明显提升。

链接: https://arxiv.org/abs/2602.18277
作者: Finn van der Knaap,Kejiang Qian,Zheng Xu,Fengxiang He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This work studies heterogeneous Multi-Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity allows dense objectives to dominate learning, while sparse long-horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose a Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels. PRISM introduces ReSymNet, a theory-motivated model that reconciles temporal-frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection-equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse-reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100% over the baseline and up to 32% over the oracle. The code is at this https URL.
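SymReg 的核心思想是惩罚"先镜像状态再执行策略"与"先执行策略再镜像动作"之间的偏差。下面用一个玩具线性策略演示该正则项(反射算子、策略形式均为笔者假设的简化,仅用于说明等变约束如何将搜索限制在对称子空间):

```python
import numpy as np

def reflect(x):
    """示意性的反射算子:翻转第一个坐标(实际镜像取决于具体任务)。"""
    y = x.copy()
    y[..., 0] = -y[..., 0]
    return y

def sym_reg(policy, states):
    """反射等变正则示意:penalty = E|| policy(reflect(s)) - reflect(policy(s)) ||^2。
    policy 等变时惩罚为 0;破坏对称性时惩罚为正。非论文代码。"""
    a = policy(states)
    return float(np.mean((policy(reflect(states)) - reflect(a)) ** 2))

states = np.random.default_rng(0).standard_normal((8, 2))
W_equiv = np.diag([1.0, 2.0])                    # 与反射可交换的线性策略
W_break = np.array([[1.0, 0.5], [0.0, 2.0]])     # 含跨坐标耦合,破坏对称
print(sym_reg(lambda s: s @ W_equiv.T, states))  # 0:等变
print(sym_reg(lambda s: s @ W_break.T, states))  # > 0:被惩罚
```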

[AI-9] [Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在多智能体协商任务中缺乏可靠且可泛化的评估基准的问题。其解决方案的关键在于引入基于可评分游戏(Scoreable Games)的协商基准,并通过复现原始实验、扩展模型范围以及引入新的评估指标,深入分析该基准在协商质量与评估公平性方面的表现。研究揭示了该基准虽具备高度复杂性,但在模型比较中存在模糊性,且实验设计在信息泄露检测和消融研究的完整性方面存在局限,从而强调了上下文因素在模型对比评估中的重要性。

链接: https://arxiv.org/abs/2602.18230
作者: Jorge Carrasco Pollo,Ioannis Kapetangeorgis,Joshua Rosenthal,John Hua Yao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for publication at Transactions on Machine Learning Research (TMLR) and MLRC Journal Track, 2025. Code available at: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate significant potential in multi-agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, with the aim of developing a highly complex and realistic evaluation framework for LLMs. Our work investigates the reproducibility of claims in their benchmark, and provides a deeper understanding of its usability and generalizability. We replicate the original experiments on additional models, and introduce additional metrics to verify negotiation quality and evenness of evaluation. Our findings reveal that while the benchmark is indeed complex, model comparison is ambiguous, raising questions about its objectivity. Furthermore, we identify limitations in the experimental setup, particularly in information leakage detection and thoroughness of the ablation study. By examining and analyzing the behavior of a wider range of models on an extended version of the benchmark, we reveal insights that provide additional context to potential users. Our results highlight the importance of context in model-comparative evaluations.

[AI-10] SOMtime the World Aint Fair: Violating Fairness Using Self-Organizing Maps

【速读】:该论文旨在解决“公平性通过无知实现”(fairness through unawareness)假设的局限性问题,即在无监督表示学习中,即使敏感属性(如年龄、收入)未被显式输入,其仍可能作为主导潜在轴在嵌入空间中隐式显现,从而引发下游任务中的不公平风险。解决方案的关键在于使用SOMtime这一基于高容量自组织映射(Self-Organizing Maps, SOM)的拓扑保持表示方法,发现并验证了敏感属性在纯无监督嵌入中仍能被高度重构(最高Spearman相关系数达0.85),显著优于PCA、UMAP、t-SNE及自动编码器等主流方法,揭示了无监督表示层本身即存在偏见传播风险,强调必须将公平性审计扩展至机器学习流水线中的无监督组件。

链接: https://arxiv.org/abs/2602.18201
作者: Joseph Bingham,Netanel Arussy,Dvir Aran
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 2 figures, preprint

点击查看摘要

Abstract:Unsupervised representations are widely assumed to be neutral with respect to sensitive attributes when those attributes are withheld from training. We show that this assumption is false. Using SOMtime, a topology-preserving representation method based on high-capacity Self-Organizing Maps, we demonstrate that sensitive attributes such as age and income emerge as dominant latent axes in purely unsupervised embeddings, even when explicitly excluded from the input. On two large-scale real-world datasets (the World Values Survey across five countries and the Census-Income dataset), SOMtime recovers monotonic orderings aligned with withheld sensitive attributes, achieving Spearman correlations of up to 0.85, whereas PCA and UMAP typically remain below 0.23 (with a single exception reaching 0.31), and against t-SNE and autoencoders which achieve at most 0.34. Furthermore, unsupervised segmentation of SOMtime embeddings produces demographically skewed clusters, demonstrating downstream fairness risks without any supervised task. These findings establish that "fairness through unawareness" fails at the representation level for ordinal sensitive attributes and that fairness auditing must extend to unsupervised components of machine learning pipelines. We have made the code available at this https URL
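论文用 Spearman 等级相关来度量"嵌入主轴与被隐去的敏感属性的单调对齐程度"。下面是该指标的一个纯 numpy 示意实现(无并列值的简化情形;数据为笔者构造的假想示例,非论文数据或代码):

```python
import numpy as np

def spearman(x, y):
    """Spearman 等级相关(假设无并列值):对两组秩做 Pearson 相关。"""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

rng = np.random.default_rng(0)
age = rng.permutation(100).astype(float)      # 假想的敏感属性(被隐去)
axis_val = age + rng.normal(0, 10, 100)       # 与其近似单调相关的潜在轴坐标
print(spearman(age, axis_val))                # 接近 1:属性"泄露"进了嵌入轴
print(spearman(age, rng.normal(size=100)))    # 接近 0:无泄露的对照
```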

[AI-11] LERD: Latent Event-Relational Dynamics for Neurodegenerative Classification

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)相关脑电图(electroencephalography, EEG)信号分析中现有方法依赖黑箱分类器、未能显式建模潜在神经动力学机制的问题。其核心挑战在于从多通道EEG数据中直接推断出未标注的神经事件及其相互关系结构,同时保持生理合理性与可解释性。解决方案的关键是提出一种端到端的贝叶斯神经动力系统——LERD(Latent Event Relational Dynamics),它结合连续时间事件推理模块与随机事件生成过程,以捕捉灵活的时间模式,并引入基于电生理学启发的动力学先验来指导学习,从而在理论上提供可训练的上界和推断关系动态的稳定性保障。实验表明,LERD在合成基准和两个真实AD EEG队列上均显著优于强基线方法,并能生成与生理一致的潜在表征,用于刻画群体层面的动力学差异。

链接: https://arxiv.org/abs/2602.18195
作者: Hairong Chen,Yicheng Feng,Ziyu Jia,Samir Bhatt,Hengguan Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Alzheimer’s disease (AD) alters brain electrophysiology and disrupts multichannel EEG dynamics, making accurate and clinically useful EEG-based diagnosis increasingly important for screening and disease monitoring. However, many existing approaches rely on black-box classifiers and do not explicitly model the underlying dynamics that generate observed signals. To address these limitations, we propose LERD, an end-to-end Bayesian electrophysiological neural dynamical system that infers latent neural events and their relational structure directly from multichannel EEG without event or interaction annotations. LERD combines a continuous-time event inference module with a stochastic event-generation process to capture flexible temporal patterns, while incorporating an electrophysiology-inspired dynamical prior to guide learning in a principled way. We further provide theoretical analysis that yields a tractable bound for training and stability guarantees for the inferred relational dynamics. Extensive experiments on synthetic benchmarks and two real-world AD EEG cohorts demonstrate that LERD consistently outperforms strong baselines and yields physiology-aligned latent summaries that help characterize group-level dynamical differences.

[AI-12] Capabilities Aint All You Need: Measuring Propensities in AI

【速读】:该论文旨在解决当前人工智能(AI)评估中忽视模型倾向性(propensities)的问题,即模型表现出特定行为的倾向程度,而不仅限于能力(capabilities)的测量。传统基于项目反应理论(Item Response Theory, IRT)的方法将模型成功概率建模为能力与任务难度之间的单调函数,无法刻画倾向性中“过度”或“不足”均可能导致不良后果的非单调特性。其解决方案的关键在于提出首个形式化框架,采用双对数(bilogistic)函数建模模型在任务上的成功概率,明确指出当模型倾向处于一个“理想区间”(ideal band)时,成功概率最高;同时利用新开发的任务无关评分标准(task-agnostic rubrics),借助大语言模型(LLMs)估计该理想区间的边界。实证表明,该框架不仅能准确量化倾向偏移及其对任务表现的影响,且结合倾向性和能力的联合指标显著优于单独使用任一维度的预测效果。

链接: https://arxiv.org/abs/2602.18182
作者: Daniel Romero-Alvarado,Fernando Martínez-Plumed,Lorenzo Pacchiardi,Hugo Save,Siddhesh Milind Pawar,Behzad Mehrbakhsh,Pablo Antonio Moreno Casares,Ben Slater,Paolo Bova,Peter Romero,Zachary R. Tyler,Jonathan Prunty,Luning Sun,Jose Hernandez-Orallo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI evaluation has primarily focused on measuring capabilities, with formal approaches inspired from Item Response Theory (IRT) being increasingly applied. Yet propensities - the tendencies of models to exhibit particular behaviours - play a central role in determining both performance and safety outcomes. However, traditional IRT describes a model’s success on a task as a monotonic function of model capabilities and task demands, an approach unsuited to propensities, where both excess and deficiency can be problematic. Here, we introduce the first formal framework for measuring AI propensities by using a bilogistic formulation for model success, which attributes high success probability when the model’s propensity is within an “ideal band”. Further, we estimate the limits of the ideal band using LLMs equipped with newly developed task-agnostic rubrics. Applying our framework to six families of LLM models whose propensities are incited in either direction, we find that we can measure how much the propensity is shifted and what effect this has on the tasks. Critically, propensities estimated using one benchmark successfully predict behaviour on held-out tasks. Moreover, we obtain stronger predictive power when combining propensities and capabilities than either separately. More broadly, our framework showcases how rigorous propensity measurements can be conducted and how it yields gains over solely using capability evaluations to predict AI behaviour.
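双对数(bilogistic)成功模型的一种可能形式是两个 logistic 函数的乘积:倾向低于理想区间下界或高于上界都会压低成功概率,只有落在区间内时概率最高。以下实现是笔者按摘要描述构造的示意,并非论文原式:

```python
import math

def bilogistic_success(theta, low, high, s=1.0):
    """双对数成功概率示意:theta 为模型倾向,[low, high] 为理想区间,
    s 控制区间边界的陡峭程度。两侧偏离均导致概率下降(非单调于 theta)。"""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    return sig((theta - low) / s) * sig((high - theta) / s)

for t in [-4, 0, 1, 2, 6]:
    print(t, round(bilogistic_success(t, low=0.0, high=2.0), 3))
```

与传统 IRT 的单调 logistic 不同,这种形式天然刻画了"过度与不足皆有害"的倾向性结构。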

[AI-13] Can AI Lower the Barrier to Cybersecurity? A Human-Centered Mixed-Methods Study of Novice CTF Learning

【速读】:该论文旨在解决新手在参与网络安全夺旗(Capture-the-Flag, CTF)竞赛时面临的高门槛问题,这些问题主要源于复杂工具链和不透明的工作流程。传统CTF训练对初学者的认知负荷较大,限制了其学习效率与参与度。解决方案的关键在于引入以人类为中心的代理型人工智能(agentic AI)框架——即本文提出的网络安全AI(Cybersecurity AI, CAI),通过自动化协调渗透测试任务、提供结构化指导和策略建议,显著降低初始学习难度。实证研究表明,CAI不仅减少了认知负担,还促进了对元策略层面(meta-level strategies)的探索,从而提升学习效率;但同时也揭示了信任建立、依赖风险及负责任使用等新挑战,为未来人机协同的网络安全教育提供了重要方向。

链接: https://arxiv.org/abs/2602.18172
作者: Cathrin Schachner,Jasmin Wachter
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: A Preprint

点击查看摘要

Abstract:Capture-the-Flag (CTF) competitions serve as gateways into offensive cybersecurity, yet they often present steep barriers for novices due to complex toolchains and opaque workflows. Recently, agentic AI frameworks for cybersecurity promise to lower these barriers by automating and coordinating penetration testing tasks. However, their role in shaping novice learning remains underexplored. We present a human-centered, mixed-methods case study examining how agentic AI frameworks – here Cybersecurity AI (CAI) – mediate novice entry into CTF-based penetration testing. An undergraduate student without prior hacking experience attempted to approach performance benchmarks from a national cybersecurity challenge using CAI. Quantitative performance metrics were complemented by structured reflective analysis of learning progression and AI interaction patterns. Our thematic analysis suggests that agentic AI reduces initial entry barriers by providing overview, structure and guidance, thereby lowering the cognitive workload during early engagement. Quantitatively, the observed extensive exploration of strategies and low per-strategy execution time potentially facilitates cybersecurity training on meta, i.e. strategic, levels. At the same time, AI-assisted cybersecurity education introduces new challenges related to trust, dependency, and responsible use. We discuss implications for human-centered AI-supported cybersecurity education and outline open questions for future research. Comments: A Preprint Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) ACMclasses: K.6.5; D.4.6; I.2.m; K.3.1; K.3.2; K.7.4 Cite as: arXiv:2602.18172 [cs.CR] (or arXiv:2602.18172v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2602.18172

[AI-14] Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning ICLR2026

【速读】:该论文旨在解决生成式策略在从离线(offline)到在线(online)强化学习迁移过程中,因探索不足导致样本效率低下的问题。其核心挑战在于:现有方法通常将在线微调视为离线预训练的直接延续,未能有效应对在线阶段对多样化动作探索的需求。解决方案的关键在于提出一种名为FINO(Flow Matching with Injected Noise for Offline-to-Online RL)的新方法,通过在流匹配(flow matching)基础上引入噪声注入机制来增强策略训练中的探索能力,同时结合熵引导的采样策略平衡探索与利用,从而提升在线微调阶段的样本效率和性能表现。

链接: https://arxiv.org/abs/2602.18117
作者: Yongjae Shin,Jongseong Chae,Jongeui Park,Youngchul Sung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR 2026 camera-ready

点击查看摘要

Abstract:Generative models have recently demonstrated remarkable success across diverse domains, motivating their adoption as expressive policies in reinforcement learning (RL). While they have shown strong performance in offline RL, particularly where the target distribution is well defined, their extension to online fine-tuning has largely been treated as a direct continuation of offline pre-training, leaving key challenges unaddressed. In this paper, we propose Flow Matching with Injected Noise for Offline-to-Online RL (FINO), a novel method that leverages flow matching-based policies to enhance sample efficiency for offline-to-online RL. FINO facilitates effective exploration by injecting noise into policy training, thereby encouraging a broader range of actions beyond those observed in the offline dataset. In addition to exploration-enhanced flow policy training, we combine an entropy-guided sampling mechanism to balance exploration and exploitation, allowing the policy to adapt its behavior throughout online fine-tuning. Experiments across diverse, challenging tasks demonstrate that FINO consistently achieves superior performance under limited online budgets.
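
The flow-matching setup the abstract builds on can be sketched in a few lines. The noise-injection step below is only illustrative (the abstract does not specify FINO's exact perturbation scheme): perturbing dataset actions before constructing the interpolation path widens the set of actions the learned flow policy can reach during online fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline dataset of actions (synthetic stand-in).
a_data = rng.normal(size=(256, 2))

# Illustrative noise injection: perturb dataset actions before building the
# flow-matching path, encouraging actions beyond those in the offline data.
sigma = 0.2
a_target = a_data + sigma * rng.normal(size=a_data.shape)

# Standard flow-matching regression quantities on a linear interpolation path.
a0 = rng.normal(size=a_data.shape)      # base noise samples
t = rng.uniform(size=(256, 1))          # random interpolation times
a_t = (1 - t) * a0 + t * a_target       # interpolant the policy network sees
v_target = a_target - a0                # velocity target to regress

print("interpolant:", a_t.shape, "velocity target:", v_target.shape)
```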

[AI-15] Cut Less Fold More: Model Compression through the Lens of Projection Geometry ICLR2026

【速读】:该论文旨在解决神经网络在不进行微调(retraining)的前提下实现高效压缩的问题,以支持大规模部署。其核心挑战在于如何在保持模型性能的同时降低参数量,同时避免传统方法中对压缩后模型进行额外校准(calibration)的复杂性。解决方案的关键在于从投影几何的角度重新审视结构化剪枝(structured pruning)与模型折叠(model folding):前者被视为轴对齐投影,后者则通过权重聚类实现低秩投影。作者将二者形式化为正交算子,并理论证明,在秩距离为1的条件下,折叠相比剪枝能更小地重构参数误差,并在温和平滑性假设下产生更小的功能扰动。实证上,该研究在多个主流架构(如ResNet、ViT、LLaMA系列)和数据集(CIFAR-10、ImageNet-1K、C4)上验证了折叠在中高压缩率下通常优于剪枝,且无需校准即可获得更高精度,从而确立了折叠作为一种几何感知、校准自由的压缩方法,在理论和实践中均具有优势。

链接: https://arxiv.org/abs/2602.18116
作者: Olga Saukh,Dong Wang,Haris Šikić,Yun Cheng,Lothar Thiele
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Compressing neural networks without retraining is vital for deployment at scale. We study calibration-free compression through the lens of projection geometry: structured pruning is an axis-aligned projection, whereas model folding performs a low-rank projection via weight clustering. We formalize both as orthogonal operators and show that, within a rank distance of one, folding provably yields smaller parameter reconstruction error, and under mild smoothness assumptions, smaller functional perturbations than pruning. At scale, we evaluate 1000 checkpoints spanning ResNet18, PreActResNet18, ViT-B/32, and CLIP ViT-B/32 on CIFAR-10 and ImageNet-1K, covering diverse training hyperparameters (optimizers, learning rates, augmentations, regularization, sharpness-aware training), as well as multiple LLaMA-family 60M and 130M parameter models trained on C4. We show that folding typically achieves higher post-compression accuracy, with the largest gains at moderate-high compression. The gap narrows and occasionally reverses at specific training setups. Our results position folding as a geometry-aware, calibration-free alternative to pruning that is often superior in practice and principled in theory.
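
The geometric contrast between the two compression operators can be illustrated on a toy weight matrix. This is a minimal sketch, not the paper's implementation: pruning is modeled as zeroing low-norm rows (an axis-aligned projection), folding as replacing each row with its k-means cluster centroid (a low-rank projection via weight clustering).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))          # toy weight matrix (16 neurons, 8 inputs)
k = 8                                 # neuron / rank budget

# Structured pruning: keep the k rows with largest L2 norm, zero the rest
# (an axis-aligned projection of the weights).
norms = np.linalg.norm(W, axis=1)
keep = np.argsort(norms)[-k:]
W_pruned = np.zeros_like(W)
W_pruned[keep] = W[keep]

# Model folding: cluster rows into k groups and replace each row with its
# cluster centroid (a low-rank projection via weight clustering).
# A simple hand-rolled k-means is enough for illustration.
centroids = W[rng.choice(len(W), k, replace=False)]
for _ in range(50):
    assign = np.argmin(((W[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(k):
        if np.any(assign == c):
            centroids[c] = W[assign == c].mean(axis=0)
W_folded = centroids[assign]          # every row drawn from k centroids

err_prune = np.linalg.norm(W - W_pruned)
err_fold = np.linalg.norm(W - W_folded)
print(f"pruning error: {err_prune:.3f}, folding error: {err_fold:.3f}")
```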

[AI-16] MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows ICASSP2026 WWW

【速读】:该论文旨在解决语音转换(Voice Conversion, VC)中基于扩散模型和流匹配(Flow Matching)方法因迭代推理导致的转换速度慢的问题。其核心解决方案是提出一种基于均值流(Mean Flow)的一步式非并行VC模型——MeanVoiceFlow,该模型无需预训练或知识蒸馏即可从零开始训练。关键创新在于使用平均速度而非瞬时速度来更精确地计算单步推断路径上的时间积分,从而提升效率与质量;同时引入结构化边际重建损失作为零输入约束,稳定训练过程,并采用条件扩散输入训练策略(conditional diffused-input training),在训练和推理阶段均使用噪声与源数据混合输入,有效利用源信息并保持一致性。实验表明,MeanVoiceFlow在性能上可媲美多步或蒸馏训练的模型,且具备更高的实用性。

链接: https://arxiv.org/abs/2602.18104
作者: Takuhiro Kaneko,Hirokazu Kameoka,Kou Tanaka,Yuto Kondo
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted to ICASSP 2026. Project page: this https URL

点击查看摘要

Abstract:In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional speech quality and speaker similarity performances. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation. Unlike conventional flow matching that uses instantaneous velocity, mean flows employ average velocity to more accurately compute the time integral along the inference path in a single step. However, training the average velocity requires its derivative to compute the target velocity, which can cause instability. Therefore, we introduce a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input-output behavior of the model without harmful statistical averaging. Furthermore, we propose conditional diffused-input training in which a mixture of noise and source data is used as input to the model during both training and inference. This enables the model to effectively leverage source information while maintaining consistency between training and inference. Experimental results validate the effectiveness of these techniques and demonstrate that MeanVoiceFlow achieves performance comparable to that of previous multi-step and distillation-based models, even when trained from scratch. Audio samples are available at this https URL.
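
The average-velocity idea can be stated compactly. A sketch of the standard mean-flow relations, as generally formulated in the mean-flow literature (the paper's VC-specific conditioning is omitted):

```latex
% Instantaneous vs. average velocity in flow matching (sketch).
% v(z_t, t) denotes the usual flow-matching (instantaneous) velocity field;
% mean flows instead learn the average velocity over an interval [r, t]:
\[
  u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau .
\]
% One-step inference then transports a noise sample z_1 directly to data z_0:
\[
  z_0 \;=\; z_1 - u(z_1, 0, 1).
\]
% Differentiating the identity (t - r)\, u = \int_r^t v\, d\tau with respect
% to t yields the regression target for u,
\[
  v(z_t, t) \;=\; u(z_t, r, t) + (t - r)\, \frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t),
\]
% whose derivative term is the source of the training instability that the
% paper's structural margin reconstruction loss is introduced to moderate.
```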

[AI-17] Neurosymbolic Language Reasoning as Satisfiability Modulo Theory

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自然语言理解中难以可靠执行文本与逻辑推理交织任务的问题。现有神经符号系统虽能结合LLMs与求解器,但仅适用于可完全形式化的任务(如数学或程序合成),无法处理具有部分逻辑结构的自然文档。其解决方案的关键在于提出一种名为Logitext的神经符号语言,将文档表示为自然语言文本约束(Natural Language Text Constraints, NLTCs),显式表达部分逻辑结构,并开发了一种算法,将基于LLM的约束评估与满足理论模数(Satisfiability Modulo Theories, SMT)求解相结合,从而实现联合文本-逻辑推理。这一方法首次将LLM推理视为SMT理论,扩展了神经符号方法的应用边界至非完全形式化领域。

链接: https://arxiv.org/abs/2602.18095
作者: Hyunseok Oh,Sam Stern,Youngki Lee,Matthai Philipose
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Natural language understanding requires interleaving textual and logical reasoning, yet large language models often fail to perform such reasoning reliably. Existing neurosymbolic systems combine LLMs with solvers but remain limited to fully formalizable tasks such as math or program synthesis, leaving natural documents with only partial logical structure unaddressed. We introduce Logitext, a neurosymbolic language that represents documents as natural language text constraints (NLTCs), making partial logical structure explicit. We develop an algorithm that integrates LLM-based constraint evaluation with satisfiability modulo theory (SMT) solving, enabling joint textual-logical reasoning. Experiments on a new content moderation benchmark, together with LegalBench and Super-Natural Instructions, show that Logitext improves both accuracy and coverage. This work is the first that treats LLM-based reasoning as an SMT theory, extending neurosymbolic methods beyond fully formalizable domains.
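
A toy illustration of mixing LLM-judged text constraints with logical structure. Everything here is hypothetical: the constraint names and moderation policy are invented, the LLM is mocked by a lookup table, and exhaustive enumeration stands in for a real SMT solver with LLM calls as theory queries.

```python
from itertools import product

# Hypothetical natural-language text constraints (NLTCs); in Logitext an LLM
# would judge each against the document, here we mock its verdicts.
nltc = ["post contains a threat", "post targets a protected group",
        "post is satire"]
llm_verdict = {"post contains a threat": True,
               "post targets a protected group": True,
               "post is satire": False}

# Partial logical structure: moderate iff (threat AND targeted) AND NOT satire.
def policy(a, b, c):
    return (a and b) and not c

# Check (i) that some assignment of the atoms makes the policy fire at all,
# and (ii) whether the mocked LLM verdicts trigger it on this post.
satisfiable = any(policy(*assign) for assign in product([False, True], repeat=3))
fired = policy(*(llm_verdict[c] for c in nltc))
print(f"policy satisfiable: {satisfiable}, fired on this post: {fired}")
```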

[AI-18] HiAER-Spike Software-Hardware Reconfigurable Platform for Event-Driven Neuromorphic Computing at Scale

【速读】:该论文旨在解决大规模脉冲神经网络(Spiking Neural Networks, SNNs)在硬件实现中面临的计算效率、内存占用和可扩展性问题,特别是在边缘与云端部署时难以满足低延迟、高并行性和事件驱动推理的需求。其解决方案的关键在于提出了一种模块化、可重构的类脑计算平台HiAER-Spike,该平台通过软硬件协同设计,实现了基于分层地址事件路由(Hierarchical Address-Event Routing, HiAER)的高效事件流处理机制,并支持高达1.6亿神经元和400亿突触的大规模SNN运行,且速度超过实时。系统架构特别优化了稀疏连接与稀疏活动的处理能力,显著提升了内存利用率与推理效率,同时提供Python接口屏蔽底层硬件细节,便于用户灵活部署各类SNN拓扑结构,从而推动类脑计算在实际应用中的落地。

链接: https://arxiv.org/abs/2602.18072
作者: Gwenevere Frank,Gopabandhu Hota,Keli Wang,Christopher Deng,Krish Arora,Diana Vins,Abhinav Uppal,Omowuyi Olajide,Kenneth Yoshimoto,Qingbo Wang,Mari Yamaoka,Johannes Leugering,Stephen Deiss,Leif Gibb,Gert Cauwenberghs
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: Leif Gibb, Gert Cauwenberghs are equal authors. arXiv admin note: substantial text overlap with arXiv:2504.03671

点击查看摘要

Abstract:In this work, we present HiAER-Spike, a modular, reconfigurable, event-driven neuromorphic computing platform designed to execute large spiking neural networks with up to 160 million neurons and 40 billion synapses - roughly twice the neurons of a mouse brain at faster than real time. This system, assembled at the UC San Diego Supercomputer Center, comprises a co-designed hard- and software stack that is optimized for run-time massively parallel processing and hierarchical address-event routing (HiAER) of spikes while promoting memory-efficient network storage and execution. The architecture efficiently handles both sparse connectivity and sparse activity for robust and low-latency event-driven inference for both edge and cloud computing. A Python programming interface to HiAER-Spike, agnostic to hardware-level detail, shields the user from complexity in the configuration and execution of general spiking neural networks with minimal constraints in topology. The system is made easily available over a web portal for use by the wider community. In the following, we provide an overview of the hard- and software stack, explain the underlying design principles, demonstrate some of the system’s capabilities and solicit feedback from the broader neuromorphic community. Examples are shown demonstrating HiAER-Spike’s capabilities for event-driven vision on benchmark CIFAR-10, DVS event-based gesture, MNIST, and Pong tasks.

[AI-19] Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets ICLR2026

【速读】:该论文旨在解决机器人策略预训练中因高质量示范数据收集成本高昂而导致的可扩展性问题。其核心解决方案是将离线强化学习(offline reinforcement learning, offline RL)与跨形态学习(cross-embodiment learning)相结合:offline RL利用专家数据和大量次优数据,而cross-embodiment learning通过整合不同形态机器人轨迹来获取通用控制先验。关键创新在于提出基于形态相似性的静态分组策略,将机器人按结构特征聚类,并在每个组内使用组梯度更新模型,从而显著减少跨机器人形态间的梯度冲突,提升预训练效果。

链接: https://arxiv.org/abs/2602.18025
作者: Haruki Abe,Takayuki Osa,Yusuke Mukuta,Tatsuya Harada
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: ICLR 2026

点击查看摘要

Abstract:Scalable robot policy pre-training has been hindered by the high cost of collecting high-quality demonstrations for each platform. In this study, we address this issue by uniting offline reinforcement learning (offline RL) with cross-embodiment learning. Offline RL leverages both expert and abundant suboptimal data, and cross-embodiment learning aggregates heterogeneous robot trajectories across diverse morphologies to acquire universal control priors. We perform a systematic analysis of this offline RL and cross-embodiment paradigm, providing a principled understanding of its strengths and limitations. To evaluate this offline RL and cross-embodiment paradigm, we construct a suite of locomotion datasets spanning 16 distinct robot platforms. Our experiments confirm that this combined approach excels at pre-training with datasets rich in suboptimal trajectories, outperforming pure behavior cloning. However, as the proportion of suboptimal data and the number of robot types increase, we observe that conflicting gradients across morphologies begin to impede learning. To mitigate this, we introduce an embodiment-based grouping strategy in which robots are clustered by morphological similarity and the model is updated with a group gradient. This simple, static grouping substantially reduces inter-robot conflicts and outperforms existing conflict-resolution methods.
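
The embodiment-based grouping strategy can be sketched numerically. The gradients below are synthetic stand-ins for per-robot policy gradients: naive pooling across conflicting morphologies nearly cancels the update, while grouping by morphology and averaging within each group preserves it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy per-robot policy gradients for two hypothetical morphology families:
# quadrupeds roughly agree with g_quad, bipeds with g_biped.
g_quad, g_biped = np.array([1.0, 0.0]), np.array([-1.0, 0.2])
grads = {f"quad_{i}": g_quad + 0.1 * rng.normal(size=2) for i in range(4)}
grads |= {f"biped_{i}": g_biped + 0.1 * rng.normal(size=2) for i in range(4)}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Naive pooling averages conflicting gradients and nearly cancels them...
pooled = np.mean(list(grads.values()), axis=0)

# ...while embodiment-based grouping averages within each morphology cluster
# and applies one update per group, avoiding cross-group cancellation.
groups = {"quad": [v for k, v in grads.items() if k.startswith("quad")],
          "biped": [v for k, v in grads.items() if k.startswith("biped")]}
group_updates = {name: np.mean(gs, axis=0) for name, gs in groups.items()}

print("pooled norm:", round(float(np.linalg.norm(pooled)), 3))
print("cross-group cosine:",
      round(cos(group_updates["quad"], group_updates["biped"]), 3))
```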

[AI-20] Flow Actor-Critic for Offline Reinforcement Learning ICLR2026

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中数据分布复杂且具有多模态特性时,传统高斯策略难以充分建模的问题。其关键解决方案在于提出Flow Actor-Critic方法,通过引入基于流模型(Flow-based Policy)的演员-评论家架构:一方面利用流模型作为演员以表达复杂策略分布,另一方面创新性地将流模型用于保守评论家(Conservative Critic)的设计,借助流行为代理模型(flow behavior proxy model)构建新的评论家正则化项,从而有效抑制数据外区域(out-of-data regions)的Q值爆炸问题。此联合利用流模型的方式显著提升了在D4RL和OGBench等基准测试中的性能表现,达到当前最优水平。

链接: https://arxiv.org/abs/2602.18015
作者: Jongseong Chae,Jongeui Park,Yongjae Shin,Gyeongmin Kim,Seungyul Han,Youngchul Sung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026

点击查看摘要

Abstract:The dataset distributions in offline reinforcement learning (RL) often exhibit complex and multi-modal distributions, necessitating expressive policies to capture such distributions beyond widely-used Gaussian policies. To handle such complex and multi-modal datasets, in this paper, we propose Flow Actor-Critic, a new actor-critic method for offline RL, based on recent flow policies. The proposed method not only uses the flow model for actor as in previous flow policies but also exploits the expressive flow model for conservative critic acquisition to prevent Q-value explosion in out-of-data regions. To this end, we propose a new form of critic regularizer based on the flow behavior proxy model obtained as a byproduct of flow-based actor design. Leveraging the flow model in this joint way, we achieve new state-of-the-art performance for test datasets of offline RL including the D4RL and recent OGBench benchmarks.

[AI-21] PHAST: Port-Hamiltonian Architecture for Structured Temporal Dynamics Forecasting

【速读】:该论文旨在解决从部分观测(仅位置 $ q_t $,动量 $ p_t $ 隐含)出发,对耗散物理系统进行长期稳定预测并恢复具有物理意义参数的挑战。其核心解决方案是提出一种基于端口-哈密顿(port-Hamiltonian)框架的结构化时序建模方法——PHAST(Port-Hamiltonian Architecture for Structured Temporal dynamics),通过将哈密顿量分解为势能 $ V(q) $、质量矩阵 $ M(q) $ 和阻尼矩阵 $ D(q) $,并在已知(KNOWN)、部分已知(PARTIAL)和未知(UNKNOWN)三种知识域下灵活建模,结合低秩正定/半正定参数化与Strang分裂算法推进动力学演化,从而在十三个涵盖机械、电学、分子、热力学、引力及生态系统的基准任务中实现最优长期预测性能,并在提供足够结构锚点时恢复可解释的物理参数。

链接: https://arxiv.org/abs/2602.17998
作者: Shubham Bhardwaj,Chandrajit Bajaj
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)
备注: 50 pages

点击查看摘要

Abstract:Real physical systems are dissipative – a pendulum slows, a circuit loses charge to heat – and forecasting their dynamics from partial observations is a central challenge in scientific machine learning. We address the position-only (q-only) problem: given only generalized positions q_t at discrete times (momenta p_t latent), learn a structured model that (a) produces stable long-horizon forecasts and (b) recovers physically meaningful parameters when sufficient structure is provided. The port-Hamiltonian framework makes the conservative-dissipative split explicit via \dot{x} = (J - R)\nabla H(x) , guaranteeing dH/dt \le 0 when R \succeq 0 . We introduce PHAST (Port-Hamiltonian Architecture for Structured Temporal dynamics), which decomposes the Hamiltonian into potential V(q) , mass M(q) , and damping D(q) across three knowledge regimes (KNOWN, PARTIAL, UNKNOWN), uses efficient low-rank PSD/SPD parameterizations, and advances dynamics with Strang splitting. Across thirteen q-only benchmarks spanning mechanical, electrical, molecular, thermal, gravitational, and ecological systems, PHAST achieves the best long-horizon forecasting among competitive baselines and enables physically meaningful parameter recovery when the regime provides sufficient anchors. We show that identification is fundamentally ill-posed without such anchors (gauge freedom), motivating a two-axis evaluation that separates forecasting stability from identifiability.
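
A minimal sketch of the port-Hamiltonian structure and Strang splitting on the simplest dissipative system, a damped harmonic oscillator. This illustrates the framework's ingredients, not the paper's learned model: the conservative sub-flow \dot{x} = J\nabla H and the dissipative sub-flow \dot{p} = -d p are each exactly solvable, and composing them keeps dH/dt \le 0.

```python
import numpy as np

# Port-Hamiltonian damped oscillator (mass = stiffness = 1):
# x = (q, p), H(x) = (q^2 + p^2)/2, dx/dt = (J - R) grad H with
# J = [[0, 1], [-1, 0]] (conservative) and R = diag(0, d) (dissipative),
# so dH/dt = -d * p^2 <= 0. Strang splitting alternates the exactly
# solvable sub-flows: half a rotation, a full damping step, half a rotation.
def strang_step(q, p, d, dt):
    def rotate(q, p, h):                     # flow of dx/dt = J grad H
        c, s = np.cos(h), np.sin(h)
        return c * q + s * p, -s * q + c * p
    q, p = rotate(q, p, dt / 2)
    p = p * np.exp(-d * dt)                  # flow of dp/dt = -d p
    q, p = rotate(q, p, dt / 2)
    return q, p

q, p, d, dt = 1.0, 0.0, 0.1, 0.01
energies = []
for _ in range(2000):
    energies.append(0.5 * (q * q + p * p))
    q, p = strang_step(q, p, d, dt)
energies.append(0.5 * (q * q + p * p))
print(f"H(0) = {energies[0]:.3f}, H(T) = {energies[-1]:.3f}")
```

The rotation preserves H exactly and the damping step only shrinks p, so the energy trace is non-increasing by construction, which is the structural guarantee the port-Hamiltonian form buys.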

[AI-22] Turbo Connection: Reasoning as Information Flow from Higher to Lower Layers

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理复杂推理任务时受限于固定计算路径深度的问题,即Transformer架构中隐式计算路径的步数上限限制了模型的推理能力。其解决方案的关键在于提出一种名为Turbo Connection(TurboConn)的新架构,通过将每个token的高层数隐藏状态中的多个残差连接路由至下一个token的低层,从而突破固定深度约束;这种密集的反向连接机制显著提升了模型在GSM8K、Parity等多步推理任务上的性能,且无需重新训练整个模型或采用复杂的课程学习策略即可实现精度跃升。

链接: https://arxiv.org/abs/2602.17993
作者: Mohan Tang,Sidi Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Complex problems, whether in math, logic, or planning, are solved by humans through a sequence of steps where the result of one step informs the next. In this work, we adopt the perspective that the reasoning power of Transformers is fundamentally limited by a fixed maximum number of steps along any latent path of computation. To address this, we introduce Turbo Connection (TurboConn), a novel architecture that overcomes the fixed-depth constraint by routing multiple residual connections from the higher-layer hidden states of each token t to the lower layers of token t+1 . Fine-tuning pre-trained LLMs with our method not only yields accuracy gains of 0.9% to over 10% on benchmarks like GSM8K, Parity, and multi-step arithmetic, but also demonstrates that the density of these backward connections is critical; our dense interaction significantly outperforms "sparse" alternatives that only pass a single hidden state or vector. Notably, TurboConn can be integrated into pre-trained LLMs to overcome task-specific plateaus: while a fine-tuned Qwen-3-1.7B achieves only 53.78% on Parity, adding our architectural modification enables the model to reach 100% accuracy, all without retraining the full model from scratch or resorting to sophisticated curriculum learning. Our results provide strong empirical evidence that the depth of the computational path is a key factor in reasoning ability, and offer a new mechanism to enhance LLMs without significantly affecting generation latency.
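
The routing pattern described above can be sketched with plain matrix arithmetic; the sizes, the tanh layer map, and the projection matrices P are illustrative assumptions, not the paper's architecture. Each layer of token t+1 additionally receives projections of every higher layer of token t.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, T = 4, 8, 3                 # layers, hidden size, tokens (toy sizes)
W = rng.normal(size=(L, D, D)) / np.sqrt(D)     # per-layer weights
P = rng.normal(size=(L, L, D, D)) / np.sqrt(D)  # P[hi, lo]: route layer hi of
                                                # token t into layer lo of t+1
x = rng.normal(size=(T, D))                     # toy token embeddings

H = np.zeros((T, L, D))                         # hidden states H[t, layer]
for t in range(T):
    h = x[t]
    for lo in range(L):
        h = np.tanh(W[lo] @ h)
        if t > 0:                               # dense Turbo-style connections:
            for hi in range(lo + 1, L):         # every higher layer of token
                h = h + 0.1 * (P[hi, lo] @ H[t - 1, hi])  # t-1 feeds layer lo
        H[t, lo] = h
print("hidden states:", H.shape)
```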

[AI-23] WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)生成的结构化工作流(structured workflows)在自动评估中面临的挑战,即现有评估指标得分缺乏校准性,且分数变化难以反映工作流退化的严重程度。解决方案的关键在于提出 WorkflowPerturb——一个受控基准测试框架,通过在黄金工作流(golden workflows)上施加真实且可控的扰动(包括缺失步骤、压缩步骤和描述变更三类),构建了包含4,973个黄金工作流和44,757个扰动变体的数据集,并在10%、30%和50%三个严重等级下系统评估多种指标家族的敏感性和校准能力,从而实现对工作流评估分数的严重程度感知解读。

链接: https://arxiv.org/abs/2602.17990
作者: Madhav Kanda,Pedro Las-Casas,Alok Gautam Kumbhare,Rodrigo Fonseca,Sharad Agarwal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult, because metric scores are often not calibrated, and score changes do not directly communicate the severity of workflow degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics. It works by applying realistic, controlled perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals. Our results characterize systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores. Our dataset will be released upon acceptance.

[AI-24] Learning Optimal and Sample-Efficient Decision Policies with Guarantees

【速读】:该论文旨在解决在存在隐藏混杂因素(hidden confounders)的情况下,如何从离线数据集中高效学习最优决策策略的问题。核心挑战在于,传统强化学习(Reinforcement Learning, RL)依赖大量在线环境交互,而离线学习易受混杂因素导致的虚假相关性干扰,从而引发次优甚至有害的行为。解决方案的关键在于引入工具变量(instrumental variables, IVs)以识别因果效应,并将其建模为条件矩约束(conditional moment restrictions, CMR)问题;通过借鉴双重/去偏机器学习(double/debiased machine learning)的思想,提出一种具有收敛性和最优性保证的样本高效算法,显著优于现有方法。此外,论文进一步放宽了对混杂因素的假设,在模仿学习(imitation learning)场景下扩展该CMR估计器,实现了带收敛速率保证的有效策略学习,并针对线性时序逻辑(Linear Temporal Logic, LTL)表达的高层目标设计了可证明最优的学习算法,提升了样本效率。

链接: https://arxiv.org/abs/2602.17978
作者: Daqian Shao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: A thesis submitted for the degree of DPhil in Computer Science at Oxford

点击查看摘要

Abstract:The paradigm of decision-making has been revolutionised by reinforcement learning and deep learning. Although this has led to significant progress in domains such as robotics, healthcare, and finance, the use of RL in practice is challenging, particularly when learning decision policies in high-stakes applications that may require guarantees. Traditional RL algorithms rely on a large number of online interactions with the environment, which is problematic in scenarios where online interactions are costly, dangerous, or infeasible. However, learning from offline datasets is hindered by the presence of hidden confounders. Such confounders can cause spurious correlations in the dataset and can mislead the agent into taking suboptimal or adversarial actions. Firstly, we address the problem of learning from offline datasets in the presence of hidden confounders. We work with instrumental variables (IVs) to identify the causal effect, which is an instance of a conditional moment restrictions (CMR) problem. Inspired by double/debiased machine learning, we derive a sample-efficient algorithm for solving CMR problems with convergence and optimality guarantees, which outperforms state-of-the-art algorithms. Secondly, we relax the conditions on the hidden confounders in the setting of (offline) imitation learning, and adapt our CMR estimator to derive an algorithm that can learn effective imitator policies with convergence rate guarantees. Finally, we consider the problem of learning high-level objectives expressed in linear temporal logic (LTL) and develop a provably optimal learning algorithm that improves sample efficiency over existing methods. Through evaluation on reinforcement learning benchmarks and synthetic and semi-synthetic datasets, we demonstrate the usefulness of the methods developed in this thesis in real-world decision making.

[AI-25] In-Context Learning for Pure Exploration in Continuous Spaces

【速读】:该论文旨在解决连续空间中的纯探索(pure exploration)问题,即在假设空间与查询/动作空间均为连续的情况下,如何通过自适应地选择查询序列以最少的次数识别出未知的真实假设。传统方法多适用于离散假设空间,而现代应用如连续臂 bandit 中的最佳动作识别、目标区域内的 ε-球定位或未知函数最小值估计等场景均需处理连续空间中的高效探索策略。解决方案的关键在于提出 C-ICPE-TS(Continuous In-Context Pure Exploration via Thompson Sampling)算法,该算法通过元训练深度神经网络策略,直接从数据中学习将观测历史映射到下一连续查询动作和预测假设的能力,从而在推理阶段无需参数更新或显式设计信息模型即可对未见过的任务进行主动证据收集并推断真实假设,实现了可迁移的序列测试策略学习。

链接: https://arxiv.org/abs/2602.17976
作者: Alessio Russo,Yin-Ching Lee,Ryan Welch,Aldo Pacchiano
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In active sequential testing, also termed pure exploration, a learner is tasked with the goal to adaptively acquire information so as to identify an unknown ground-truth hypothesis with as few queries as possible. This problem, originally studied by Chernoff in 1959, has several applications: classical formulations include Best-Arm Identification (BAI) in bandits, where actions index hypotheses, and generalized search problems, where strategically chosen queries reveal partial information about a hidden label. In many modern settings, however, the hypothesis space is continuous and naturally coincides with the query/action space: for example, identifying an optimal action in a continuous-armed bandit, localizing an \epsilon -ball contained in a target region, or estimating the minimizer of an unknown function from a sequence of observations. In this work, we study pure exploration in such continuous spaces and introduce Continuous In-Context Pure Exploration for this regime. We introduce C-ICPE-TS, an algorithm that meta-trains deep neural policies to map observation histories to (i) the next continuous query action and (ii) a predicted hypothesis, thereby learning transferable sequential testing strategies directly from data. At inference time, C-ICPE-TS actively gathers evidence on previously unseen tasks and infers the true hypothesis without parameter updates or explicit hand-crafted information models. We validate C-ICPE-TS across a range of benchmarks, spanning continuous best-arm identification, region localization, and function minimizer identification.

[AI-26] PenTiDef: Enhancing Privacy and Robustness in Decentralized Federated Intrusion Detection Systems against Poisoning Attacks

【速读】:该论文旨在解决去中心化联邦学习入侵检测系统(Decentralized Federated Learning-based Intrusion Detection System, DFL-IDS)在数据隐私保护和抗投毒攻击方面的关键挑战,尤其是传统集中式联邦学习入侵检测系统(FL-IDS)中存在的单点故障、依赖中心聚合服务器以及对恶意更新缺乏有效检测机制的问题。其解决方案的核心在于提出PenTiDef框架,该框架结合了分布式差分隐私(Distributed Differential Privacy, DDP)以保障数据机密性,利用神经网络的潜在空间表示(Latent Space Representation, LSR)识别恶意模型更新,并通过区块链驱动的去中心化协调机制实现无需中心服务器的可信聚合与历史追踪,从而在不引入单点失效风险的前提下提升系统的鲁棒性和隐私安全性。

链接: https://arxiv.org/abs/2602.17973
作者: Phan The Duy,Nghi Hoang Khoa,Nguyen Tran Anh Quan,Luong Ha Tien,Ngo Duc Hoang Son,Van-Hau Pham
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The increasing deployment of Federated Learning (FL) in Intrusion Detection Systems (IDS) introduces new challenges related to data privacy, centralized coordination, and susceptibility to poisoning attacks. While significant research has focused on protecting traditional FL-IDS with centralized aggregation servers, there remains a notable gap in addressing the unique challenges of decentralized FL-IDS (DFL-IDS). This study aims to address the limitations of traditional centralized FL-IDS by proposing a novel defense framework tailored for the decentralized FL-IDS architecture, with a focus on privacy preservation and robustness against poisoning attacks. We propose PenTiDef, a privacy-preserving and robust defense framework for DFL-IDS, which incorporates Distributed Differential Privacy (DDP) to protect data confidentiality and utilizes latent space representations (LSR) derived from neural networks to detect malicious updates in the decentralized model aggregation context. To eliminate single points of failure and enhance trust without a centralized aggregation server, PenTiDef employs a blockchain-based decentralized coordination mechanism that manages model aggregation, tracks update history, and supports trust enforcement through smart contracts. Experimental results on CIC-IDS2018 and Edge-IIoTSet demonstrate that PenTiDef consistently outperforms existing defenses (e.g., FLARE, FedCC) across various attack scenarios and data distributions. These findings highlight the potential of PenTiDef as a scalable and secure framework for deploying DFL-based IDS in adversarial environments. By combining privacy protection, detection of malicious behavior in latent representations, and operation without a central server, it provides a practical security solution against real-world attacks from untrusted participants.

[AI-27] Optimizing Graph Causal Classification Models: Estimating Causal Effects and Addressing Confounders

【速读】:该论文旨在解决传统图神经网络(Graph Neural Networks, GNNs)在处理现实世界复杂关系数据时,因过度依赖相关性而对虚假模式和分布变化敏感的问题。其核心挑战在于:GNNs难以区分因果效应与表面关联,导致模型在干预或分布偏移下预测性能下降。解决方案的关键是提出一种因果感知的图神经网络框架 CCAGNN(Confounder-Aware Causal GNN),通过引入因果推理机制显式建模混杂因子(confounder),从而实现对真实因果结构的识别与隔离,支持反事实推理,并提升模型在实际应用场景中的鲁棒性和可解释性。

链接: https://arxiv.org/abs/2602.17941
作者: Simi Job,Xiaohui Tao,Taotao Cai,Haoran Xie,Jianming Yong,Xin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph data is becoming increasingly prevalent due to the growing demand for relational insights in AI across various domains. Organizations regularly use graph data to solve complex problems involving relationships and connections. Causal learning is especially important in this context, since it helps to understand cause-effect relationships rather than mere associations. Since many real-world systems are inherently causal, graphs can efficiently model these systems. However, traditional graph machine learning methods, including graph neural networks (GNNs), rely on correlations and are sensitive to spurious patterns and distribution changes. On the other hand, causal models enable robust predictions by isolating true causal factors, thus making them more stable under such shifts. Causal learning also helps in identifying and adjusting for confounders, ensuring that predictions reflect true causal relationships and remain accurate even under interventions. To address these challenges and build models that are robust and causally informed, we propose CCAGNN, a Confounder-Aware causal GNN framework that incorporates causal reasoning into graph learning, supporting counterfactual reasoning and providing reliable predictions in real-world settings. Comprehensive experiments on six publicly available datasets from diverse domains show that CCAGNN consistently outperforms leading state-of-the-art models.

[AI-28] Causal Neighbourhood Learning for Invariant Graph Representations

【速读】:该论文旨在解决传统图神经网络(Graph Neural Networks, GNNs)在面对噪声和虚假相关性时,难以学习真实因果结构、导致跨图泛化能力差的问题。其核心挑战在于:GNN的聚合机制易放大虚假连接,限制模型在分布偏移下的鲁棒性。解决方案的关键在于提出因果邻域学习框架(Causal Neighbourhood Learning with Graph Neural Networks, CNL-GNN),通过生成反事实邻域与基于可学习重要性掩码及注意力机制引导的自适应边扰动,识别并保留因果相关的连接,削弱虚假影响;同时结合结构级干预与因果特征解耦,学习对不同图结构具有不变性的节点表示,从而提升模型的鲁棒性和泛化性能。

链接: https://arxiv.org/abs/2602.17934
作者: Simi Job,Xiaohui Tao,Taotao Cai,Haoran Xie,Jianming Yong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph data often contain noisy and spurious correlations that mask the true causal relationships, which are essential for enabling graph models to make predictions based on the underlying causal structure of the data. Dependence on spurious connections makes it challenging for traditional Graph Neural Networks (GNNs) to generalize effectively across different graphs. Furthermore, traditional aggregation methods tend to amplify these spurious patterns, limiting model robustness under distribution shifts. To address these issues, we propose Causal Neighbourhood Learning with Graph Neural Networks (CNL-GNN), a novel framework that performs causal interventions on graph structure. CNL-GNN effectively identifies and preserves causally relevant connections and reduces spurious influences through the generation of counterfactual neighbourhoods and adaptive edge perturbation guided by learnable importance masking and an attention-based mechanism. In addition, by combining structural-level interventions with the disentanglement of causal features from confounding factors, the model learns invariant node representations that are robust and generalize well across different graph structures. Our approach improves causal graph learning beyond traditional feature-based methods, resulting in a robust classification model. Extensive experiments on four publicly available datasets, including multiple domain variants of one dataset, demonstrate that CNL-GNN outperforms state-of-the-art GNN models.

[AI-29] Memory-Based Advantage Shaping for LLM -Guided Reinforcement Learning AAAI

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在稀疏或延迟奖励环境中因样本复杂度高而难以高效学习的问题。其解决方案的关键在于构建一个记忆图(memory graph),该图编码来自大语言模型(Large Language Models, LLMs)指导和智能体自身成功轨迹的子目标(subgoals)与路径信息,并从中推导出一个效用函数(utility function),用于评估当前轨迹与历史成功策略的一致性;该效用函数被整合进优势函数(advantage function)中,作为批评者(critic)的额外引导信号,从而提升探索效率,同时仅需少量在线LLM查询,避免对持续LLM监督的依赖。

链接: https://arxiv.org/abs/2602.17931
作者: Narjes Nourzad,Carlee Joe-Wong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Association for the Advancement of Artificial Intelligence (AAAI)

点击查看摘要

Abstract:In environments with sparse or delayed rewards, reinforcement learning (RL) incurs high sample complexity due to the large number of interactions needed for learning. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. While LLMs can support exploration, frequent reliance on LLM calls raises concerns about scalability and reliability. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent’s own successful rollouts. From this graph, we derive a utility function that evaluates how closely the agent’s trajectories align with prior successful strategies. This utility shapes the advantage function, providing the critic with additional guidance without altering the reward. Our method relies primarily on offline input and only occasional online queries, avoiding dependence on continuous LLM supervision. Preliminary experiments in benchmark environments show improved sample efficiency and faster early learning compared to baseline RL methods, with final returns comparable to methods that require frequent LLM interaction.
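
The shaping mechanism can be sketched as follows; the embeddings, the cosine-similarity utility, and the coefficient beta are illustrative assumptions. The key property is that the reward and critic target are untouched: only the advantage passed to the policy update is nudged, and by at most beta.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memory of embeddings of (state, action) steps drawn from
# LLM-suggested subgoals and previously successful rollouts.
memory = rng.normal(size=(32, 6))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)

def utility(z):
    """Alignment of one step's embedding with the closest memory entry."""
    z = z / np.linalg.norm(z)
    return float(np.max(memory @ z))            # cosine similarity, in [-1, 1]

# Advantage shaping: the reward function is not modified; only the advantage
# fed to the policy gradient is adjusted by the memory-derived utility.
beta = 0.5
steps = rng.normal(size=(5, 6))                 # embeddings of current rollout
advantages = rng.normal(size=5)                 # ordinary GAE-style advantages
shaped = advantages + beta * np.array([utility(z) for z in steps])
print("shaped advantages:", np.round(shaped, 3))
```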

[AI-30] MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance ICLR’26

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在稀疏奖励或延迟奖励环境中因先验结构不足而导致的高样本复杂度问题。传统方法依赖大量交互数据才能有效学习,而引入大语言模型(Large Language Models, LLMs)虽可提供子目标分解、合理轨迹和抽象先验以加速早期学习,但持续依赖LLM实时监督会带来可扩展性限制并引入不可靠信号风险。解决方案的关键在于提出MIRA(Memory-Integrated Reinforcement Learning Agent),其核心是一个结构化且动态演化的记忆图(memory graph),该图融合了高回报经验与LLM输出,存储决策相关的信息如轨迹片段和子目标结构;通过从记忆图中提取一个效用信号(utility signal)来软性调整优势估计(advantage estimation),从而引导策略更新而不改变原始奖励函数。此机制将LLM查询成本摊销至持久记忆中,避免在线频繁调用,并在训练过程中逐步让代理策略超越初始LLM先验,同时确保理论收敛性。

链接: https://arxiv.org/abs/2602.17930
作者: Narjes Nourzad,Carlee Joe-Wong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: International Conference on Learning Representations (ICLR’26)

点击查看摘要

Abstract:Reinforcement learning (RL) agents often suffer from high sample complexity in sparse or delayed reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces scalability constraints and dependence on potentially unreliable signals. We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training. The graph stores decision-relevant information, including trajectory segments and subgoal structures, and is constructed from both the agent’s high-return experiences and LLM outputs. This design amortizes LLM queries into a persistent memory rather than requiring continuous real-time supervision. From this memory graph, we derive a utility signal that softly adjusts advantage estimation to influence policy updates without modifying the underlying reward function. As training progresses, the agent’s policy gradually surpasses the initial LLM-derived priors, and the utility term decays, preserving standard convergence guarantees. We provide theoretical analysis showing that utility-based shaping improves early-stage learning in sparse-reward environments. Empirically, MIRA outperforms RL baselines and achieves returns comparable to approaches that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries. Project webpage: this https URL

[AI-31] From Lossy to Verified: A Provenance-Aware Tiered Memory for Agents

【速读】:该论文旨在解决长时程智能体(long-horizon agents)在处理交互历史时面临的“写前查询障碍”(write-before-query barrier)问题,即在生成摘要(summary)时无法预知未来查询的具体需求,导致关键信息可能被遗漏,从而造成不可验证的缺失(unverifiable omissions)。为应对这一挑战,作者提出了一种名为TierMem的溯源关联框架,其核心创新在于构建了一个两级记忆层次结构:默认通过快速摘要索引进行检索以降低计算开销,同时引入运行时充分性路由器(runtime sufficiency router)动态判断是否需要升级到不可变的原始日志存储(raw-log store)以获取足够证据;当原始日志被用于验证后,系统会将结果作为新的带溯源链接的摘要单元写回,从而兼顾效率与可追溯性。

链接: https://arxiv.org/abs/2602.17913
作者: Qiming Zhu,Shunian Chen,Rui Yu,Zhehao Wu,Benyou Wang
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon agents often compress interaction histories into write-time summaries. This creates a fundamental write-before-query barrier: compression decisions are made before the system knows what a future query will hinge on. As a result, summaries can cause unverifiable omissions – decisive constraints (e.g., allergies) may be dropped, leaving the agent unable to justify an answer with traceable evidence. Retaining raw logs restores an authoritative source of truth, but grounding on raw logs by default is expensive: many queries are answerable from summaries, yet raw grounding still requires processing far longer contexts, inflating token consumption and latency. We propose TierMem, a provenance-linked framework that casts retrieval as an inference-time evidence allocation problem. TierMem uses a two-tier memory hierarchy to answer with the cheapest sufficient evidence: it queries a fast summary index by default, and a runtime sufficiency router escalates to an immutable raw-log store only when summary evidence is insufficient. TierMem then writes back verified findings as new summary units linked to their raw sources. On LoCoMo, TierMem achieves 0.851 accuracy (vs. 0.873 raw-only) while reducing input tokens by 54.1% and latency by 60.7%.
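摘要描述的“默认查摘要层、证据不足才升级原始日志并写回”流程,可用如下极简草图示意(数据、关键词匹配式检索与充分性判断均为本文假设,并非论文实现):

```python
def answer_with_tiers(query, summary_index, raw_log):
    """两级记忆:摘要层命中即返回;否则升级到不可变原始日志,并写回验证结果。"""
    evidence = summary_index.get(query)
    if evidence is not None:
        return evidence, "summary"
    # 升级:扫描原始日志,取包含关键词的第一条记录作为可溯源证据
    for line in raw_log:
        if query in line:
            summary_index[query] = line   # 写回:链接到原始来源的新摘要单元
            return line, "raw"
    return None, "raw"

# 假设的摘要索引与原始交互日志
summary_index = {"宠物": "用户养了一只猫"}
raw_log = ["2024-01-02 用户: 我对花生过敏", "2024-01-05 用户: 我养了一只猫"]

ans1, tier1 = answer_with_tiers("宠物", summary_index, raw_log)   # 摘要层足够
ans2, tier2 = answer_with_tiers("过敏", summary_index, raw_log)   # 首次需升级
ans3, tier3 = answer_with_tiers("过敏", summary_index, raw_log)   # 写回后走摘要层
```

第三次查询演示了写回机制的收益:同一问题只需付一次“读原始日志”的代价。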

[AI-32] Alignment in Time: Peak-Aware Orchestration for Long-Horizon Agentic Systems

【速读】:该论文旨在解决长时程任务中自主代理(autonomous agents)的对齐问题,传统AI对齐方法仅关注单个模型输出,而忽视了整个交互轨迹的稳定性与可靠性。为应对这一挑战,作者提出APEMO(Affect-aware Peak-End Modulation for Orchestration),其核心在于通过运行时调度层,在固定计算预算下优化资源分配,利用时间感知的情绪信号(temporal-affective signals)识别轨迹中的不稳定性,并针对性地在关键段落(如峰值时刻和结尾)进行修复,而非修改模型权重。该方案将对齐问题重新定义为一个时间控制问题,显著提升了轨迹级质量与可复用性,为构建高可靠性的长时程智能体系统提供了工程可行路径。

链接: https://arxiv.org/abs/2602.17910
作者: Hanjing Shi,Dominic DiFranzo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional AI alignment primarily focuses on individual model outputs; however, autonomous agents in long-horizon workflows require sustained reliability across entire interaction trajectories. We introduce APEMO (Affect-aware Peak-End Modulation for Orchestration), a runtime scheduling layer that optimizes computational allocation under fixed budgets by operationalizing temporal-affective signals. Instead of modifying model weights, APEMO detects trajectory instability through behavioral proxies and targets repairs at critical segments, such as peak moments and endings. Evaluation across multi-agent simulations and LLM-based planner–executor flows demonstrates that APEMO consistently enhances trajectory-level quality and reuse probability over structural orchestrators. Our results reframe alignment as a temporal control problem, offering a resilient engineering pathway for the development of long-horizon agentic systems.
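摘要中“在固定预算下把修复集中在峰值时刻与结尾段”的 peak-end 分配思想,可以用一个极简草图示意(质量评分、预算与“结尾必须覆盖”的规则均为本文假设的简化,并非论文算法):

```python
def allocate_repairs(quality, budget):
    """peak-end 式分配:结尾段强制覆盖,剩余预算给质量最差的“峰值”时刻。"""
    n = len(quality)
    targets = {n - 1}                      # 结尾段必须修复
    # 其余名额按质量从低到高(最不稳定处)依次分配
    for i in sorted(range(n), key=lambda i: quality[i]):
        if len(targets) >= budget:
            break
        targets.add(i)
    return sorted(targets)

# 假设某条 6 步轨迹的质量评分,预算只够修复 3 步
repairs = allocate_repairs([0.9, 0.2, 0.8, 0.1, 0.7, 0.6], budget=3)
```

与均匀分配相比,这种调度把有限算力集中在对轨迹整体观感影响最大的片段上。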

[AI-33] Machine Learning Based Prediction of Surgical Outcomes in Chronic Rhinosinusitis from Clinical Data

【速读】:该论文旨在解决慢性鼻-鼻窦炎(Chronic Rhinosinusitis, CRS)患者手术决策中因个体化预后不确定性导致的临床难题,即如何在术前精准识别可能从手术中获益的患者,从而优化医疗资源分配并改善患者结局。解决方案的关键在于利用前瞻性收集的标准化临床干预试验数据,构建基于监督式机器学习模型的预测工具,仅依赖术前特征即可实现对Sino-Nasal Outcome Test-22(SNOT-22)评分变化的准确分类预测,其最佳模型达到约85%的分类准确率,并在独立验证集上表现优于专家临床医生平均准确率(75.6%),展现出可解释性强、辅助个性化诊疗的潜力。

链接: https://arxiv.org/abs/2602.17888
作者: Sayeed Shafayet Chowdhury,Karen D’Souza,V. Siva Kakumani,Snehasis Mukhopadhyay,Shiaofen Fang,Rodney J. Schlosser,Daniel M. Beswick,Jeremiah A. Alt,Jess C. Mace,Zachary M. Soler,Timothy L. Smith,Vijay R. Ramakrishnan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has increasingly transformed medical prognostics by enabling rapid and accurate analysis across imaging and pathology. However, the investigation of machine learning predictions applied to prospectively collected, standardized data from observational clinical intervention trials remains underexplored, despite its potential to reduce costs and improve patient outcomes. Chronic rhinosinusitis (CRS), a persistent inflammatory disease of the paranasal sinuses lasting more than three months, imposes a substantial burden on quality of life (QoL) and societal cost. Although many patients respond to medical therapy, others with refractory symptoms often pursue surgical intervention. Surgical decision-making in CRS is complex, as it must weigh known procedural risks against uncertain individualized outcomes. In this study, we evaluated supervised machine learning models for predicting surgical benefit in CRS, using the Sino-Nasal Outcome Test-22 (SNOT-22) as the primary patient-reported outcome. Our prospectively collected cohort from an observational intervention trial comprised patients who all underwent surgery; we investigated whether models trained only on preoperative data could identify patients who might not have been recommended surgery prior to the procedure. Across multiple algorithms, including an ensemble approach, our best model achieved approximately 85% classification accuracy, providing accurate and interpretable predictions of surgical candidacy. Moreover, on a held-out set of 30 cases spanning mixed difficulty, our model achieved 80% accuracy, exceeding the average prediction accuracy of expert clinicians (75.6%), demonstrating its potential to augment clinical decision-making and support personalized CRS care.

[AI-34] MantisV2: Closing the Zero-Shot Gap in Time Series Classification with Synthetic Data and Test-Time Strategies

【速读】:该论文旨在解决时间序列分类任务中基础模型(foundation models)在零样本(zero-shot)场景下特征提取性能不足的问题,尤其是冻结编码器与微调编码器之间存在的显著性能差距。解决方案的关键在于三个方面:首先,提出Mantis+,一种完全在合成时间序列上预训练的模型,提升泛化能力;其次,通过受控消融实验优化架构,得到更轻量且高效的MantisV2编码器;最后,设计改进的测试时方法,利用中间层表示并优化输出token聚合策略,同时结合自集成(self-ensembling)和跨模型嵌入融合进一步提升性能。这些改进使MantisV2和Mantis+在UCR、UEA、HAR及EEG等多个基准数据集上均达到最先进的零样本分类效果。

链接: https://arxiv.org/abs/2602.17868
作者: Vasilii Feofanov,Songkang Wen,Jianfeng Zhang,Lujia Pan,Ievgen Redko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Developing foundation models for time series classification is of high practical relevance, as such models can serve as universal feature extractors for diverse downstream tasks. Although early models such as Mantis have shown the promise of this approach, a substantial performance gap remained between frozen and fine-tuned encoders. In this work, we introduce methods that significantly strengthen zero-shot feature extraction for time series. First, we introduce Mantis+, a variant of Mantis pre-trained entirely on synthetic time series. Second, through controlled ablation studies, we refine the architecture and obtain MantisV2, an improved and more lightweight encoder. Third, we propose an enhanced test-time methodology that leverages intermediate-layer representations and refines output-token aggregation. In addition, we show that performance can be further improved via self-ensembling and cross-model embedding fusion. Extensive experiments on UCR, UEA, Human Activity Recognition (HAR) benchmarks, and EEG datasets show that MantisV2 and Mantis+ consistently outperform prior time series foundation models, achieving state-of-the-art zero-shot performance.

[AI-35] Financial time series augmentation using transformer based GAN architecture

【速读】:该论文旨在解决金融时间序列数据稀缺导致深度学习预测模型训练不足与泛化能力差的问题。其关键解决方案是利用基于Transformer的生成对抗网络(TTS-GAN)生成高质量合成数据,对真实金融数据进行增强,并在此基础上训练长短期记忆网络(LSTM)模型,从而显著提升预测精度。研究通过比特币和标普500价格数据在不同预测时 horizon 上验证了该方法的有效性,同时提出了一种结合动态时间规整(DTW)与改进的深度数据差异度量(DeD-iMs)的时间序列专用质量评估指标,以可靠监控生成数据的质量和训练过程。

链接: https://arxiv.org/abs/2602.17865
作者: Andrzej Podobiński,Jarosław A. Chudziak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for the upcoming 18th International Conference on Agents and Artificial Intelligence (ICAART-2026), Marbella, Spain. The final published version will appear in the official conference proceedings

点击查看摘要

Abstract:Time-series forecasting is a critical task across many domains, from engineering to economics, where accurate predictions drive strategic decisions. However, applying advanced deep learning models in challenging, volatile domains like finance is difficult due to the limited availability and dynamic nature of financial time series data. This scarcity often results in sub-optimal model training and poor generalization. The fundamental challenge lies in determining how to reliably augment scarce financial time series data to enhance the predictive accuracy of deep learning forecasting models. Our main contribution is a demonstration of how Generative Adversarial Networks (GANs) can effectively serve as a data augmentation tool to overcome data scarcity in the financial domain. Specifically, we show that training a Long Short-Term Memory (LSTM) forecasting model on a dataset augmented with synthetic data generated by a transformer-based GAN (TTS-GAN) significantly improves the forecasting accuracy compared to using real data alone. We confirm these results across different financial time series (Bitcoin and S&P 500 price data) and various forecasting horizons. Furthermore, we propose a novel, time series specific quality metric that combines Dynamic Time Warping (DTW) and a modified Deep Dataset Dissimilarity Measure (DeD-iMs) to reliably monitor the training progress and evaluate the quality of the generated data. These findings provide compelling evidence for the benefits of GAN-based data augmentation in enhancing financial predictive capabilities.
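论文提出的质量指标用到动态时间规整(DTW)。DTW 本身是经典的 O(nm) 动态规划,这里给出纯 Python 草图(代价取绝对差;示例序列为假设数据,与论文的 DeD-iMs 组合方式无关):

```python
def dtw_distance(a, b):
    """动态时间规整距离:允许时间轴伸缩的最小累计对齐代价。"""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # 三种转移:插入、删除、匹配
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

real = [1.0, 2.0, 3.0, 2.0]
synthetic = [1.0, 2.0, 2.0, 3.0, 2.0]   # 时间轴被拉伸,但形状相同
```

逐点欧氏距离会把这两条序列判为不同,而 DTW 距离为 0,这正是它适合比较生成序列与真实序列“形状”的原因。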

[AI-36] The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)推理能力评估中面临的两大挑战:一是人工构建高难度问题成本高昂,尤其在需要博士级领域知识的基准测试中;二是难以区分模型表现的是真正推理能力还是训练数据中的过拟合。为此,作者受16世纪数学对决启发,提出“Token Games”(TTG)评估框架,其核心在于让模型通过自动生成编程谜题(即给定一个返回布尔值的Python函数,要求找到使其返回True的输入)相互挑战,并基于对抗性对弈结果计算Elo评分以实现模型间的相对比较。该方案的关键创新在于将问题生成与评估过程自动化、去中心化,从而避免了人工干预并揭示了当前模型在创造高质量谜题方面仍存在显著瓶颈,同时拓展了评估维度至创造力和任务构造能力。

链接: https://arxiv.org/abs/2602.17831
作者: Simon Henniger,Gabriel Poesia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Project website: this https URL

点击查看摘要

Abstract:Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a Python function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity’s Last Exam, without involving any human effort in creating puzzles. We also find that creating good puzzles is still a highly challenging task for current models, not measured by previous benchmarks. Overall, our work suggests new paradigms for evaluating reasoning that cannot be saturated by design, and that allow testing models for other skills like creativity and task creation alongside problem solving.
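从两两对决结果换算 Elo 评分用的是标准公式;下面是一个极简草图(K 系数与初始分取常见默认值,并非论文的具体设定):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """标准 Elo 更新:score_a 为 A 在一场对决中的得分(胜 1、平 0.5、负 0)。"""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# 假设 model_a 解出了 model_b 出的谜题,而对方未能解出 model_a 的谜题
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
```

对同分选手,一场胜利在 K=32 下恰好转移 16 分;多轮对决迭代该更新即可得到全体模型的相对排名。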

[AI-37] Ontology-Guided Neuro-Symbolic Inference: Grounding Language Models with Mathematical Domain Knowledge

【速读】:该论文旨在解决语言模型在高风险专业领域中因幻觉(hallucination)、脆弱性(brittleness)和缺乏形式化基础(lack of formal grounding)而导致的可靠性不足问题。其解决方案的关键在于构建一个神经符号(neuro-symbolic)管道,利用OpenMath领域本体(ontology)通过检索增强生成(retrieval-augmented generation, RAG)机制,将相关定义以高质量上下文注入模型提示(prompt),并结合混合检索与交叉编码器重排序(cross-encoder reranking)策略提升检索准确性,从而增强模型在数学推理任务中的可验证性和性能表现。

链接: https://arxiv.org/abs/2602.17826
作者: Marcelo Labre
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注: Submitted to NeuS 2026. Supplementary materials and code: this https URL

点击查看摘要

Abstract:Language models exhibit fundamental limitations – hallucination, brittleness, and lack of formal grounding – that are particularly problematic in high-stakes specialist fields requiring verifiable reasoning. I investigate whether formal domain ontologies can enhance language model reliability through retrieval-augmented generation. Using mathematics as proof of concept, I implement a neuro-symbolic pipeline leveraging the OpenMath ontology with hybrid retrieval and cross-encoder reranking to inject relevant definitions into model prompts. Evaluation on the MATH benchmark with three open-source models reveals that ontology-guided context improves performance when retrieval quality is high, but irrelevant context actively degrades it – highlighting both the promise and challenges of neuro-symbolic approaches.
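摘要提到“混合检索 + 交叉编码器重排序”。混合检索最常见的形式之一是稀疏(如 BM25)与稠密打分的线性融合,示意如下(文档名、打分数值与权重 alpha 均为本文假设,并非论文实现):

```python
def hybrid_rank(sparse_scores, dense_scores, alpha=0.5):
    """对两路检索打分做线性加权后排序;缺失的文档按 0 分处理。"""
    docs = set(sparse_scores) | set(dense_scores)
    fused = {
        d: alpha * sparse_scores.get(d, 0.0) + (1 - alpha) * dense_scores.get(d, 0.0)
        for d in docs
    }
    return sorted(fused, key=fused.get, reverse=True)

# 假设针对某道数学题检索 OpenMath 定义时两路检索器的打分
ranked = hybrid_rank(
    {"def_limit": 0.9, "def_group": 0.2},
    {"def_limit": 0.4, "def_derivative": 0.8},
    alpha=0.5,
)
```

融合后的 top-k 结果再交给交叉编码器精排,最终注入提示词;摘要的结论提醒我们,这一步的检索质量直接决定注入上下文是增益还是干扰。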

[AI-38] The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 代理系统(AI agent)生态复杂、发展迅速且文档记录不一致所带来的研究与政策制定障碍问题。其解决方案的关键在于构建并发布《2025年AI代理指数》(2025 AI Agent Index),该指数基于公开信息和与开发者邮件沟通,系统性地收集了30个前沿AI代理的起源、设计、能力、生态系统及安全特性等数据,并揭示了开发者在透明度方面的显著差异,尤其指出多数开发者对安全性、评估和社会影响的信息披露不足,从而为研究人员和政策制定者提供可追溯、结构化的参考框架。

链接: https://arxiv.org/abs/2602.17753
作者: Leon Staufer,Kevin Feng,Kevin Wei,Luke Bailey,Yawen Duan,Mick Yang,A. Pinar Ozisik,Stephen Casper,Noam Kolt
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic AI systems are increasingly capable of performing professional and personal tasks with limited human involvement. However, tracking these developments is difficult because the AI agent ecosystem is complex, rapidly evolving, and inconsistently documented, posing obstacles to both researchers and policymakers. To address these challenges, this paper presents the 2025 AI Agent Index. The Index documents information regarding the origins, design, capabilities, ecosystem, and safety features of 30 state-of-the-art AI agents based on publicly available information and email correspondence with developers. In addition to documenting information about individual agents, the Index illuminates broader trends in the development of agents, their capabilities, and the level of transparency of developers. Notably, we find different transparency levels among agent developers and observe that most developers share little information about safety, evaluations, and societal impacts. The 2025 AI Agent Index is available online at this https URL

[AI-39] Investigating Target Class Influence on Neural Network Compressibility for Energy-Autonomous Avian Monitoring

【速读】:该论文旨在解决野生动物监测中传统人工计数方法效率低、成本高的问题,以及现有基于机器学习的声景分析方案对复杂模型和大量计算资源的依赖。其关键解决方案是将高效的人工智能(Artificial Intelligence, AI)架构部署于低成本微控制器单元(Microcontroller Units, MCUs)上,在野外边缘设备直接进行鸟类识别,从而实现轻量化、低功耗且可能量自给的实时监测系统。通过训练和压缩不同类别目标的神经网络模型,研究评估了多物种检测任务下模型压缩率与性能损失的关系,并验证了该方案在多种硬件平台上的可行性。

链接: https://arxiv.org/abs/2602.17751
作者: Nina Brolich,Simon Geis,Maximilian Kasper,Alexander Barnhill,Axel Plinge,Dominik Seuß
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures, Funding: GreenICT@FMD (BMFTR grant 16ME0491K)

点击查看摘要

Abstract:Biodiversity loss poses a significant threat to humanity, making wildlife monitoring essential for assessing ecosystem health. Avian species are ideal subjects for this due to their popularity and the ease of identifying them through their distinctive songs. Traditional avian monitoring methods require manual counting and are therefore costly and inefficient. In passive acoustic monitoring, soundscapes are recorded over long periods of time. The recordings are analyzed to identify bird species afterwards. Machine learning methods have greatly expedited this process in a wide range of species and environments, however, existing solutions require complex models and substantial computational resources. Instead, we propose running machine learning models on inexpensive microcontroller units (MCUs) directly in the field. Due to the resulting hardware and energy constraints, efficient artificial intelligence (AI) architecture is required. In this paper, we present our method for avian monitoring on MCUs. We trained and compressed models for various numbers of target classes to assess the detection of multiple bird species on edge devices and evaluate the influence of the number of species on the compressibility of neural networks. Our results demonstrate significant compression rates with minimal performance loss. We also provide benchmarking results for different hardware platforms and evaluate the feasibility of deploying energy-autonomous devices.

[AI-40] Five Fatal Assumptions: Why T-Shirt Sizing Systematically Fails for AI Projects

【速读】:该论文旨在解决在人工智能(AI)开发项目中,传统敏捷估算方法(如T-shirt sizing)因依赖于五个基础假设而产生系统性误判的问题。这些假设包括线性努力扩展、以往经验的可重复性、工作量与工期的可互换性、任务的可分解性以及完成标准的确定性,但在涉及大语言模型(LLM)和多智能体系统(multi-agent systems)等复杂AI场景下往往失效。解决方案的关键在于提出“检查点估算”(Checkpoint Sizing),这是一种以人类为中心、迭代式的估算方法,通过设置明确的决策节点,在开发过程中基于实际学习结果而非初始假设重新评估范围与可行性,从而应对AI开发中的非线性性能跃迁、复杂交互界面和高度耦合性带来的不确定性。

链接: https://arxiv.org/abs/2602.17734
作者: Raja Soundaramourty,Ozkan Kilic,Ramu Chenchaiah
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agile estimation techniques, particularly T-shirt sizing, are widely used in software development for their simplicity and utility in scoping work. However, when we apply these methods to artificial intelligence initiatives – especially those involving large language models (LLMs) and multi-agent systems – the results can be systematically misleading. This paper shares an evidence-backed analysis of five foundational assumptions we often make during T-shirt sizing. While these assumptions usually hold true for traditional software, they tend to fail in AI contexts: (1) linear effort scaling, (2) repeatability from prior experience, (3) effort-duration fungibility, (4) task decomposability, and (5) deterministic completion criteria. Drawing on recent research into multi-agent system failures, scaling principles, and the inherent unreliability of multi-turn conversations, we show how AI development breaks these rules. We see this through non-linear performance jumps, complex interaction surfaces, and “tight coupling” where a small change in data cascades through the entire stack. To help teams navigate this, we propose Checkpoint Sizing: a more human-centric, iterative approach that uses explicit decision gates where scope and feasibility are reassessed based on what we learn during development, rather than what we assumed at the start. This paper is intended for engineering managers, technical leads, and product owners responsible for planning and delivering AI initiatives.

[AI-41] “Everyones using it but no one is allowed to talk about it”: College Students Experiences Navigating the Higher Education Environment in a Generative AI World

【速读】:该论文试图解决的问题是:高等教育机构在生成式 AI(Generative AI)日益普及的背景下,其现行制度与政策未能有效适应学生使用 AI 的实践,导致学生在环境压力和社会规范影响下产生非合规行为,进而削弱学习效果。解决方案的关键在于认识到学生使用 AI 是一种情境化(situated)实践,需从制度、教学和工具设计三个层面协同改进:首先,制定清晰、一致且具指导性的 AI 使用政策;其次,教师应主动引导学生将 AI 作为学习工具而非替代品;最后,系统设计者需开发支持价值导向自我调节的工具,缩小学生意图与行为之间的差距,从而真正实现 AI 对学习的有效赋能。

链接: https://arxiv.org/abs/2602.17720
作者: Yue Fu,Yifan Lin,Yessica Wang,Sarah Tran,Alexis Hiniker
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Higher education students are increasingly using generative AI in their academic work. However, existing institutional practices have not yet adapted to this shift. Through semi-structured interviews with 23 college students, our study examines the environmental and social factors that influence students' use of AI. Findings show that institutional pressure factors like deadlines, exam cycles, and grading lead students to engage with AI even when they think it undermines their learning. Social influences, particularly peer micro-communities, establish de-facto AI norms regardless of official AI policies. Campus-wide "AI shame" is prevalent, often pushing AI use underground. Current institutional AI policies are perceived as generic, inconsistent, and confusing, resulting in routine noncompliance. Additionally, students develop value-based self-regulation strategies, but environmental pressures create a gap between students' intentions and their behaviors. Our findings show student AI use to be a situated practice, and we discuss implications for institutions, instructors, and system tool designers to effectively support student learning with AI.

[AI-42] MIDAS: Mosaic Input-Specific Differentiable Architecture Search

【速读】:该论文旨在解决可微分神经架构搜索(Differentiable Neural Architecture Search, DNAS)在实际应用中采纳率低的问题,核心挑战在于静态架构参数难以适应输入多样性,导致搜索效率与性能受限。解决方案的关键在于提出MIDAS方法,其创新性地将静态架构参数替换为通过自注意力机制计算的、依赖于输入的动态参数,并引入两个关键改进:(i) 在激活图的每个空间块上独立进行架构选择以增强局部适应性;(ii) 设计一种无参数且拓扑感知的搜索空间,显式建模节点连接关系并简化每节点的两路输入边选择过程。这一设计显著提升了搜索的鲁棒性和准确性,在多个基准数据集和搜索空间上均实现了最优或领先性能。

链接: https://arxiv.org/abs/2602.17700
作者: Konstanty Subbotko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Differentiable Neural Architecture Search (NAS) provides efficient, gradient-based methods for automatically designing neural networks, yet its adoption remains limited in practice. We present MIDAS, a novel approach that modernizes DARTS by replacing static architecture parameters with dynamic, input-specific parameters computed via self-attention. To improve robustness, MIDAS (i) localizes the architecture selection by computing it separately for each spatial patch of the activation map, and (ii) introduces a parameter-free, topology-aware search space that models node connectivity and simplifies selecting the two incoming edges per node. We evaluate MIDAS on the DARTS, NAS-Bench-201, and RDARTS search spaces. In DARTS, it reaches 97.42% top-1 on CIFAR-10 and 83.38% on CIFAR-100. In NAS-Bench-201, it consistently finds globally optimal architectures. In RDARTS, it sets the state of the art on two of four search spaces on CIFAR-10. We further analyze why MIDAS works, showing that patchwise attention improves discrimination among candidate operations, and the resulting input-specific parameter distributions are class-aware and predominantly unimodal, providing reliable guidance for decoding.
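MIDAS 的核心是把静态架构参数换成按空间 patch 独立计算的输入相关权重,最后对每个 patch 的候选操作打分做 softmax。下面的纯 Python 草图只演示这最后一步(打分数值与候选操作集合均为假设,注意力打分的来源未在此实现):

```python
import math

def patchwise_op_weights(scores):
    """对每个 patch 的候选操作打分独立做 softmax,得到输入相关的架构权重。"""
    out = []
    for patch_scores in scores:
        m = max(patch_scores)                       # 减最大值保证数值稳定
        exps = [math.exp(s - m) for s in patch_scores]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# 假设 2 个 patch、3 个候选操作(如 conv3x3 / skip / pool)的自注意力打分
w = patchwise_op_weights([[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
```

两个 patch 得到不同的操作分布,这正是“同一输入的不同区域可以偏好不同操作”的局部化架构选择。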

[AI-43] ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)后训练权重量化中面临的低精度(低于4比特)压缩难题,核心挑战在于权重敏感性高度非均匀以及缺乏合理的精度分配机制。现有方法要么采用细粒度混合精度导致运行时开销高,要么依赖启发式或受限的精度分配策略。其解决方案的关键在于提出ScaleBITS框架,通过一种新的敏感性分析引导硬件对齐的块级权重分区方案(基于双向通道重排序),并将全局比特位宽分配建模为带约束的优化问题,进而设计了一种可扩展的贪心算法近似方法,实现端到端的、无需额外运行时开销的精细化比特位宽自动分配。

链接: https://arxiv.org/abs/2602.17698
作者: Xinlin Li,Timothy Chou,Josh Fromm,Zichang Liu,Yunjie Pan,Christina Fragouli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-training weight quantization is crucial for reducing the memory and inference cost of large language models (LLMs), yet pushing the average precision below 4 bits remains challenging due to highly non-uniform weight sensitivity and the lack of principled precision allocation. Existing solutions use irregular fine-grained mixed-precision with high runtime overhead or rely on heuristics or highly constrained precision allocation strategies. In this work, we propose ScaleBITS, a mixed-precision quantization framework that enables automated, fine-grained bitwidth allocation under a memory budget while preserving hardware efficiency. Guided by a new sensitivity analysis, we introduce a hardware-aligned, block-wise weight partitioning scheme, powered by bi-directional channel reordering. We formulate global bitwidth allocation as a constrained optimization problem and develop a scalable approximation to the greedy algorithm, enabling end-to-end principled allocation. Experiments show that ScaleBITS significantly improves over uniform-precision quantization (up to +36%) and outperforms state-of-the-art sensitivity-aware baselines (up to +13%) in ultra-low-bit regime, without adding runtime overhead.
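摘要把全局位宽分配表述为带内存预算约束的优化问题,并用可扩展的贪心算法近似。一个极简的贪心草图如下(敏感度数值、预算与候选位宽均为本文假设,并非论文的算法细节):

```python
def allocate_bitwidths(sensitivity, budget_bits, choices=(2, 3, 4)):
    """贪心近似:所有块先取最低位宽,再把剩余预算逐位分给敏感度最高的块。"""
    n = len(sensitivity)
    bits = [min(choices)] * n
    used = sum(bits)
    while used < budget_bits:
        # 只考虑尚未到达最高位宽的块
        candidates = [i for i in range(n) if bits[i] < max(choices)]
        if not candidates:
            break
        best = max(candidates, key=lambda i: sensitivity[i])
        bits[best] += 1
        used += 1
    return bits

# 假设 4 个权重块的敏感度,预算 11 bit(平均 2.75 bit/块)
bits = allocate_bitwidths([0.9, 0.1, 0.5, 0.2], budget_bits=11)
```

敏感度最高的块被抬到 4 bit,最不敏感的块停留在 2 bit,总和恰好贴着预算。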

[AI-44] Can LLM Safety Be Ensured by Constraining Parameter Regions?

【速读】:该论文旨在解决当前生成式 AI(Generative AI)模型中“安全区域”(safety regions)识别不稳定的难题,即如何可靠地定位对模型安全性行为具有直接影响的参数子集。其解决方案的关键在于系统性评估四种不同参数粒度的安全区域识别方法(从单个权重到整个Transformer层),并在多个大型语言模型(LLM)家族及多种安全识别数据集上进行验证。研究发现,现有方法识别出的安全区域在不同数据集间重叠度低至中等,且在引入非有害查询的效用数据集进一步细化后,重叠显著下降,表明当前技术尚无法稳定识别与数据集无关的通用安全区域。

链接: https://arxiv.org/abs/2602.17696
作者: Zongmin Li,Jian Su,Farah Benamara,Aixin Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 32 pages

点击查看摘要

Abstract:Large language models (LLMs) are often assumed to contain "safety regions" – parameter subsets whose modification directly influences safety behaviors. We conduct a systematic evaluation of four safety region identification methods spanning different parameter granularities, from individual weights to entire Transformer layers, across four families of backbone LLMs with varying sizes. Using ten safety identification datasets, we find that the identified safety regions exhibit only low to moderate overlap, as measured by IoU. The overlap drops significantly when the safety regions are further refined using utility datasets (i.e., non-harmful queries). These results suggest that current techniques fail to reliably identify a stable, dataset-agnostic safety region.
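文中用 IoU(交并比)度量不同方法识别出的“安全区域”(参数索引集合)的重叠度;其计算可示意如下(索引集合为假设数据):

```python
def region_iou(region_a, region_b):
    """两个参数索引集合的交并比;两者皆空时约定为 1.0。"""
    a, b = set(region_a), set(region_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# 假设两种方法在同一模型上识别出的安全参数索引
iou = region_iou({1, 2, 3, 4}, {3, 4, 5, 6})
```

本例 IoU 为 2/6 ≈ 0.33,对应文中所说的“低到中等重叠”量级。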

[AI-45] AsynDBT: Asynchronous Distributed Bilevel Tuning for efficient In-Context Learning with Large Language Models

【速读】:该论文旨在解决云部署的大语言模型(Large Language Models, LLMs)在实际应用中因参数和梯度不可见而导致的提示词(prompt)优化成本高、效率低的问题,以及当前基于上下文学习(In-Context Learning, ICL)方法受限于高质量数据稀缺且难以共享的挑战。解决方案的关键在于提出一种异步分布式双层微调算法(Asynchronous Distributed Bilevel Tuning, AsynDBT),该算法通过利用LLM反馈同时优化上下文学习样本与提示片段(prompt fragments),从而提升下游任务性能;其分布式架构不仅增强了对异构计算环境的适应性,还保障了数据隐私,有效缓解了传统联邦学习(Federated Learning, FL)中因设备延迟(straggler问题)和非独立同分布(non-IID)数据带来的训练不稳定问题。

链接: https://arxiv.org/abs/2602.17694
作者: Hui Ma,Shaoyu Dou,Ya Liu,Fei Xing,Li Feng,Feng Pi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted in Scientific Reports

点击查看摘要

Abstract:With the rapid development of large language models (LLMs), an increasing number of applications leverage cloud-based LLM APIs to reduce usage costs. However, since cloud-based models' parameters and gradients are inaccessible, users must adjust prompts manually or with heuristic algorithms to steer LLM outputs, which requires costly optimization procedures. In-context learning (ICL) has recently emerged as a promising paradigm that enables LLMs to adapt to new tasks using examples provided within the input, eliminating the need for parameter updates. Nevertheless, the advancement of ICL is often hindered by the lack of high-quality data, which is often sensitive and difficult to share. Federated learning (FL) offers a potential solution by enabling collaborative training of distributed LLMs while preserving data privacy. Despite this promise, previous FL approaches that incorporate ICL have struggled with severe straggler problems and challenges associated with heterogeneous non-identically distributed data. To address these problems, we propose an asynchronous distributed bilevel tuning (AsynDBT) algorithm that optimizes both in-context learning samples and prompt fragments based on the feedback from the LLM, thereby enhancing downstream task performance. Benefiting from its distributed architecture, AsynDBT provides privacy protection and adaptability to heterogeneous computing environments. Furthermore, we present a theoretical analysis establishing the convergence guarantees of the proposed algorithm. Extensive experiments conducted on multiple benchmark datasets demonstrate the effectiveness and efficiency of AsynDBT.

[AI-46] Agentic Unlearning: When LLM Agent Meets Machine Unlearning

【速读】:该论文旨在解决智能体(agent)中敏感信息难以彻底清除的问题,特别是现有去学习(unlearning)方法仅针对模型参数,忽略了持久记忆(persistent memory)路径,导致两个关键缺陷:(i) 参数-记忆回流(parameter-memory backflow),即检索操作可能重新激活参数中的残留信息或记忆中的敏感内容再次出现;(ii) 缺乏统一的去学习策略来同时覆盖参数与记忆两条路径。解决方案的核心是提出同步回流去学习(Synchronized Backflow Unlearning, SBU)框架,其通过双路径协同机制实现联合去学习:记忆路径基于依赖闭包(dependency closure)进行实体剪枝并逻辑上失效共享 artifacts,参数路径采用随机参考对齐(stochastic reference alignment)引导输出趋向高熵先验分布;两条路径通过同步双更新协议集成,形成闭环机制,使记忆去学习与参数抑制相互强化,从而有效防止跨路径再污染。

链接: https://arxiv.org/abs/2602.17692
作者: Bin Wang,Fan Wang,Pingping Wang,Jinyu Cong,Yang Yu,Yilong Yin,Zhongyi Han,Benzheng Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures, 6 tables

点击查看摘要

Abstract:In this paper, we introduce agentic unlearning, which removes specified information from both model parameters and persistent memory in agents with closed-loop interaction. Existing unlearning methods target parameters alone, leaving two critical gaps: (i) parameter-memory backflow, where retrieval reactivates parametric remnants or memory artifacts reintroduce sensitive content, and (ii) the absence of a unified strategy that covers both parameter and memory pathways. We present Synchronized Backflow Unlearning (SBU), a framework that unlearns jointly across parameter and memory pathways. The memory pathway performs dependency closure-based unlearning that prunes isolated entities while logically invalidating shared artifacts. The parameter pathway employs stochastic reference alignment to guide model outputs toward a high-entropy prior. These pathways are integrated via a synchronized dual-update protocol, forming a closed-loop mechanism where memory unlearning and parametric suppression reinforce each other to prevent cross-pathway recontamination. Experiments on medical QA benchmarks show that SBU reduces traces of targeted private information across both pathways with limited degradation on retained data.

[AI-47] Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

【速读】:该论文旨在解决将大语言模型(Large Language Models, LLMs)中的链式思维(Chain-of-Thought, CoT)推理能力蒸馏到小型学生模型时面临的容量不匹配问题:教师模型的推理过程通常过于冗长,导致学生模型难以忠实复现。解决方案的关键在于提出一个三阶段课程学习框架,通过渐进式技能习得实现高效知识迁移:首先通过掩码乱序重建任务建立结构理解;其次在掩码补全任务上应用组相对策略优化(Group Relative Policy Optimization, GRPO),使学生模型自主平衡准确率与简洁性;最后识别持续失败案例并通过针对性重写引导学生内化教师知识,再次使用GRPO进行优化。该方法在GSM8K基准上使Qwen2.5-3B-Base模型准确率提升11.29%,同时输出长度减少27.4%,优于指令微调版本及现有蒸馏方法。

链接: https://arxiv.org/abs/2602.17686
作者: Bowen Yu,Maolin Wang,Sheng Zhang,Binhao Wang,Yi Wen,Jingtong Gao,Bowen Liu,Zimo Zhao,Wanyu Wang,Xiangyu Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 12 figures

点击查看摘要

Abstract:Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce. Existing approaches often compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) on masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO. Experiments on GSM8K demonstrate that our approach enables Qwen2.5-3B-Base to achieve an 11.29 percent accuracy improvement while reducing output length by 27.4 percent, surpassing both instruction-tuned variants and prior distillation methods.

[AI-48] CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

【速读】:该论文旨在解决强化学习从可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)在代码生成任务中因依赖高质量测试用例而面临的可扩展性瓶颈问题。现有方法受限于测试用例的可用性和可靠性,难以在无测试场景下进行训练或推理。解决方案的关键在于提出CodeScaler——一种无需执行即可提供奖励信号的模型,其核心创新包括:基于已验证代码问题构建的偏好数据集、语法感知的代码提取机制以及保持有效性的奖励塑造策略,从而实现训练和推理阶段的高效扩展,并在多个基准上显著优于传统基于执行的强化学习方法。

链接: https://arxiv.org/abs/2602.17684
作者: Xiao Zhu,Xinyu Zhou,Boyu Zhu,Hanxu Hu,Mingzhe Du,Haotian Zhang,Huiming Wang,Zhijiang Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).

[AI-49] Mind the Boundary: Stabilizing Gemini Enterprise A2A via a Cloud Run Hub Across Projects and Accounts

【速读】:该论文旨在解决企业级对话式用户界面(Conversational UI)在跨项目与账户边界调用异构后端代理(Agent)和工具时,如何实现安全、可复现的协同调度问题。其核心挑战在于:不仅需满足协议兼容性,还需应对生成式 AI (Generative AI) 用户界面的输入输出约束及边界依赖的身份认证机制。解决方案的关键在于构建一个部署于 Cloud Run 的 Agent-to-Agent (A2A) 中心枢纽(Hub),通过四条路径实现差异化路由——包括公共 A2A 代理、IAM 保护的跨账户代理、基于 Discovery Engine 和 Vertex AI Search 的检索增强生成路径,以及通用问答路径;同时强制 JSON-RPC 端点采用纯文本兼容模式,将结构化输出与调试信号分离至独立的 REST 工具 API,从而避免因混合结构化数据导致的 UI 错误,确保请求的确定性路由与稳定响应。

链接: https://arxiv.org/abs/2602.17675
作者: Takao Morita
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 7 pages. Implementation and evaluation study of cross-boundary agent orchestration for Gemini Enterprise UI

点击查看摘要

Abstract:Enterprise conversational UIs increasingly need to orchestrate heterogeneous backend agents and tools across project and account boundaries in a secure and reproducible way. Starting from Gemini Enterprise Agent-to-Agent (A2A) invocation, we implement an A2A Hub orchestrator on Cloud Run that routes queries to four paths: a public A2A agent deployed in a different project, an IAM-protected Cloud Run A2A agent in a different account, a retrieval-augmented generation path combining Discovery Engine and Vertex AI Search with direct retrieval of source text from Google Cloud Storage, and a general question answering path via Vertex AI. We show that practical interoperability is governed not only by protocol compliance but also by Gemini Enterprise UI constraints and boundary-dependent authentication. Real UI requests arrive as text-only inputs and include empty accepted output mode lists, so mixing structured data into JSON-RPC responses can trigger UI errors. To address this, we enforce a text-only compatibility mode on the JSON-RPC endpoint while separating structured outputs and debugging signals into a REST tool API. On a four-query benchmark spanning expense policy, project management assistance, general knowledge, and incident response deadline extraction, we confirm deterministic routing and stable UI responses. For the retrieval path, granting storage object read permissions enables evidence-backed extraction of the fifteen minute deadline. All experiments are reproducible using the repository snapshot tagged a2a-hub-gemini-ui-stable-paper.

[AI-50] Trojans in Artificial Intelligence (TrojAI) Final Report

【速读】:该论文旨在解决现代人工智能系统中日益凸显的AI后门(AI Trojan)安全威胁问题,即恶意攻击者在模型训练阶段植入隐蔽的恶意触发机制,导致模型在特定输入下产生异常行为或被远程操控。解决方案的关键在于通过权重分析(weight analysis)和触发器逆向(trigger inversion)等方法实现对AI Trojans的有效检测,并提出针对已部署模型的风险缓解策略。研究还通过大规模测试验证了检测方法的性能与灵敏度,揭示了“自然”存在的Trojan现象,为AI安全领域的持续研究提供了重要实践依据与技术路径。

链接: https://arxiv.org/abs/2602.07152
作者: Kristopher W. Reese,Taylor Kulp-McDowall,Michael Majurski,Tim Blattner,Derek Juba,Peter Bajcsy,Antonio Cardone,Philippe Dessauw,Alden Dima,Anthony J. Kearsley,Melinda Kleczynski,Joel Vasanth,Walid Keyrouz,Chace Ashcraft,Neil Fendley,Ted Staley,Trevor Stout,Josh Carney,Greg Canal,Will Redman,Aurora Schmidt,Cameron Hickert,William Paul,Jared Markowitz,Nathan Drenkow,David Shriver,Marissa Connor,Keltin Grimes,Marco Christiani,Hayden Moore,Jordan Widjaja,Kasimir Gabert,Uma Balakrishnan,Satyanadh Gundimada,John Jacobellis,Sandya Lakkur,Vitus Leung,Jon Roose,Casey Battaglino,Farinaz Koushanfar,Greg Fields,Xihe Gu,Yaman Jandali,Xinqiao Zhang,Akash Vartak,Tim Oates,Ben Erichson,Michael Mahoney,Rauf Izmailov,Xiangyu Zhang,Guangyu Shen,Siyuan Cheng,Shiqing Ma,XiaoFeng Wang,Haixu Tang,Di Tang,Xiaoyi Chen,Zihao Wang,Rui Zhu,Susmit Jha,Xiao Lin,Manoj Acharya,Wenchao Li,Chao Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The Intelligence Advanced Research Projects Activity (IARPA) launched the TrojAI program to confront an emerging vulnerability in modern artificial intelligence: the threat of AI Trojans. These AI trojans are malicious, hidden backdoors intentionally embedded within an AI model that can cause a system to fail in unexpected ways, or allow a malicious actor to hijack the AI model at will. This multi-year initiative helped to map out the complex nature of the threat, pioneered foundational detection methods, and identified unsolved challenges that require ongoing attention by the burgeoning AI security field. This report synthesizes the program’s key findings, including methodologies for detection through weight analysis and trigger inversion, as well as approaches for mitigating Trojan risks in deployed models. Comprehensive test and evaluation results highlight detector performance, sensitivity, and the prevalence of “natural” Trojans. The report concludes with lessons learned and recommendations for advancing AI security research.

[AI-51] Conformal Tradeoffs: Guarantees Beyond Coverage

【速读】:该论文旨在解决部署型校准预测器(conformal predictors)在实际应用中仅关注边际覆盖率(marginal coverage)而忽略操作性指标的问题,如承诺频率、拒答率、决策错误暴露和承诺纯度等。这些问题直接影响系统的可信赖性和实用性,且无法仅通过覆盖率保证来确定。解决方案的关键在于提出一个超越覆盖率的操作认证与规划框架:(1) 小样本Beta校正(Small-Sample Beta Correction, SSBC),基于精确的有限样本Beta/秩分布反演,将用户请求的置信水平 α\alpha^\star 和误差容忍度 δ\delta 映射为具有PAC风格语义的校准网格点,提供明确的有限窗口覆盖率保证;(2) 校准-审计(Calibrate-and-Audit)两阶段设计,利用独立审计集生成可复用的标签表和有限窗口不确定性包络(二项分布/贝塔-二项分布),通过线性投影估计多种操作指标;(3) 几何刻画机制,揭示由固定 conformal 分区引发的可行性约束、区域边界(对冲 vs. 拒绝)及成本一致性条件,阐明操作率之间的耦合关系及校准如何权衡其 trade-off。最终输出是一个可审计的操作菜单,用于追踪不同校准设置下的可达操作配置及其不确定性范围。

链接: https://arxiv.org/abs/2602.18045
作者: Petrus H. Zwart
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deployed conformal predictors are long-lived decision infrastructure operating over finite operational windows. The real-world question is not only ``Does the true label lie in the prediction set at the target rate?'' (marginal coverage), but ``How often does the system commit versus defer? What error exposure does it induce when it acts? How do these rates trade off?'' Marginal coverage does not determine these deployment-facing quantities: the same calibrated thresholds can yield different operational profiles depending on score geometry. We provide a framework for operational certification and planning beyond coverage with three contributions. (1) Small-Sample Beta Correction (SSBC): we invert the exact finite-sample Beta/rank law for split conformal to map a user request (\alpha^\star,\delta) to a calibrated grid point with PAC-style semantics, yielding explicit finite-window coverage guarantees. (2) Calibrate-and-Audit: since no distribution-free pivot exists for rates beyond coverage, we introduce a two-stage design in which an independent audit set produces a reusable region-label table and certified finite-window envelopes (Binomial/Beta-Binomial) for operational quantities – commitment frequency, deferral, decisive error exposure, and commit purity – via linear projection. (3) Geometric characterization: we describe feasibility constraints, regime boundaries (hedging vs.\ rejection), and cost-coherence conditions induced by a fixed conformal partition, explaining why operational rates are coupled and how calibration navigates their trade-offs. The output is an auditable operational menu: for a fixed scoring model, we trace attainable operational profiles across calibration settings and attach finite-window uncertainty envelopes. We demonstrate the approach on Tox21 toxicity prediction (12 endpoints) and aqueous solubility screening using AquaSolDB.
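SSBC 的反演步骤可以用纯标准库 Python 简要勾勒。假设采用标准结论:当 conformal 阈值取 n 个校准分数中第 k 小的值时,覆盖率在给定校准集下服从 Beta(k, n+1-k);整数参数的 Beta CDF 等价于一个二项分布上尾,因此校准秩 k 可通过一次扫描求得(函数名为示意,并非论文作者的代码):

```python
from math import comb

def ssbc_rank(n, alpha_star, delta):
    """Smallest calibration rank k such that, with the conformal threshold at
    the k-th smallest of n calibration scores, P(coverage < 1 - alpha_star)
    is at most delta. Uses Beta(k, n+1-k) CDF == Binomial(n, x) upper tail."""
    x = 1.0 - alpha_star
    # Binomial(n, x) pmf; fine in double precision for n up to ~1000.
    pmf = [comb(n, j) * x**j * (1.0 - x) ** (n - j) for j in range(n + 1)]
    tail, best = 0.0, None
    for k in range(n, 0, -1):          # tail = P(Binomial(n, x) >= k)
        tail += pmf[k]
        if tail <= delta:
            best = k                   # still within the delta risk budget
        else:
            break
    return best

# Example: 1000 calibration points, request 90% coverage with 95% confidence.
k = ssbc_rank(n=1000, alpha_star=0.10, delta=0.05)
# k exceeds the naive rank ceil((n+1)*(1-alpha)) = 901: a small
# conservativeness premium buys the PAC-style guarantee.
```

预测集随后以第 k 小的校准分数为阈值;k/(n+1) 与朴素水平之间的差距即为 (α*, δ) 保证的代价,并随 n 增大而缩小。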

[AI-52] Inelastic Constitutive Kolmogorov-Arnold Networks: A generalized framework for automated discovery of interpretable inelastic material models

【速读】:该论文旨在解决固体力学中材料本构关系(constitutive law)识别的问题,即如何从实验或模拟数据中自动提取描述材料弹性与非弹性行为的数学表达式。传统方法往往依赖于经验假设或参数拟合,难以兼顾精度与物理可解释性。本文提出的非弹性柯尔莫哥洛夫-阿诺德网络(iCKANs) 是一种新型人工神经网络架构,其关键在于能够自动发现符号化的本构定律,将材料测试数据转化为闭式数学形式的弹性与非弹性势函数,从而在保持高精度的同时确保物理可解释性。此外,iCKANs 还能整合温度等额外信息,扩展了对加工或服役条件影响材料性能的建模能力。

链接: https://arxiv.org/abs/2602.17750
作者: Chenyi Ji,Kian P. Abdolazizi,Hagen Holthusen,Christian J. Cyron,Kevin Linka
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:A key problem of solid mechanics is the identification of the constitutive law of a material, that is, the relation between strain and stress. Machine learning has led to considerable advances in this field lately. Here we introduce inelastic Constitutive Kolmogorov-Arnold Networks (iCKANs). This novel artificial neural network architecture can discover in an automated manner symbolic constitutive laws describing both the elastic and inelastic behavior of materials. That is, it can translate data from material testing into corresponding elastic and inelastic potential functions in closed mathematical form. We demonstrate the advantages of iCKANs using both synthetic data and experimental data of the viscoelastic polymer materials VHB 4910 and VHB 4905. The results demonstrate that iCKANs accurately capture complex viscoelastic behavior while preserving physical interpretability. It is a particular strength of iCKANs that they can process not only mechanical data but also arbitrary additional information available about a material (e.g., about temperature-dependent behavior). This makes iCKANs a powerful tool to discover in the future also how specific processing or service conditions affect the properties of materials.

[AI-53] GeneZip: Region-Aware Compression for Long Context DNA Modeling

【速读】:该论文旨在解决基因组尺度基础模型在处理长达数十亿碱基对(bp)的DNA序列时面临的计算与表示挑战。现有方法通常通过扩展小模型的上下文长度或依赖多GPU并行来应对,但存在效率瓶颈。其解决方案的关键在于利用基因组信息分布的高度不平衡性这一生物学先验:编码区仅占约2%,却信息密集,而非编码区则相对信息稀疏。作者提出GeneZip模型,结合HNet风格的动态路由机制与区域感知的压缩比目标函数,实现对不同基因组区域自适应分配表示预算,从而在保持低困惑度(仅增加0.31)的前提下实现137.6倍的压缩率。此设计显著缩短有效序列长度,使得在单张A100 80GB GPU上即可训练参数达6.36亿、上下文长度达1M bp的大规模模型,同时在接触图预测、表达数量性状位点预测和增强子-靶基因预测等下游任务中表现优于或相当于是当前最优模型JanusDNA。

链接: https://arxiv.org/abs/2602.17739
作者: Jianan Zhao,Xixian Liu,Zhihao Zhan,Xinyu Yuan,Hongyu Guo,Jian Tang
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint, work in progress

点击查看摘要

Abstract:Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction (about 2 percent) yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only 0.31 perplexity increase. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact map prediction, expression quantitative trait loci prediction, and enhancer-target gene prediction. By reducing effective sequence length, GeneZip unlocks simultaneous scaling of context and capacity: compared to the prior state-of-the-art model JanusDNA, it enables training models 82.6x larger at 1M-bp context, supporting a 636M-parameter GeneZip model at 1M-bp context. All experiments in this paper can be trained on a single A100 80GB GPU.

[AI-54] UBio-MolFM: A Universal Molecular Foundation Model for Bio-Systems

【速读】:该论文旨在解决分子模拟中量子力学(QM)精度与生物尺度之间存在的根本性权衡问题,即如何在保持高精度的同时扩展到大规模生物体系(如蛋白质、溶剂环境等)。其解决方案的关键在于提出一个通用基础模型框架UBio-MolFM,包含三大协同创新:一是构建了基于多保真度“双轨策略”的大规模生物特异性数据集UBio-Mol26(最多含1,200原子);二是设计了线性扩展的等变Transformer模型E2Former-V2,融合等变轴对齐稀疏化(EAAS)与长短程建模(LSR),显著提升推理效率(达~4倍加速);三是采用三阶段课程学习协议,从能量初始化逐步过渡到能量-力一致性优化,通过力监督缓解能量偏移。该框架在微观力和宏观可观测量(如水结构、离子溶剂化、肽链折叠)上均实现了对大规模分布外体系(最高约1,500原子)的从头计算级精度,为下一代计算生物学提供了可直接使用的鲁棒工具。

链接: https://arxiv.org/abs/2602.17709
作者: Lin Huang,Arthur Jiang,XiaoLi Liu,Zion Wang,Jason Zhao,Chu Wang,HaoCheng Lu,ChengXiang Huang,JiaJun Cheng,YiYue Du,Jia Zhang
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph)
备注:

点击查看摘要

Abstract:All-atom molecular simulation serves as a quintessential ``computational microscope'' for understanding the machinery of life, yet it remains fundamentally limited by the trade-off between quantum-mechanical (QM) accuracy and biological scale. We present UBio-MolFM, a universal foundation model framework specifically engineered to bridge this gap. UBio-MolFM introduces three synergistic innovations: (1) UBio-Mol26, a large bio-specific dataset constructed via a multi-fidelity ``Two-Pronged Strategy'' that combines systematic bottom-up enumeration with top-down sampling of native protein environments (up to 1,200 atoms); (2) E2Former-V2, a linear-scaling equivariant transformer that integrates Equivariant Axis-Aligned Sparsification (EAAS) and Long-Short Range (LSR) modeling to capture non-local physics with up to ~4x higher inference throughput in our large-system benchmarks; and (3) a Three-Stage Curriculum Learning protocol that transitions from energy initialization to energy-force consistency, with force-focused supervision to mitigate energy offsets. Rigorous benchmarking across microscopic forces and macroscopic observables – including liquid water structure, ionic solvation, and peptide folding – demonstrates that UBio-MolFM achieves ab initio-level fidelity on large, out-of-distribution biomolecular systems (up to ~1,500 atoms) and realistic MD observables. By reconciling scalability with quantum precision, UBio-MolFM provides a robust, ready-to-use tool for the next generation of computational biology.

机器学习

[LG-0] Assigning Confidence: K-partition Ensembles

链接: https://arxiv.org/abs/2602.18435
作者: Aggelos Semoglou,John Pavlopoulos
类目: Machine Learning (cs.LG)
备注: 31 pages including appendix

点击查看摘要

Abstract:Clustering is widely used for unsupervised structure discovery, yet it offers limited insight into how reliable each individual assignment is. Diagnostics, such as convergence behavior or objective values, may reflect global quality, but they do not indicate whether particular instances are assigned confidently, especially for initialization-sensitive algorithms like k-means. This assignment-level instability can undermine both accuracy and robustness. Ensemble approaches improve global consistency by aggregating multiple runs, but they typically lack tools for quantifying pointwise confidence in a way that combines cross-run agreement with geometric support from the learned cluster structure. We introduce CAKE (Confidence in Assignments via K-partition Ensembles), a framework that evaluates each point using two complementary statistics computed over a clustering ensemble: assignment stability and consistency of local geometric fit. These are combined into a single, interpretable score in [0,1]. Our theoretical analysis shows that CAKE remains effective under noise and separates stable from unstable points. Experiments on synthetic and real-world datasets indicate that CAKE effectively highlights ambiguous points and stable core members, providing a confidence ranking that can guide filtering or prioritization to improve clustering quality.
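CAKE 的两项统计量在论文中有精确定义;作为其中"跨次运行一致性"分量的一个简化、无需标签对齐的示意替代(并非论文的原始统计量),可以按每个点与其他所有点的共聚类关系在集成中被一致判定的程度打分,从而绕开跨次运行的簇标签匹配:

```python
def pointwise_stability(labelings):
    """Score each point in [0, 1] by cross-run co-membership agreement.

    labelings: list of R label sequences, one per clustering run, each of
    length n. For every partner j we count in how many runs i and j share a
    cluster; the majority verdict's agreement rate, averaged over partners,
    is point i's score. (A simplified alignment-free proxy, not the exact
    CAKE statistic from the paper.)
    """
    R, n = len(labelings), len(labelings[0])
    scores = []
    for i in range(n):
        agree = 0.0
        for j in range(n):
            if j == i:
                continue
            together = sum(lab[i] == lab[j] for lab in labelings)
            agree += max(together, R - together) / R
        scores.append(agree / (n - 1))
    return scores

runs = [
    [0, 0, 1, 1, 1],  # run 1
    [1, 1, 0, 0, 0],  # run 2: labels permuted, co-memberships identical to run 1
    [0, 0, 1, 1, 0],  # run 3: point 4 switches groups
]
print(pointwise_stability(runs))  # point 4 gets the lowest score
```

此例中点 0-3 得分较高,因为尽管原始标签被置换,各次运行对它们的共聚类关系判定一致;而在不同运行间换组的点 4 得分最低。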

[LG-1] Scientific Knowledge-Guided Machine Learning for Vessel Power Prediction: A Comparative Study AAAI2026

链接: https://arxiv.org/abs/2602.18403
作者: Orfeas Bourchas,George Papalambrou
类目: Machine Learning (cs.LG)
备注: Accepted to the KGML Bridge at AAAI 2026 (non-archival)

点击查看摘要

Abstract:Accurate prediction of main engine power is essential for vessel performance optimization, fuel efficiency, and compliance with emission regulations. Conventional machine learning approaches, such as Support Vector Machines, variants of Artificial Neural Networks (ANNs), and tree-based methods like Random Forests, Extra Tree Regressors, and XGBoost, can capture nonlinearities but often struggle to respect the fundamental propeller law relationship between power and speed, resulting in poor extrapolation outside the training envelope. This study introduces a hybrid modeling framework that integrates physics-based knowledge from sea trials with data-driven residual learning. The baseline component, derived from calm-water power curves of the form P = cV^n , captures the dominant power-speed dependence, while another, nonlinear, regressor is then trained to predict the residual power, representing deviations caused by environmental and operational conditions. By constraining the machine learning task to residual corrections, the hybrid model simplifies learning, improves generalization, and ensures consistency with the underlying physics. In this study, an XGBoost, a simple Neural Network, and a Physics-Informed Neural Network (PINN) coupled with the baseline component were compared to identical models without the baseline component. Validation on in-service data demonstrates that the hybrid model consistently outperformed a pure data-driven baseline in sparse data regions while maintaining similar performance in populated ones. The proposed framework provides a practical and computationally efficient tool for vessel performance monitoring, with applications in weather routing, trim optimization, and energy efficiency planning.
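上述两阶段混合结构可拆解为:对静水基线 P = cV^n 做闭式对数-对数拟合,再在环境特征上训练残差回归器。下面是一个纯标准库的玩具示意,用单特征线性模型代替论文中的 XGBoost/NN 残差学习器(数据与数值均为虚构):

```python
from math import log, exp

def fit_power_law(speeds, powers):
    # Closed-form least squares for log P = log c + n log V (calm-water baseline).
    xs, ys = [log(v) for v in speeds], [log(p) for p in powers]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    n = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return exp(my - n * mx), n

def fit_linear(xs, ys):
    # One-feature ordinary least squares for the residual model.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Toy data: true power is 2 * V^2.8 plus a wind-dependent load (illustrative).
speeds = [8.0, 10.0, 12.0, 14.0, 16.0]
winds = [2.0, 5.0, 1.0, 8.0, 4.0]
powers = [2.0 * v**2.8 + 30.0 * w for v, w in zip(speeds, winds)]

c, n = fit_power_law(speeds, powers)              # stage 1: physics baseline
residuals = [p - c * v**n for v, p in zip(speeds, powers)]
a, b = fit_linear(winds, residuals)               # stage 2: residual learner

def predict(v, w):
    return c * v**n + a + b * w                   # hybrid prediction
```

由于残差阶段按最小二乘拟合且零函数在假设类中,混合模型的训练误差不会超过纯基线,这正对应摘要中"把机器学习限制在残差修正上可简化学习任务"的论点。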

[LG-2] PRISM-FCP: Byzantine-Resilient Federated Conformal Prediction via Partial Sharing

链接: https://arxiv.org/abs/2602.18396
作者: Ehsan Lari,Reza Arablouei,Stefan Werner
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Probability (math.PR); Applications (stat.AP); Machine Learning (stat.ML)
备注: 13 pages, 5 figures, 2 tables, Submitted to IEEE Transactions on Signal Processing (TSP)

点击查看摘要

Abstract:We propose PRISM-FCP (Partial shaRing and robust calIbration with Statistical Margins for Federated Conformal Prediction), a Byzantine-resilient federated conformal prediction framework that utilizes partial model sharing to improve robustness against Byzantine attacks during both model training and conformal calibration. Existing approaches address adversarial behavior only in the calibration stage, leaving the learned model susceptible to poisoned updates. In contrast, PRISM-FCP mitigates attacks end-to-end. During training, clients partially share updates by transmitting only M of D parameters per round. This attenuates the expected energy of an adversary’s perturbation in the aggregated update by a factor of M/D , yielding lower mean-square error (MSE) and tighter prediction intervals. During calibration, clients convert nonconformity scores into characterization vectors, compute distance-based maliciousness scores, and downweight or filter suspected Byzantine contributions before estimating the conformal quantile. Extensive experiments on both synthetic data and the UCI Superconductivity dataset demonstrate that PRISM-FCP maintains nominal coverage guarantees under Byzantine attacks while avoiding the interval inflation observed in standard FCP with reduced communication, providing a robust and communication-efficient approach to federated uncertainty quantification.
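部分共享机制可直接示意:每个客户端每轮只传输其 D 维更新中的 M 个坐标,服务器对每个坐标上实际收到的值取均值。若存在一个发送常数扰动的拜占庭客户端,其扰动能量只能落在它共享的 M 个坐标上,这正是摘要所说的 M/D 衰减(以下仅为最小化演示,并非 PRISM-FCP 协议本身):

```python
import random

def partial_share(update, M, rng):
    """Client side: transmit only M of the D coordinates, chosen at random;
    untransmitted entries are marked None. (Minimal illustration of partial
    sharing, not the PRISM-FCP protocol itself.)"""
    idx = set(rng.sample(range(len(update)), M))
    return [u if i in idx else None for i, u in enumerate(update)]

def aggregate(shared_updates):
    """Server side: coordinate-wise mean over the clients that shared it."""
    D = len(shared_updates[0])
    out = []
    for i in range(D):
        vals = [s[i] for s in shared_updates if s[i] is not None]
        out.append(sum(vals) / len(vals) if vals else 0.0)
    return out

rng = random.Random(0)
honest = [partial_share([0.0] * 10, M=4, rng=rng) for _ in range(5)]
adversary = partial_share([1.0] * 10, M=4, rng=rng)   # constant perturbation
agg = aggregate(honest + [adversary])
# The adversary can perturb at most M = 4 of the D = 10 aggregated coordinates.
```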

[LG-3] Quantum Maximum Likelihood Prediction via Hilbert Space Embeddings

链接: https://arxiv.org/abs/2602.18364
作者: Sreejith Sreekumar,Nir Weinberger
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Quantum Physics (quant-ph); Machine Learning (stat.ML)
备注: 32+4 pages, 1 figure

点击查看摘要

Abstract:Recent works have proposed various explanations for the ability of modern large language models (LLMs) to perform in-context prediction. We propose an alternative conceptual viewpoint from an information-geometric and statistical perspective. Motivated by Bach [2023], we model training as learning an embedding of probability distributions into the space of quantum density operators, and in-context learning as maximum-likelihood prediction over a specified class of quantum models. We provide an interpretation of this predictor in terms of quantum reverse information projection and quantum Pythagorean theorem when the class of quantum models is sufficiently expressive. We further derive non-asymptotic performance guarantees in terms of convergence rates and concentration inequalities, both in trace norm and quantum relative entropy. Our approach provides a unified framework to handle both classical and quantum LLMs.

[LG-4] Explaining AutoClustering: Uncovering Meta-Feature Contribution in AutoML for Clustering

链接: https://arxiv.org/abs/2602.18348
作者: Matheus Camilo da Silva,Leonardo Arrighi,Ana Carolina Lorena,Sylvio Barbon Junior
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AutoClustering methods aim to automate unsupervised learning tasks, including algorithm selection (AS), hyperparameter optimization (HPO), and pipeline synthesis (PS), by often leveraging meta-learning over dataset meta-features. While these systems often achieve strong performance, their recommendations are often difficult to justify: the influence of dataset meta-features on algorithm and hyperparameter choices is typically not exposed, limiting reliability, bias diagnostics, and efficient meta-feature engineering. This limits reliability and diagnostic insight for further improvements. In this work, we investigate the explainability of the meta-models in AutoClustering. We first review 22 existing methods and organize their meta-features into a structured taxonomy. We then apply a global explainability technique (i.e., Decision Predicate Graphs) to assess feature importance within meta-models from selected frameworks. Finally, we use local explainability tools such as SHAP (SHapley Additive exPlanations) to analyse specific clustering decisions. Our findings highlight consistent patterns in meta-feature relevance, identify structural weaknesses in current meta-learning strategies that can distort recommendations, and provide actionable guidance for more interpretable Automated Machine Learning (AutoML) design. This study therefore offers a practical foundation for increasing decision transparency in unsupervised learning automation.

[LG-5] A Probabilistic Framework for LLM-Based Model Discovery

链接: https://arxiv.org/abs/2602.18266
作者: Stefan Wahl,Raphaela Schenk,Ali Farnoud,Jakob H. Macke,Daniel Gedon
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automated methods for discovering mechanistic simulator models from observational data offer a promising path toward accelerating scientific progress. Such methods often take the form of agentic-style iterative workflows that repeatedly propose and revise candidate models by imitating human discovery processes. However, existing LLM-based approaches typically implement such workflows via hand-crafted heuristic procedures, without an explicit probabilistic formulation. We recast model discovery as probabilistic inference, i.e., as sampling from an unknown distribution over mechanistic models capable of explaining the data. This perspective provides a unified way to reason about model proposal, refinement, and selection within a single inference framework. As a concrete instantiation of this view, we introduce ModelSMC, an algorithm based on Sequential Monte Carlo sampling. ModelSMC represents candidate models as particles which are iteratively proposed and refined by an LLM, and weighted using likelihood-based criteria. Experiments on real-world scientific systems illustrate that this formulation discovers models with interpretable mechanisms and improves posterior predictive checks. More broadly, this perspective provides a probabilistic lens for understanding and developing LLM-based approaches to model discovery.
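剥离 LLM 之后,摘要描述的推断循环就是一次标准的顺序蒙特卡洛迭代:细化粒子、按似然重加权、在有效样本量崩塌时重采样。下面是一个通用的标准库示意,用随机游走变异代替论文中由 LLM 承担的提议/细化步骤(名称与玩具似然均为虚构):

```python
import math, random

def smc_step(particles, log_weights, log_lik, refine, rng, ess_ratio=0.5):
    """One SMC iteration: refine, reweight by likelihood, resample on low ESS."""
    particles = [refine(p, rng) for p in particles]             # proposal (LLM stub)
    log_weights = [lw + log_lik(p) for lw, p in zip(log_weights, particles)]
    m = max(log_weights)                                        # normalize in log space
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    w = [wi / total for wi in w]
    ess = 1.0 / sum(wi * wi for wi in w)                        # effective sample size
    if ess < ess_ratio * len(particles):                        # degeneracy: resample
        particles = rng.choices(particles, weights=w, k=len(particles))
        log_weights = [0.0] * len(particles)                    # reset after resampling
    else:
        log_weights = [math.log(max(wi, 1e-300)) for wi in w]   # guard exact underflow
    return particles, log_weights

# Toy instance: "models" are scalars; the likelihood prefers values near 3.
rng = random.Random(0)
log_lik = lambda p: -(p - 3.0) ** 2
refine = lambda p, r: p + r.gauss(0.0, 0.5)                     # stand-in for an LLM edit
particles = [rng.uniform(-5.0, 5.0) for _ in range(200)]
log_w = [0.0] * 200
for _ in range(20):
    particles, log_w = smc_step(particles, log_w, log_lik, refine, rng)
# particles now concentrate near the high-likelihood region around 3
```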

[LG-6] MEG-to-MEG Transfer Learning and Cross-Task Speech/Silence Detection with Limited Data INTERSPEECH2026

链接: https://arxiv.org/abs/2602.18253
作者: Xabier de Zuazo,Vincenzo Verbeni,Eva Navas,Ibon Saratxaga,Mathieu Bourguignon,Nicola Molinaro
类目: Machine Learning (cs.LG)
备注: 6 pages, 3 figures, 3 tables, submitted to Interspeech 2026

点击查看摘要

Abstract:Data-efficient neural decoding is a central challenge for speech brain-computer interfaces. We present the first demonstration of transfer learning and cross-task decoding for MEG-based speech models spanning perception and production. We pre-train a Conformer-based model on 50 hours of single-subject listening data and fine-tune on just 5 minutes per subject across 18 participants. Transfer learning yields consistent improvements, with in-task accuracy gains of 1-4% and larger cross-task gains of up to 5-6%. Not only does pre-training improve performance within each task, but it also enables reliable cross-task decoding between perception and production. Critically, models trained on speech production decode passive listening above chance, confirming that learned representations reflect shared neural processes rather than task-specific motor activity.

[LG-7] Variational Distributional Neuron

链接: https://arxiv.org/abs/2602.18250
作者: Yves Ruffenach
类目: Machine Learning (cs.LG)
备注: 29 pages, 7 figures. Code available at GitHub (link in paper)

点击查看摘要

Abstract:We propose a proof of concept for a variational distributional neuron: a compute unit formulated as a VAE brick, explicitly carrying a prior, an amortized posterior and a local ELBO. The unit is no longer a deterministic scalar but a distribution: computing is no longer about propagating values, but about contracting a continuous space of possibilities under constraints. Each neuron parameterizes a posterior, propagates a reparameterized sample and is regularized by the KL term of a local ELBO - hence, the activation is distributional. This “contraction” becomes testable through local constraints and can be monitored via internal measures. The amount of contextual information carried by the unit, as well as the temporal persistence of this information, are locally tuned by distinct constraints. This proposal addresses a structural tension: in sequential generation, causality is predominantly organized in the symbolic space and, even when latents exist, they often remain auxiliary, while the effective dynamics are carried by a largely deterministic decoder. In parallel, probabilistic latent models capture factors of variation and uncertainty, but that uncertainty typically remains borne by global or parametric mechanisms, while units continue to propagate scalars - hence the pivot question: if uncertainty is intrinsic to computation, why does the compute unit not carry it explicitly? We therefore draw two axes: (i) the composition of probabilistic constraints, which must be made stable, interpretable and controllable; and (ii) granularity: if inference is a negotiation of distributions under constraints, should the primitive unit remain deterministic or become distributional? We analyze “collapse” modes and the conditions for a “living neuron”, then extend the contribution over time via autoregressive priors over the latent, per unit.
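对于带标准正态先验的对角高斯后验,摘要所述的计算单元可归结为一次重参数化采样加一个闭式 KL 惩罚项。以下是这样一个神经元的最小示意(仅作说明,并非论文的实际架构):

```python
import math, random

def distributional_unit(mu, log_sigma, rng):
    """One 'distributional neuron': sample via the reparameterization trick
    and return the local KL(N(mu, sigma^2) || N(0, 1)) for its ELBO term."""
    sigma = math.exp(log_sigma)
    z = mu + sigma * rng.gauss(0.0, 1.0)               # reparameterized activation
    kl = 0.5 * (mu**2 + sigma**2 - 1.0) - log_sigma    # closed-form Gaussian KL
    return z, kl

rng = random.Random(0)
z, kl = distributional_unit(mu=0.0, log_sigma=0.0, rng=rng)
print(kl)  # 0.0: posterior equals the prior, so the local KL penalty vanishes
```

将若干此类单元堆叠,并把各自的局部 KL 项与重构损失相加,即得到摘要所主张的逐单元 ELBO 结构。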

[LG-8] Neural-HSS: Hierarchical Semi-Separable Neural PDE Solver

链接: https://arxiv.org/abs/2602.18248
作者: Pietro Sittoni,Emanuele Zangrando,Angelo A. Casulli,Nicola Guglielmi,Francesco Tudisco
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Deep learning-based methods have shown remarkable effectiveness in solving PDEs, largely due to their ability to enable fast simulations once trained. However, despite the availability of high-performance computing infrastructure, many critical applications remain constrained by the substantial computational costs associated with generating large-scale, high-quality datasets and training models. In this work, inspired by studies on the structure of Green’s functions for elliptic PDEs, we introduce Neural-HSS, a parameter-efficient architecture built upon the Hierarchical Semi-Separable (HSS) matrix structure that is provably data-efficient for a broad class of PDEs. We theoretically analyze the proposed architecture, proving that it satisfies exactness properties even in very low-data regimes. We also investigate its connections with other architectural primitives, such as the Fourier neural operator layer and convolutional layers. We experimentally validate the data efficiency of Neural-HSS on the three-dimensional Poisson equation over a grid of two million points, demonstrating its superior ability to learn from data generated by elliptic PDEs in the low-data regime while outperforming baseline methods. Finally, we demonstrate its capability to learn from data arising from a broad class of PDEs in diverse domains, including electromagnetism, fluid dynamics, and biology.

[LG-9] Parameter-Efficient Domain Adaptation of Physics-Informed Self-Attention based GNNs for AC Power Flow Prediction

链接: https://arxiv.org/abs/2602.18227
作者: Redwanul Karim(1),Changhun Kim(1),Timon Conrad(2),Nora Gourmelon(1),Julian Oelhaf(1),David Riebesel(2),Tomás Arias-Vergara(1),Andreas Maier(1),Johann Jäger(2),Siming Bayer(1) ((1) Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany, (2) Institute of Electrical Energy Systems, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate AC-PF prediction under domain shift is critical when models trained on medium-voltage (MV) grids are deployed on high-voltage (HV) networks. Existing physics-informed graph neural solvers typically rely on full fine-tuning for cross-regime transfer, incurring high retraining cost and offering limited control over the stability-plasticity trade-off between target-domain adaptation and source-domain retention. We study parameter-efficient domain adaptation for physics-informed self-attention based GNN, encouraging Kirchhoff-consistent behavior via a physics-based loss while restricting adaptation to low-rank updates. Specifically, we apply LoRA to attention projections with selective unfreezing of the prediction head to regulate adaptation capacity. This design yields a controllable efficiency-accuracy trade-off for physics-constrained inverse estimation under voltage-regime shift. Across multiple grid topologies, the proposed LoRA+PHead adaptation recovers near-full fine-tuning accuracy with a target-domain RMSE gap of 2.6\times10^{-4} while reducing the number of trainable parameters by 85.46%. The physics-based residual remains comparable to full fine-tuning; however, relative to Full FT, LoRA+PHead reduces MV source retention by 4.7 percentage points (17.9% vs. 22.6%) under domain shift, while still enabling parameter-efficient and physically consistent AC-PF estimation.
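摘要中 LoRA 式低秩适配的思路可以用如下最小草图说明:冻结预训练权重 W,仅训练低秩因子 A、B(B 零初始化使适配起点与原模型一致)。此处的维度、秩与 alpha 等超参数均为演示假设,与论文的 GNN 实现无关:

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r = 64, 64, 4                 # 维度与秩均为演示假设

W = rng.normal(size=(d_out, d_in))         # 冻结的预训练权重
A = rng.normal(0, 0.01, size=(r, d_in))    # 可训练低秩因子
B = np.zeros((d_out, r))                   # 零初始化: 适配起点与原模型一致

def adapted_forward(x, alpha=8.0):
    # W 保持冻结, 仅 A、B 参与训练
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
trainable = A.size + B.size
reduction = 1 - trainable / (trainable + W.size)  # 可训练参数占比的削减
```

B 为零时 LoRA 支路不起作用,适配后的层与冻结层输出完全一致;可训练参数相对全量微调大幅削减。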

[LG-10] SimVLA: A Simple VLA Baseline for Robotic Manipulation

链接: https://arxiv.org/abs/2602.18224
作者: Yuankai Luo,Woping Chen,Tong Liang,Baiqiao Wang,Zhenguo Li
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation, leveraging large-scale pre-training to achieve strong performance. The field has rapidly evolved with additional spatial priors and diverse architectural innovations. However, these advancements are often accompanied by varying training recipes and implementation details, which can make it challenging to disentangle the precise source of empirical gains. In this work, we introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research. By strictly decoupling perception from control, using a standard vision-language backbone and a lightweight action head, and standardizing critical training dynamics, we demonstrate that a minimal design can achieve state-of-the-art performance. Despite having only 0.5B parameters, SimVLA outperforms multi-billion-parameter models on standard simulation benchmarks without robot pretraining. SimVLA also reaches on-par real-robot performance compared to pi0.5. Our results establish SimVLA as a robust, reproducible baseline that enables clear attribution of empirical gains to future architectural innovations. Website: this https URL

[LG-11] Generative Model via Quantile Assignment

链接: https://arxiv.org/abs/2602.18216
作者: Georgi Hrusanov,Oliver Y. Chén,Julien S. Bodelet
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Generative models (DGMs) play two key roles in modern machine learning: (i) producing new information (e.g., image synthesis) and (ii) reducing dimensionality. However, traditional architectures often rely on auxiliary networks such as encoders in Variational Autoencoders (VAEs) or discriminators in Generative Adversarial Networks (GANs), which introduce training instability, computational overhead, and risks like mode collapse. We present NeuroSQL, a new generative paradigm that eliminates the need for auxiliary networks by learning low-dimensional latent representations implicitly. NeuroSQL leverages an asymptotic approximation that expresses the latent variables as the solution to an optimal transportation problem. Specifically, NeuroSQL learns the latent variables by solving a linear assignment problem and then passes the latent information to a standalone generator. We benchmark its performance against GANs, VAEs, and a budget-matched diffusion baseline on four datasets: handwritten digits (MNIST), faces (CelebA), animal faces (AFHQ), and brain images (OASIS). Compared to VAEs, GANs, and diffusion models: (1) in terms of image quality, NeuroSQL achieves overall lower mean pixel distance between synthetic and authentic images and stronger perceptual/structural fidelity; (2) computationally, NeuroSQL requires the least training time; and (3) practically, NeuroSQL provides an effective solution for generating synthetic data with limited training samples. By embracing quantile assignment rather than an encoder, NeuroSQL provides a fast, stable, and robust way to generate synthetic data with minimal information loss.
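摘要中"通过线性指派问题学习隐变量"的核心想法可用一维玩具例子示意:把数据点指派到一组标准正态分位数上,使平方传输代价最小。此处用暴力枚举代替真实实现中的指派求解器(如匈牙利算法),分位数取值为演示假设:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)
n = 5
data = np.sort(rng.normal(size=n))                     # 一维"数据特征"(玩具)
quantiles = np.array([-1.28, -0.52, 0.0, 0.52, 1.28])  # 标准正态分位数网格(演示值)

# 线性指派: 最小化平方传输代价; 真实实现会用匈牙利算法等, 此处暴力枚举
cost = (data[:, None] - quantiles[None, :]) ** 2
best_perm = min(permutations(range(n)),
                key=lambda p: sum(cost[i, p[i]] for i in range(n)))
latents = quantiles[list(best_perm)]                   # 指派得到的隐变量
```

两列都已排序时,平方代价下的最优指派必为单调对应(重排不等式),故排序后的数据恰好对上排序后的分位数。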

[LG-12] RAT: Train Dense Infer Sparse – Recurrence Augmented Attention for Dilated Inference

链接: https://arxiv.org/abs/2602.18196
作者: Xiuying Wei,Caglar Gulcehre
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of the attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. However, we find a persistent failure mode of these methods: sparsifying a pretrained attention model to a dilated pattern leads to severe accuracy degradation. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once, then flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D=16 and drops by about 2-3 points at D=64 on commonsense reasoning and LongBench tasks, respectively. Moreover, RAT+ outperforms attention when sparsified to top-k block attention. We further scale to 2.6B parameters and 200B tokens and observe the same trend.
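膨胀注意力(dilated attention)如何把 KV 缓存与打分 FLOPs 同时压缩 D 倍,可用如下单查询、单头的示意代码说明(仅演示稀疏模式本身,与 RAT+ 的具体结构无关):

```python
import numpy as np

rng = np.random.default_rng(4)
T, d, D = 32, 8, 4                  # 序列长度, 头维度, 膨胀系数(演示值)

q = rng.normal(size=d)              # 末位置的查询向量
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# 膨胀注意力只保留每隔 D 个位置的键值, KV 缓存与打分 FLOPs 同时缩小 D 倍
idx = np.arange(T - 1, -1, -D)[::-1]        # 从末位置回溯的膨胀键位置
w = softmax(K[idx] @ q / np.sqrt(d))
out = w @ V[idx]                            # 只读取 T // D 份键值
```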

[LG-13] SeedFlood: A Step Toward Scalable Decentralized Training of LLMs

链接: https://arxiv.org/abs/2602.18181
作者: Jihun Kim,Namhoon Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work presents a new approach to decentralized training, SeedFlood, designed to scale to large models across complex network topologies and achieve global consensus with minimal communication overhead. Traditional gossip-based methods suffer from message communication costs that grow with model size, while information decay over network hops renders global consensus inefficient. SeedFlood departs from these practices by exploiting the seed-reconstructible structure of zeroth-order updates, effectively making the messages near-zero in size and allowing them to be flooded to every client in the network. This mechanism makes communication overhead negligible and independent of model size, removing the primary scalability bottleneck in decentralized training. Consequently, SeedFlood enables training in regimes previously considered impractical, such as billion-parameter models distributed across hundreds of clients. Our experiments on decentralized LLM fine-tuning demonstrate that SeedFlood consistently outperforms gossip-based baselines in both generalization performance and communication efficiency, and even achieves results comparable to first-order methods in large-scale settings.
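"种子可重构"的零阶更新为何能把消息压缩到与模型规模无关的近零大小,可用如下草图说明:发送方只需广播(随机种子, 标量步长),接收方用同一种子重建扰动方向并应用更新。函数名与默认超参均为演示假设,并非 SeedFlood 的实际实现:

```python
import numpy as np

dim = 1000   # 参数规模(演示值); 注意消息大小与 dim 完全无关

def zo_update_message(params, loss_fn, seed, eps=1e-3, lr=1e-4):
    """发送方: 沿种子生成的随机方向做零阶(有限差分)梯度估计,
    消息只有 (种子, 标量步长) 两个数"""
    u = np.random.default_rng(seed).normal(size=params.shape)
    g = (loss_fn(params + eps * u) - loss_fn(params - eps * u)) / (2 * eps)
    return seed, -lr * g

def apply_message(params, seed, step):
    """接收方: 由同一种子重建相同的方向并应用更新"""
    u = np.random.default_rng(seed).normal(size=params.shape)
    return params + step * u

loss = lambda p: float(np.sum(p ** 2))   # 玩具二次损失
p = np.ones(dim)
seed, step = zo_update_message(p, loss, seed=42)
p_new = apply_message(p, seed, step)     # 重建出的更新使损失下降
```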

[LG-14] A Deep Surrogate Model for Robust and Generalizable Long-Term Blast Wave Prediction

链接: https://arxiv.org/abs/2602.18168
作者: Danning Jing,Xinhai Chen,Xifeng Pu,Jie Hu,Chao Huang,Xuguang Chen,Qinglin Wang,Jie Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately modeling the spatio-temporal dynamics of blast wave propagation remains a longstanding challenge due to its highly nonlinear behavior, sharp gradients, and burdensome computational cost. While machine learning-based surrogate models offer fast inference as a promising alternative, they suffer from degraded accuracy, particularly when evaluated on complex urban layouts or out-of-distribution scenarios. Moreover, autoregressive prediction strategies in such models are prone to error accumulation over long forecasting horizons, limiting their robustness for extended-time simulations. To address these limitations, we propose RGD-Blast, a robust and generalizable deep surrogate model for high-fidelity, long-term blast wave forecasting. RGD-Blast incorporates a multi-scale module to capture both global flow patterns and local boundary interactions, effectively mitigating error accumulation during autoregressive prediction. We introduce a dynamic-static feature coupling mechanism that fuses time-varying pressure fields with static source and layout features, thereby enhancing out-of-distribution generalization. Experiments demonstrate that RGD-Blast achieves a two-order-of-magnitude speedup over traditional numerical methods while maintaining comparable accuracy. In generalization tests on unseen building layouts, the model achieves an average RMSE below 0.01 and an R^2 exceeding 0.89 over 280 consecutive time steps. Additional evaluations under varying blast source locations and explosive charge weights further validate its generalization, substantially advancing the state of the art in long-term blast wave modeling.

[LG-15] Unifying Formal Explanations: A Complexity-Theoretic Perspective ICLR2026

链接: https://arxiv.org/abs/2602.18160
作者: Shahaf Bassan,Xuanxiang Huang,Guy Katz
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
*备注: To appear in ICLR 2026

点击查看摘要

Abstract:Previous work has explored the computational complexity of deriving two fundamental types of explanations for ML model predictions: (1) sufficient reasons, which are subsets of input features that, when fixed, determine a prediction, and (2) contrastive reasons, which are subsets of input features that, when modified, alter a prediction. Prior studies have examined these explanations in different contexts, such as non-probabilistic versus probabilistic frameworks and local versus global settings. In this study, we introduce a unified framework for analyzing these explanations, demonstrating that they can all be characterized through the minimization of a unified probabilistic value function. We then prove that the complexity of these computations is influenced by three key properties of the value function: (1) monotonicity, (2) submodularity, and (3) supermodularity - which are three fundamental properties in combinatorial optimization. Our findings uncover some counterintuitive results regarding the nature of these properties within the explanation settings examined. For instance, although the local value functions do not exhibit monotonicity or submodularity/supermodularity whatsoever, we demonstrate that the global value functions do possess these properties. This distinction enables us to prove a series of novel polynomial-time results for computing various explanations with provable guarantees in the global explainability setting, across a range of ML models that span the interpretability spectrum, such as neural networks, decision trees, and tree ensembles. In contrast, we show that even highly simplified versions of these explanations become NP-hard to compute in the corresponding local explainability setting.
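摘要涉及的次模/超模性可以用小规模暴力检验来直观理解:次模性即"边际收益递减"。下例检验两个基数函数,f(S)=|S|^0.5(基数的凹函数)满足次模性,而 f(S)=|S|^2(超模)不满足:

```python
from itertools import combinations

def powerset(ground):
    return [frozenset(c) for k in range(len(ground) + 1)
            for c in combinations(ground, k)]

def is_submodular(f, ground):
    """暴力检验边际收益递减: 对所有 S ⊆ T 且 e ∉ T,
    要求 f(S∪{e}) - f(S) >= f(T∪{e}) - f(T)"""
    for S in powerset(ground):
        for T in powerset(ground):
            if not S <= T:
                continue
            for e in ground:
                if e in T:
                    continue
                if f(S | {e}) - f(S) < f(T | {e}) - f(T) - 1e-12:
                    return False
    return True

ground = (0, 1, 2)
sub = is_submodular(lambda S: len(S) ** 0.5, ground)        # 基数的凹函数: 次模
sup_not_sub = is_submodular(lambda S: len(S) ** 2, ground)  # 超模, 非次模
```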

[LG-16] Rethinking Beam Management: Generalization Limits Under Hardware Heterogeneity

链接: https://arxiv.org/abs/2602.18151
作者: Nikita Zeulin,Olga Galinina,Ibrahim Kilinc,Sergey Andreev,Robert W. Heath Jr
类目: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Hardware heterogeneity across diverse user devices poses new challenges for beam-based communication in 5G and beyond. This heterogeneity limits the applicability of machine learning (ML)-based algorithms. This article highlights the critical need to treat hardware heterogeneity as a first-class design concern in ML-aided beam management. We analyze key failure modes in the presence of heterogeneity and present case studies demonstrating their performance impact. Finally, we discuss potential strategies to improve generalization in beam management.

[LG-17] Stable Long-Horizon Spatiotemporal Prediction on Meshes Using Latent Multiscale Recurrent Graph Neural Networks

链接: https://arxiv.org/abs/2602.18146
作者: Lionel Salesses,Larbi Arbaoui,Tariq Benamara,Arnaud Francois,Caroline Sainvitu
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Applied Physics (physics.app-ph)
*备注:

点击查看摘要

Abstract:Accurate long-horizon prediction of spatiotemporal fields on complex geometries is a fundamental challenge in scientific machine learning, with applications such as additive manufacturing where temperature histories govern defect formation and mechanical properties. High-fidelity simulations are accurate but computationally costly, and despite recent advances, machine learning methods remain challenged by long-horizon temperature and gradient prediction. We propose a deep learning framework for predicting full temperature histories directly on meshes, conditioned on geometry and process parameters, while maintaining stability over thousands of time steps and generalizing across heterogeneous geometries. The framework adopts a temporal multiscale architecture composed of two coupled models operating at complementary time scales. Both models rely on a latent recurrent graph neural network to capture spatiotemporal dynamics on meshes, while a variational graph autoencoder provides a compact latent representation that reduces memory usage and improves training stability. Experiments on simulated powder bed fusion data demonstrate accurate and temporally stable long-horizon predictions across diverse geometries, outperforming existing baseline. Although evaluated in two dimensions, the framework is general and extensible to physics-driven systems with multiscale dynamics and to three-dimensional geometries.

[LG-18] Advection-Diffusion on Graphs: A Bakry-Emery Laplacian for Spectral Graph Neural Networks

链接: https://arxiv.org/abs/2602.18141
作者: Pierre-Gabriel Berlureau,Ali Hariri,Victor Kawasaki-Borruat,Mia Zosso,Pierre Vandergheynst
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) often struggle to propagate information across long distances due to oversmoothing and oversquashing. Existing remedies such as graph transformers or rewiring typically incur high computational cost or require altering the graph structure. We introduce a Bakry-Emery graph Laplacian that integrates diffusion and advection through a learnable node-wise potential, inducing task-dependent propagation dynamics without modifying topology. This operator has a well-behaved spectral decomposition and acts as a drop-in replacement for standard Laplacians in spectral GNNs. Building on this insight, we develop mu-ChebNet, a spectral architecture that jointly learns the potential and Chebyshev filters, effectively bridging message-passing adaptivity and spectral efficiency. Our theoretical analysis shows how the potential modulates the spectrum, enabling control of key graph properties. Empirically, mu-ChebNet delivers consistent gains on synthetic long-range reasoning tasks, as well as real-world benchmarks, while offering an interpretable routing field that reveals how information flows through the graph. This establishes the Bakry-Emery Laplacian as a principled and efficient foundation for adaptive spectral graph learning.
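在图上把扩散与节点势诱导的输运结合起来,一种常见做法是先用势对边重加权、再取图拉普拉斯算子。下面的草图只演示这类构造的基本谱性质(常向量在核中、半正定),并不代表论文中 Bakry-Emery 算子的精确定义:

```python
import numpy as np

n = 6
W = np.zeros((n, n))            # 6 节点路径图的邻接矩阵
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

V = np.linspace(0.0, 1.0, n)    # 节点势(论文中可学习, 此处固定用作演示)

# 用势对边重加权后取图拉普拉斯(演示假设的一种构造, 非论文精确算子)
Wv = W * np.exp(-0.5 * (V[:, None] + V[None, :]))
L = np.diag(Wv.sum(axis=1)) - Wv

evals = np.linalg.eigvalsh(L)   # 势调制了谱, 但仍保持拉普拉斯的基本结构
```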

[LG-19] Learning Long-Range Dependencies with Temporal Predictive Coding

链接: https://arxiv.org/abs/2602.18131
作者: Tom Potter,Oliver Rhodes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predictive Coding (PC) is a biologically-inspired learning framework characterised by local, parallelisable operations, properties that enable energy-efficient implementation on neuromorphic hardware. Despite this, extending PC effectively to recurrent neural networks (RNNs) has been challenging, particularly for tasks involving long-range temporal dependencies. Backpropagation Through Time (BPTT) remains the dominant method for training RNNs, but its non-local computation, lack of spatial parallelism, and requirement to store extensive activation histories results in significant energy consumption. This work introduces a novel method combining Temporal Predictive Coding (tPC) with approximate Real-Time Recurrent Learning (RTRL), enabling effective spatio-temporal credit assignment. Results indicate that the proposed method can closely match the performance of BPTT on both synthetic benchmarks and real-world tasks. On a challenging machine translation task, with a 15-million parameter model, the proposed method achieves a test perplexity of 7.62 (vs. 7.49 for BPTT), marking one of the first applications of tPC to tasks of this scale. These findings demonstrate the potential of this method to learn complex temporal dependencies whilst retaining the local, parallelisable, and flexible properties of the original PC framework, paving the way for more energy-efficient learning systems.

[LG-20] Non-Stationary Online Resource Allocation: Learning from a Single Sample

链接: https://arxiv.org/abs/2602.18114
作者: Yiding Feng,Jiashuo Jiang,Yige Wang
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study online resource allocation under non-stationary demand with a minimum offline data requirement. In this problem, a decision-maker must allocate multiple types of resources to sequentially arriving queries over a finite horizon. Each query belongs to a finite set of types with fixed resource consumption and a stochastic reward drawn from an unknown, type-specific distribution. Critically, the environment exhibits arbitrary non-stationarity (arrival distributions may shift unpredictably), while the algorithm requires only one historical sample per period to operate effectively. We distinguish two settings based on sample informativeness: (i) reward-observed samples containing both query type and reward realization, and (ii) the more challenging type-only samples revealing only query type information. We propose a novel type-dependent quantile-based meta-policy that decouples the problem into modular components: reward distribution estimation, optimization of target service probabilities via fluid relaxation, and real-time decisions through dynamic acceptance thresholds. For reward-observed samples, our static threshold policy achieves \tilde{O}(\sqrt{T}) regret. For type-only samples, we first establish that sublinear regret is impossible without additional structure; under a mild minimum-arrival-probability assumption, we design both a partially adaptive policy attaining the same \tilde{O}(\sqrt{T}) bound and, more significantly, a fully adaptive resolving policy with careful rounding that achieves the first poly-logarithmic regret guarantee of O((\log T)^3) for non-stationary multi-resource allocation. Our framework advances prior work by operating with minimal offline data (one sample per period), handling arbitrary non-stationarity without variation-budget assumptions, and supporting multiple resource constraints.
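分位数阈值策略可以用如下单类型、单资源的玩具例子示意:利用"每期一个"的历史样本估计奖励分位数,并以之作为接受阈值去匹配容量约束。所有参数均为演示假设,与论文的多资源元策略相比大幅简化:

```python
import numpy as np

rng = np.random.default_rng(7)
T, capacity = 200, 60                                # 期数与资源容量(演示值)
rewards = rng.exponential(scale=1.0, size=T)         # 在线到达的随机奖励
offline_sample = rng.exponential(scale=1.0, size=T)  # 每期一个历史样本

# 为了在 T 期内大约服务 capacity 个请求,
# 取离线样本的 (1 - capacity/T) 分位数作为接受阈值
threshold = np.quantile(offline_sample, 1 - capacity / T)

accepted = 0
for r in rewards:
    if accepted < capacity and r >= threshold:       # 资源可行性 + 阈值接受
        accepted += 1
```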

[LG-21] TempoNet: Slack-Quantized Transformer-Guided Reinforcement Scheduler for Adaptive Deadline-Centric Real-Time Dispatch

链接: https://arxiv.org/abs/2602.18109
作者: Rong Fu,Yibo Meng,Guangzhen Yao,Jiaxuan Lu,Zeyu Zhang,Zhaolu Kang,Ziming Guo,Jia Yee Tan,Xiaojing Du,Simon James Fong
类目: Machine Learning (cs.LG); Operating Systems (cs.OS); Systems and Control (eess.SY)
*备注: 43 pages, 12 figures

点击查看摘要

Abstract:Real-time schedulers must reason about tight deadlines under strict compute budgets. We present TempoNet, a reinforcement learning scheduler that pairs a permutation-invariant Transformer with a deep Q-approximation. An Urgency Tokenizer discretizes temporal slack into learnable embeddings, stabilizing value learning and capturing deadline proximity. A latency-aware sparse attention stack with blockwise top-k selection and locality-sensitive chunking enables global reasoning over unordered task sets with near-linear scaling and sub-millisecond inference. A multicore mapping layer converts contextualized Q-scores into processor assignments through masked-greedy selection or differentiable matching. Extensive evaluations on industrial mixed-criticality traces and large multiprocessor settings show consistent gains in deadline fulfillment over analytic schedulers and neural baselines, together with improved optimization stability. Diagnostics include sensitivity analyses for slack quantization, attention-driven policy interpretation, hardware-in-the-loop and kernel micro-benchmarks, and robustness under stress with simple runtime mitigations; we also report sample-efficiency benefits from behavioral-cloning pretraining and compatibility with an actor-critic variant without altering the inference pipeline. These results establish a practical framework for Transformer-based decision making in high-throughput real-time scheduling.
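Urgency Tokenizer 将时间裕量(slack)离散为可学习嵌入的思路,可用如下草图说明。分桶边界采用对数间隔(演示假设,意在让临近截止期的任务获得更细分辨率),嵌入表此处随机初始化:

```python
import numpy as np

rng = np.random.default_rng(8)
n_bins, d_embed = 8, 16
embed = rng.normal(size=(n_bins, d_embed))     # 每个裕量桶一个可学习嵌入
edges = np.geomspace(1e-3, 10.0, n_bins - 1)   # 对数间隔桶边界(演示假设)

def urgency_token(slack):
    """把时间裕量离散到桶并查嵌入, 模拟 Urgency-Tokenizer 式前端"""
    b = int(np.searchsorted(edges, max(slack, 0.0)))
    return b, embed[b]

b_tight, _ = urgency_token(0.0005)   # 极紧裕量 -> 最低(最紧急)的桶
b_loose, _ = urgency_token(5.0)      # 宽松裕量 -> 较高的桶
```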

[LG-22] Interacting safely with cyclists using Hamilton-Jacobi reachability and reinforcement learning

链接: https://arxiv.org/abs/2602.18097
作者: Aarati Andrea Noronha,Jean Oh
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages. This manuscript was completed in 2020 as part of the first author’s graduate thesis at Carnegie Mellon University

点击查看摘要

Abstract:In this paper, we present a framework for enabling autonomous vehicles to interact with cyclists in a manner that balances safety and optimality. The approach integrates Hamilton-Jacobi reachability analysis with deep Q-learning to jointly address safety guarantees and time-efficient navigation. A value function is computed as the solution to a time-dependent Hamilton-Jacobi-Bellman inequality, providing a quantitative measure of safety for each system state. This safety metric is incorporated as a structured reward signal within a reinforcement learning framework. The method further models the cyclist’s latent response to the vehicle, allowing disturbance inputs to reflect human comfort and behavioral adaptation. The proposed framework is evaluated through simulation and comparison with human driving behavior and an existing state-of-the-art method.

[LG-23] Balancing Symmetry and Efficiency in Graph Flow Matching

链接: https://arxiv.org/abs/2602.18084
作者: Benjamin Honoré,Alba Carballo-Castro,Yiming Qin,Pascal Frossard
类目: Machine Learning (cs.LG)
*备注: 15 pages, 11 figures

点击查看摘要

Abstract:Equivariance is central to graph generative models, as it ensures the model respects the permutation symmetry of graphs. However, strict equivariance can increase computational cost due to added architectural constraints, and can slow down convergence because the model must be consistent across a large space of possible node permutations. We study this trade-off for graph generative models. Specifically, we start from an equivariant discrete flow-matching model, and relax its equivariance during training via a controllable symmetry modulation scheme based on sinusoidal positional encodings and node permutations. Experiments first show that symmetry-breaking can accelerate early training by providing an easier learning signal, but at the expense of encouraging shortcut solutions that can cause overfitting, where the model repeatedly generates graphs that are duplicates of the training set. On the contrary, properly modulating the symmetry signal can delay overfitting while accelerating convergence, allowing the model to reach stronger performance with 19% of the baseline training epochs.
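"可控的对称性调制"可以用带缩放因子的正弦位置编码来示意:缩放为 0 时不注入任何节点顺序信息(保持置换对称),为 1 时完全破坏对称。该旋钮的具体形式为演示假设,并非论文的精确方案:

```python
import numpy as np

def sinusoidal_pe(n_nodes, d, scale=1.0):
    """带调制因子的正弦节点编码: scale=0 保持置换对称, scale=1 完全破坏"""
    pos = np.arange(n_nodes)[:, None]
    freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    pe = np.zeros((n_nodes, d))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return scale * pe

pe_sym = sinusoidal_pe(5, 8, scale=0.0)      # 不注入顺序信息
pe_broken = sinusoidal_pe(5, 8, scale=1.0)   # 节点获得可区分的位置信号
```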

[LG-24] Deepmechanics KDD2026

链接: https://arxiv.org/abs/2602.18060
作者: Abhay Shinde,Aryan Amit Barsainyan,Jose Siguenza,Ankita Vaishnobi Bisoi,Rakshit Kr. Singh,Bharath Ramsundar
类目: Machine Learning (cs.LG)
*备注: 11 pages, 7 figures, Submitted to KDD 2026

点击查看摘要

Abstract:Physics-informed deep learning models have emerged as powerful tools for learning dynamical systems. These models directly encode physical principles into network architectures. However, systematic benchmarking of these approaches across diverse physical phenomena remains limited, particularly in conservative and dissipative systems. In addition, benchmarking done thus far does not integrate full trajectories to check stability. In this work, we benchmark three prominent physics-informed architectures: Hamiltonian Neural Networks (HNN), Lagrangian Neural Networks (LNN), and Symplectic Recurrent Neural Networks (SRNN), using the DeepChem framework, an open-source scientific machine learning library. We evaluate these models on six dynamical systems spanning classical conservative mechanics (mass-spring system, simple pendulum, double pendulum, three-body problem, and spring-pendulum) and non-conservative systems with contact (bouncing ball). We evaluate models by computing error on predicted trajectories, both quantitatively and qualitatively. We find that all benchmarked models struggle to maintain stability for chaotic or nonconservative systems. Our results suggest that more research is needed for physics-informed deep learning models to learn robust models of classical mechanical systems.
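以 HNN 的核心机制为例:网络给出能量函数 H(q, p),动力学由哈密顿方程 dq/dt = dH/dp、dp/dt = -dH/dq 生成,再用辛积分器推演完整轨迹以检查稳定性。下面的草图用已知的质量-弹簧哈密顿量代替网络,仅演示辛更新与能量近守恒:

```python
# 质量-弹簧系统: H(q, p) = (p^2 + q^2) / 2
# HNN 中 H 由网络给出, 此处用解析式代替, 以演示辛更新规则本身
dH_dq = lambda q, p: q
dH_dp = lambda q, p: p
H = lambda q, p: 0.5 * (p * p + q * q)

def leapfrog(q, p, dt=0.01, steps=1000):
    """蛙跳(辛)积分: dq/dt = dH/dp, dp/dt = -dH/dq"""
    for _ in range(steps):
        p -= 0.5 * dt * dH_dq(q, p)
        q += dt * dH_dp(q, p)
        p -= 0.5 * dt * dH_dq(q, p)
    return q, p

q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0)
energy_drift = abs(H(q1, p1) - H(q0, p0))   # 千步推演后能量漂移仍然很小
```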

[LG-25] Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework

链接: https://arxiv.org/abs/2602.18055
作者: Jingyang Qiao,Zhizhong Zhang,Xin Tan,Jingyu Gong,Yanyun Qu,Yuan Xie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dual-to-Dual MLLMs refer to Multimodal Large Language Models, which can enable unified multimodal comprehension and generation through text and image modalities. Although exhibiting strong instantaneous learning and generalization capabilities, Dual-to-Dual MLLMs still remain deficient in lifelong evolution, significantly affecting continual adaptation to dynamic real-world scenarios. One of the challenges is that learning new tasks inevitably destroys the learned knowledge. Beyond traditional catastrophic forgetting, Dual-to-Dual MLLMs face other challenges, including hallucination, instruction unfollowing, and failures in cross-modal knowledge transfer. However, no standardized continual learning framework for Dual-to-Dual MLLMs has been established yet, leaving these challenges unexplored. Thus, in this paper, we establish Continual-NExT, a continual learning framework for Dual-to-Dual MLLMs with deliberately-architected evaluation metrics. To improve the continual learning capability of Dual-to-Dual MLLMs, we propose an efficient MAGE (Mixture and Aggregation of General LoRA and Expert LoRA) method to further facilitate knowledge transfer across modalities and mitigate forgetting. Extensive experiments demonstrate that MAGE outperforms other continual learning methods and achieves state-of-the-art performance.

[LG-26] Asynchronous Heavy-Tailed Optimization

链接: https://arxiv.org/abs/2602.18002
作者: Junfei Sun,Dixi Yao,Xuchen Gong,Tahseen Rabbani,Manzil Zaheer,Tian Li
类目: Machine Learning (cs.LG)
*备注: 8-page main body, 25-page appendix, 5 figures

点击查看摘要

Abstract:Heavy-tailed stochastic gradient noise, commonly observed in transformer models, can destabilize the optimization process. Recent works mainly focus on developing and understanding approaches to address heavy-tailed noise in the centralized or distributed, synchronous setting, leaving the interactions between such noise and asynchronous optimization underexplored. In this work, we investigate two communication schemes that handle stragglers with asynchronous updates in the presence of heavy-tailed gradient noise. We propose and theoretically analyze algorithmic modifications based on delay-aware learning rate scheduling and delay compensation to enhance the performance of asynchronous algorithms. Our convergence guarantees under heavy-tailed noise match the rate of the synchronous counterparts and improve delay tolerance compared with existing asynchronous approaches. Empirically, our approaches outperform prior synchronous and asynchronous methods in terms of accuracy/runtime trade-offs and are more robust to hyperparameters in both image and language tasks.

[LG-27] Whole-Brain Connectomic Graph Model Enables Whole-Body Locomotion Control in Fruit Fly

链接: https://arxiv.org/abs/2602.17997
作者: Zehao Jin,Yaoye Zhu,Chen Zhang,Yanan Sui
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Whole-brain biological neural networks naturally support the learning and control of whole-body movements. However, the use of brain connectomes as neural network controllers in embodied reinforcement learning remains unexplored. We investigate using the exact neural architecture of an adult fruit fly’s brain for the control of its body movement. We develop Fly-connectomic Graph Model (FlyGM), whose static structure is identical to the complete connectome of an adult Drosophila for whole-body locomotion control. To perform dynamical control, FlyGM represents the static connectome as a directed message-passing graph to impose a biologically grounded information flow from sensory inputs to motor outputs. Integrated with a biomechanical fruit fly model, our method achieves stable control across diverse locomotion tasks without task-specific architectural tuning. To verify the structural advantages of the connectome-based model, we compare it against a degree-preserving rewired graph, a random graph, and multilayer perceptrons, showing that FlyGM yields higher sample efficiency and superior performance. This work demonstrates that static brain connectomes can be transformed to instantiate effective neural policy for embodied learning of movement control.

[LG-28] Learning Without Training

链接: https://arxiv.org/abs/2602.17985
作者: Ryan O’Dowd
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: PhD Dissertation of Ryan O’Dowd, defended successfully at Claremont Graduate University on 1/28/2026

点击查看摘要

Abstract:Machine learning is at the heart of managing the real-world problems associated with massive data. With the success of neural networks on such large-scale problems, more research in machine learning is being conducted now than ever before. This dissertation focuses on three different projects rooted in mathematical theory for machine learning applications. The first project deals with supervised learning and manifold learning. In theory, one of the main problems in supervised learning is that of function approximation: that is, given some data set \mathcal{D} = \{(x_j, f(x_j))\}_{j=1}^{M}, can one build a model F \approx f? We introduce a method which aims to remedy several of the theoretical shortcomings of the current paradigm for supervised learning. The second project deals with transfer learning, which is the study of how an approximation process or model learned on one domain can be leveraged to improve the approximation on another domain. We study such liftings of functions when the data is assumed to be known only on a part of the whole domain. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related. The third project is concerned with the classification task in machine learning, particularly in the active learning paradigm. Classification has often been treated as an approximation problem as well, but we propose an alternative approach leveraging techniques originally introduced for signal separation problems. We introduce theory to unify signal separation with classification and a new algorithm which yields competitive accuracy to other recent active learning algorithms while providing results much faster.

[LG-29] Generating adversarial inputs for a graph neural network model of AC power flow

链接: https://arxiv.org/abs/2602.17975
作者: Robert Parker
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This work formulates and solves optimization problems to generate input points that yield high errors between a neural network’s predicted AC power flow solution and solutions to the AC power flow equations. We demonstrate this capability on an instance of the CANOS-PF graph neural network model, as implemented by the PF \Delta benchmark library, operating on a 14-bus test grid. Generated adversarial points yield errors as large as 3.4 per-unit in reactive power and 0.08 per-unit in voltage magnitude. When minimizing the perturbation from a training point necessary to satisfy adversarial constraints, we find that the constraints can be met with as little as a 0.04 per-unit perturbation in voltage magnitude on a single bus. This work motivates the development of rigorous verification and robust training methods for neural network surrogate models of AC power flow.
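The attack described above reduces to maximizing the discrepancy between a surrogate and the governing equations over the input space. A minimal sketch of that idea on a toy one-dimensional system, using finite-difference gradient ascent; the functions, step size, and optimizer here are illustrative stand-ins, not the paper’s CANOS-PF setup:

```python
import numpy as np

# Hypothetical stand-ins: the "true" physics and a slightly-off surrogate.
def true_model(x):
    return np.sin(x)

def surrogate(x):
    return np.sin(x) + 0.3 * np.sin(5.0 * x)  # error is largest where sin(5x) peaks

def adversarial_search(x0, lr=0.05, steps=200, eps=1e-5):
    """Gradient ascent on the squared surrogate error, via central finite differences."""
    err = lambda z: (surrogate(z) - true_model(z)) ** 2
    x = float(x0)
    for _ in range(steps):
        grad = (err(x + eps) - err(x - eps)) / (2 * eps)
        x += lr * grad  # ascend: we want to *maximize* the error
    return x

x_adv = adversarial_search(x0=0.1)
gap = abs(surrogate(x_adv) - true_model(x_adv))
print(f"adversarial input x={x_adv:.3f}, surrogate error={gap:.3f}")
```

For this toy error term the maximizer is x = π/10, where the surrogate is off by 0.3; the paper’s formulation additionally constrains the perturbation size, which is omitted here.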

[LG-30] Student Flow Modeling for School Decongestion via Stochastic Gravity Estimation and Constrained Spatial Allocation

链接: https://arxiv.org/abs/2602.17972
作者: Sebastian Felipe R. Bundoc,Paula Joy B. Martinez,Sebastian C. Ibañez,Erika Fille T. Legara
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:School congestion, where student enrollment exceeds school capacity, is a major challenge in low- and middle-income countries. It highly impacts learning outcomes and deepens inequities in education. While subsidy programs that transfer students from public to private schools offer a mechanism to alleviate congestion without capital-intensive construction, they often underperform due to fragmented data systems that hinder effective implementation. The Philippine Educational Service Contracting program, one of the world’s largest educational subsidy programs, exemplifies these challenges, falling short of its goal to decongest public schools. This prevents the science-based and data-driven analyses needed to understand what shapes student enrollment flows, particularly how families respond to economic incentives and spatial constraints. We introduce a computational framework for modeling student flow patterns and simulating policy scenarios. By synthesizing heterogeneous government data across nearly 3,000 institutions, we employ a stochastic gravity model estimated via negative binomial regression to derive behavioral elasticities for distance, net tuition cost, and socioeconomic determinants. These elasticities inform a doubly constrained spatial allocation mechanism that simulates student redistribution under varying subsidy amounts while respecting both origin candidate pools and destination slot capacities. We find that geographic proximity constrains school choice four times more strongly than tuition cost and that slot capacity, not subsidy amounts, is the binding constraint. Our work demonstrates that subsidy programs alone cannot resolve systemic overcrowding, and computational modeling can empower education policymakers to make equitable, data-driven decisions by revealing the structural constraints that shape effective resource allocation, even when resources are limited.
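The doubly constrained allocation described above can be sketched with classic iterative proportional fitting, which rescales a gravity-model flow matrix until both origin candidate pools and destination slot capacities are matched. All numbers below are hypothetical, and the paper’s negative-binomial elasticity estimation is not reproduced here:

```python
import numpy as np

def doubly_constrained_allocation(origins, capacities, cost, beta=1.0, iters=100):
    """Iteratively fit flows T[i,j] ~ exp(-beta*cost[i,j]) so that row sums match
    origin candidate pools and column sums match destination slot capacities."""
    K = np.exp(-beta * cost)  # deterrence matrix from a generalized cost (distance, tuition)
    T = K.copy()
    for _ in range(iters):
        T *= (origins / T.sum(axis=1))[:, None]      # match row sums to origin pools
        T *= (capacities / T.sum(axis=0))[None, :]   # match column sums to capacities
    return T

origins = np.array([100.0, 50.0])           # hypothetical candidate pools per public school
capacities = np.array([80.0, 70.0])         # hypothetical private-school slot capacities
cost = np.array([[1.0, 2.0], [2.0, 1.0]])   # hypothetical generalized cost matrix
T = doubly_constrained_allocation(origins, capacities, cost)
print(T.round(2))
```

Note that the totals must agree (here both sum to 150) for the fitting to converge to both margins, which mirrors the abstract’s point that slot capacity is the binding constraint.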

[LG-31] Improving Generalizability of Hip Fracture Risk Prediction via Domain Adaptation Across Multiple Cohorts

链接: https://arxiv.org/abs/2602.17962
作者: Shuo Sun,Meiling Zhou,Chen Zhao,Joyce H. Keyak,Nancy E. Lane,Jeffrey D. Deng,Kuan-Jui Su,Hui Shen,Hong-Wen Deng,Kui Zhang,Weihua Zhou
类目: Machine Learning (cs.LG)
*备注: 26 pages, 3 tables, 1 figure

点击查看摘要

Abstract:Clinical risk prediction models often fail to generalize across cohorts because underlying data distributions differ by clinical site, region, demographics, and measurement protocols. This limitation is particularly pronounced in hip fracture risk prediction, where the performance of models trained on one cohort (the source cohort) can degrade substantially when deployed in other cohorts (target cohorts). We used a shared set of clinical and DXA-derived features across three large cohorts, the Study of Osteoporotic Fractures (SOF), the Osteoporotic Fractures in Men Study (MrOS), and the UK Biobank (UKB), to systematically evaluate the performance of three domain adaptation methods, Maximum Mean Discrepancy (MMD), Correlation Alignment (CORAL), and Domain-Adversarial Neural Networks (DANN), and their combinations. For a source cohort with males only and a source cohort with females only, domain-adaptation methods consistently outperformed the no-adaptation baseline (source-only training), and the use of combinations of multiple domain adaptation methods delivered the largest and most stable gains. The method that combines MMD, CORAL, and DANN achieved the highest discrimination, with an area under the curve (AUC) of 0.88 for the male-only source cohort and 0.95 for the female-only source cohort, demonstrating that integrating multiple domain adaptation methods could produce feature representations that are less sensitive to dataset differences. Unlike existing methods that rely heavily on supervised tuning or assume known outcomes of samples in target cohorts, our outcome-free approaches enable model selection under realistic deployment conditions and improve generalization of models in hip fracture risk prediction.
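Of the three adaptation methods evaluated, CORAL is the simplest to illustrate: it whitens source features and re-colors them with the target covariance, so that second-order statistics match across cohorts. A minimal numpy sketch on synthetic data, not the paper’s DXA feature set:

```python
import numpy as np

def coral(source, target, eps=1e-5):
    """Correlation Alignment: re-color source features so their covariance
    (and mean) matches the target cohort's."""
    def cov_power(X, p):
        # (cov + eps*I)^p via eigendecomposition of the symmetric covariance
        C = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** p) @ vecs.T
    Xs = source - source.mean(axis=0)
    Xs = Xs @ cov_power(source, -0.5)  # whiten source features
    Xs = Xs @ cov_power(target, 0.5)   # re-color with target covariance
    return Xs + target.mean(axis=0)

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 3)) @ np.diag([1.0, 2.0, 0.5])  # hypothetical source cohort
tgt = rng.normal(size=(500, 3)) @ np.diag([0.5, 1.0, 2.0])  # hypothetical target cohort
aligned = coral(src, tgt)
print(np.cov(aligned, rowvar=False).round(2))  # now close to the target covariance
```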

[LG-32] Bayesian Online Model Selection

链接: https://arxiv.org/abs/2602.17958
作者: Aida Afshar,Yuke Zhang,Aldo Pacchiano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online model selection in Bayesian bandits raises a fundamental exploration challenge: When an environment instance is sampled from a prior distribution, how can we design an adaptive strategy that explores multiple bandit learners and competes with the best one in hindsight? We address this problem by introducing a new Bayesian algorithm for online model selection in stochastic bandits. We prove an oracle-style guarantee of O\left( d^* M \sqrt{T} + \sqrt{MT} \right) on the Bayesian regret, where M is the number of base learners, d^* is the regret coefficient of the optimal base learner, and T is the time horizon. We also validate our method empirically across a range of stochastic bandit settings, demonstrating performance that is competitive with the best base learner. Additionally, we study the effect of sharing data among base learners and its role in mitigating prior mis-specification.

[LG-33] Hardware-Friendly Input Expansion for Accelerating Function Approximation

链接: https://arxiv.org/abs/2602.17952
作者: Hu Lou,Yin-Jun Gao,Dong-Xiao Zhang,Tai-Jiao Du,Jun-Jie Zhang,Jia-Rui Zhang
类目: Machine Learning (cs.LG)
*备注: 22 pages, 4 figures

点击查看摘要

Abstract:One-dimensional function approximation is a fundamental problem in scientific computing and engineering applications. While neural networks possess powerful universal approximation capabilities, their optimization process is often hindered by flat loss landscapes induced by parameter-space symmetries, leading to slow convergence and poor generalization, particularly for high-frequency components. Inspired by the principle of *symmetry breaking* in physics, this paper proposes a hardware-friendly approach for function approximation through *input-space expansion*. The core idea involves augmenting the original one-dimensional input (e.g., x ) with constant values (e.g., \pi ) to form a higher-dimensional vector (e.g., [\pi, \pi, x, \pi, \pi] ), effectively breaking parameter symmetries without increasing the network’s parameter count. We evaluate the method on ten representative one-dimensional functions, including smooth, discontinuous, high-frequency, and non-differentiable functions. Experimental results demonstrate that input-space expansion significantly accelerates training convergence (reducing LBFGS iterations by 12% on average) and enhances approximation accuracy (reducing final MSE by 66.3% for the optimal 5D expansion). Ablation studies further reveal the effects of different expansion dimensions and constant selections, with \pi consistently outperforming other constants. Our work proposes a low-cost, efficient, and hardware-friendly technique for algorithm design.
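The proposed expansion is a one-line preprocessing step with no trainable parameters. A minimal sketch using the 5-dimensional [\pi, \pi, x, \pi, \pi] layout from the abstract; the dimension and padding constant are configurable:

```python
import numpy as np

def expand_input(x, dim=5, const=np.pi):
    """Embed scalar inputs into a dim-dimensional vector padded with a constant,
    e.g. x -> [pi, pi, x, pi, pi], placing the real input in the middle slot."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.full((x.shape[0], dim), const)
    out[:, dim // 2] = x
    return out

# Each row of the expanded batch feeds a network whose input layer is dim-wide.
print(expand_input([0.7, -1.2]))
```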

[LG-34] A Geometric Probe of the Accuracy-Robustness Trade-off: Sharp Boundaries in Symmetry-Breaking Dimensional Expansion

链接: https://arxiv.org/abs/2602.17948
作者: Yu Bai,Zhe Wang,Jiarui Zhang,Dong-Xiao Zhang,Yinjun Gao,Jun-Jie Zhang
类目: Machine Learning (cs.LG)
*备注: 22 pages, 3 figures

点击查看摘要

Abstract:The trade-off between clean accuracy and adversarial robustness is a pervasive phenomenon in deep learning, yet its geometric origin remains elusive. In this work, we utilize Symmetry-Breaking Dimensional Expansion (SBDE) as a controlled probe to investigate the mechanism underlying this trade-off. SBDE expands input images by inserting constant-valued pixels, which breaks translational symmetry and consistently improves clean accuracy (e.g., from 90.47% to 95.63% on CIFAR-10 with ResNet-18) by reducing parameter degeneracy. However, this accuracy gain comes at the cost of reduced robustness against iterative white-box attacks. By employing a test-time *mask projection* that resets the inserted auxiliary pixels to their training values, we demonstrate that the vulnerability stems almost entirely from the inserted dimensions. The projection effectively neutralizes the attacks and restores robustness, revealing that the model achieves high accuracy by creating *sharp boundaries* (steep loss gradients) specifically along the auxiliary axes. Our findings provide a concrete geometric explanation for the accuracy-robustness paradox: the optimization landscape deepens the basin of attraction to improve accuracy but inevitably erects steep walls along the auxiliary degrees of freedom, creating a fragile sensitivity to off-manifold perturbations.
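The test-time mask projection described above can be sketched as: pad images with constant-valued pixels at training time, then reset those auxiliary pixels to the training constant before inference, discarding any perturbation that lives only in the inserted dimensions. The pad width and constant below are illustrative choices, not the paper’s exact configuration:

```python
import numpy as np

def sbde_expand(images, pad=2, const=0.5):
    """Symmetry-breaking expansion: surround each image with constant border pixels."""
    return np.pad(images, ((0, 0), (pad, pad), (pad, pad)), constant_values=const)

def mask_project(expanded, pad=2, const=0.5):
    """Test-time projection: reset the auxiliary (inserted) pixels to their
    training value, neutralizing attacks confined to the inserted dimensions."""
    out = expanded.copy()
    mask = np.ones(out.shape[1:], dtype=bool)
    mask[pad:-pad, pad:-pad] = False  # True only on the auxiliary border
    out[:, mask] = const
    return out

imgs = np.random.rand(4, 8, 8)
x = sbde_expand(imgs)                               # shape (4, 12, 12)
x_attacked = x + 0.1 * np.random.randn(*x.shape)    # adversarial noise everywhere
x_proj = mask_project(x_attacked)
# Border perturbations are removed; only the original 8x8 region remains perturbed.
```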

[LG-35] Understanding the Generalization of Bilevel Programming in Hyperparameter Optimization: A Tale of Bias-Variance Decomposition

链接: https://arxiv.org/abs/2602.17947
作者: Yubo Zhou,Jun Shu,Junmin Liu,Deyu Meng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gradient-based hyperparameter optimization (HPO) has emerged recently, leveraging bilevel programming techniques to optimize hyperparameters by estimating the hypergradient w.r.t. the validation loss. Nevertheless, previous theoretical works mainly focus on reducing the gap between the estimation and the ground truth (i.e., the bias), while ignoring the error due to data distribution (i.e., the variance), which degrades performance. To address this issue, we conduct a bias-variance decomposition for the hypergradient estimation error and provide a supplemental detailed analysis of the variance term ignored by previous works. We also present a comprehensive analysis of the error bounds for hypergradient estimation. This facilitates an easy explanation of some phenomena commonly observed in practice, like overfitting to the validation set. Inspired by the derived theories, we propose an ensemble hypergradient strategy to reduce the variance in HPO algorithms effectively. Experimental results on tasks including regularization hyperparameter learning, data hyper-cleaning, and few-shot learning demonstrate that our variance reduction strategy improves hypergradient estimation. To explain the improved performance, we establish a connection between excess error and hypergradient estimation, offering some understanding of empirical observations.

[LG-36] Tighter Regret Lower Bound for Gaussian Process Bandits with Squared Exponential Kernel in Hypersphere

链接: https://arxiv.org/abs/2602.17940
作者: Shogo Iwazaki
类目: Machine Learning (cs.LG)
*备注: 27 pages, 2 figures

点击查看摘要

Abstract:We study an algorithm-independent, worst-case lower bound for the Gaussian process (GP) bandit problem in the frequentist setting, where the reward function is fixed and has a bounded norm in the known reproducing kernel Hilbert space (RKHS). Specifically, we focus on the squared exponential (SE) kernel, one of the most widely used kernel functions in GP bandits. One of the remaining open questions for this problem is the gap in the *dimension-dependent* logarithmic factors between upper and lower bounds. This paper partially resolves this open question under a hyperspherical input domain. We show that any algorithm suffers \Omega(\sqrt{T} (\ln T)^{d} (\ln \ln T)^{-d}) cumulative regret, where T and d represent the total number of steps and the dimension of the hyperspherical domain, respectively. Regarding the simple regret, we show that any algorithm requires \Omega(\epsilon^{-2} (\ln \frac{1}{\epsilon})^{d} (\ln \ln \frac{1}{\epsilon})^{-d}) time steps to find an \epsilon -optimal point. We also provide the improved O((\ln T)^{d+1} (\ln \ln T)^{-d}) upper bound on the maximum information gain for the SE kernel. Our results guarantee the optimality of the existing best algorithm up to *dimension-independent* logarithmic factors under a hyperspherical input domain.

[LG-37] Latent Diffeomorphic Co-Design of End-Effectors for Deformable and Fragile Object Manipulation

链接: https://arxiv.org/abs/2602.17921
作者: Kei Ikemura,Yifei Dong,Florian T. Pokorny
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Manipulating deformable and fragile objects remains a fundamental challenge in robotics due to complex contact dynamics and strict requirements on object integrity. Existing approaches typically optimize either end-effector design or control strategies in isolation, limiting achievable performance. In this work, we present the first co-design framework that jointly optimizes end-effector morphology and manipulation control for deformable and fragile object manipulation. We introduce (1) a latent diffeomorphic shape parameterization enabling expressive yet tractable end-effector geometry optimization, (2) a stress-aware bi-level co-design pipeline coupling morphology and control optimization, and (3) a privileged-to-pointcloud policy distillation scheme for zero-shot real-world deployment. We evaluate our approach on challenging food manipulation tasks, including grasping and pushing jelly and scooping fillets. Simulation and real-world experiments demonstrate the effectiveness of the proposed method.

[LG-38] Distribution-Free Sequential Prediction with Abstentions COLT2026

链接: https://arxiv.org/abs/2602.17918
作者: Jialin Yu,Moïse Blanchard
类目: Machine Learning (cs.LG)
*备注: 38 pages, 2 figures. Submitted to COLT 2026. Extended version

点击查看摘要

Abstract:We study a sequential prediction problem in which an adversary is allowed to inject arbitrarily many adversarial instances in a stream of i.i.d. instances, but at each round, the learner may also *abstain* from making a prediction without incurring any penalty if the instance was indeed corrupted. This semi-adversarial setting naturally sits between the classical stochastic case with i.i.d. instances, for which function classes with finite VC dimension are learnable; and the adversarial case with arbitrary instances, known to be significantly more restrictive. For this problem, Goel et al. (2023) showed that, if the learner knows the distribution \mu of clean samples in advance, learning can be achieved for all VC classes without restrictions on adversary corruptions. This is, however, a strong assumption in both theory and practice: a natural question is whether similar learning guarantees can be achieved without prior distributional knowledge, as is standard in classical learning frameworks (e.g., PAC learning or asymptotic consistency) and other non-i.i.d. models (e.g., smoothed online learning). We therefore focus on the distribution-free setting where \mu is *unknown* and propose an algorithm AbstainBoost based on a boosting procedure of weak learners, which guarantees sublinear error for general VC classes in *distribution-free* abstention learning for oblivious adversaries. These algorithms also enjoy similar guarantees for adaptive adversaries, for structured function classes including linear classifiers. These results are complemented with corresponding lower bounds, which reveal an interesting polynomial trade-off between misclassification error and number of erroneous abstentions.

[LG-39] Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors ICLR2026

链接: https://arxiv.org/abs/2602.17898
作者: Jingquan Yan,Yuwei Miao,Peiran Yu,Junzhou Huang
类目: Machine Learning (cs.LG)
*备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Attention-based regression models are often trained by jointly optimizing Mean Squared Error (MSE) loss and Pearson correlation coefficient (PCC) loss, emphasizing the magnitude of errors and the order or shape of targets, respectively. A common but poorly understood phenomenon during training is the PCC plateau: PCC stops improving early in training, even as MSE continues to decrease. We provide the first rigorous theoretical analysis of this behavior, revealing fundamental limitations in both optimization dynamics and model capacity. First, in regard to the flattened PCC curve, we uncover a critical conflict where lowering MSE (magnitude matching) can paradoxically suppress the PCC gradient (shape matching). This issue is exacerbated by the softmax attention mechanism, particularly when the data to be aggregated is highly homogeneous. Second, we identify a limitation in the model capacity: we derived a PCC improvement limit for any convex aggregator (including the softmax attention), showing that the convex hull of the inputs strictly bounds the achievable PCC gain. We demonstrate that data homogeneity intensifies both limitations. Motivated by these insights, we propose the Extrapolative Correlation Attention (ECA), which incorporates novel, theoretically-motivated mechanisms to improve the PCC optimization and extrapolate beyond the convex hull. Across diverse benchmarks, including challenging homogeneous data setting, ECA consistently breaks the PCC plateau, achieving significant improvements in correlation without compromising MSE performance.

[LG-40] COMBA: Cross Batch Aggregation for Learning Large Graphs with Context Gating State Space Models

链接: https://arxiv.org/abs/2602.17893
作者: Jiajun Shen,Yufei Jin,Yi He,Xingquan Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:State space models (SSMs) have recently emerged for modeling long-range dependency in sequence data, with much lower computational costs than modern alternatives, such as transformers. Advancing SSMs to graph-structured data, especially for large graphs, is a significant challenge because SSMs are sequence models and the sheer graph volumes make it very expensive to convert graphs into sequences for effective learning. In this paper, we propose COMBA to tackle large graph learning using state space models, with two key innovations: graph context gating and cross-batch aggregation. Graph context refers to different hops of neighborhood for each node, and graph context gating allows COMBA to use such context to learn the best control of neighbor aggregation. For each graph context, COMBA samples nodes as batches and trains a graph neural network (GNN), with information being aggregated across batches, allowing COMBA to scale to large graphs. Our theoretical study asserts that cross-batch aggregation guarantees lower error than training a GNN without aggregation. Experiments on benchmark networks demonstrate significant performance gains compared to baseline approaches. Code and benchmark datasets will be released for public access.

[LG-41] JAX-Privacy: A library for differentially private machine learning

链接: https://arxiv.org/abs/2602.17861
作者: Ryan McKenna,Galen Andrew,Borja Balle,Vadym Doroshenko,Arun Ganesh,Weiwei Kong,Alex Kurakin,Brendan McMahan,Mikhail Pravilov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:JAX-Privacy is a library designed to simplify the deployment of robust and performant mechanisms for differentially private machine learning. Guided by design principles of usability, flexibility, and efficiency, JAX-Privacy serves both researchers requiring deep customization and practitioners who want a more out-of-the-box experience. The library provides verified, modular primitives for the critical components of mechanism design, including batch selection, gradient clipping, noise addition, accounting, and auditing, and brings together a large body of recent research on differentially private ML.
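The gradient clipping and noise addition primitives mentioned above are the core of DP-SGD. A minimal numpy sketch of that clip-and-noise step, shown for illustration only; this is not the JAX-Privacy API, and the clip norm and noise multiplier below are arbitrary example values:

```python
import numpy as np

def dp_mean_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Core DP-SGD step: clip each per-example gradient to L2 norm <= clip_norm,
    average, then add Gaussian noise calibrated to the clipped sensitivity."""
    rng = rng or np.random.default_rng(0)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * factors           # each row now has norm <= clip_norm
    n = per_example_grads.shape[0]
    noise = rng.normal(0.0, noise_multiplier * clip_norm / n, size=clipped.shape[1])
    return clipped.mean(axis=0) + noise

g = np.random.default_rng(1).normal(size=(32, 10)) * 3.0  # hypothetical per-example grads
priv = dp_mean_gradient(g)
```

The privacy guarantee itself comes from accounting over many such steps, which the library handles separately; this sketch covers only the per-step mechanism.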

[LG-42] Dual Length Codes for Lossless Compression of BFloat16

链接: https://arxiv.org/abs/2602.17849
作者: Aditya Agrawal,Albert Magyar,Hiteshwar Eswaraiah,Patrick Sheridan,Pradeep Janedula,Ravi Krishnan Venkatesan,Krishna Nair,Ravi Iyer
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:Training and serving Large Language Models (LLMs) rely heavily on parallelization and collective operations, which are frequently bottlenecked by network bandwidth. Lossless compression using, e.g., Huffman codes can alleviate the issue; however, Huffman codes suffer from slow, bit-sequential decoding and high hardware complexity due to deep tree traversals. Universal codes, e.g., Exponential-Golomb codes, are faster to decode but do not exploit the symbol frequency distributions. To address these limitations, this paper introduces Dual Length Codes, a hybrid approach designed to balance compression efficiency with decoding speed. Analyzing BFloat16 tensors from the Gemma model, we observed that the top 8 most frequent symbols account for approximately 50% of the cumulative probability. These 8 symbols are assigned a short 4 bit code. The remaining 248 symbols are assigned a longer 9 bit code. The coding scheme uses a single prefix bit to distinguish between the two code lengths. The scheme uses a small Look Up Table with only 8 entries for encoding and decoding. The scheme achieves a compressibility of 18.6% in comparison to 21.3% achieved by Huffman codes, but it significantly speeds up the decoding and simplifies the hardware complexity.
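The scheme is concrete enough to sketch end to end: per the abstract, the 8 most frequent byte values get a '0' prefix plus a 3-bit table index (4 bits total), and every other value gets a '1' prefix plus its raw 8 bits (9 bits total). The sample data below is hypothetical:

```python
from collections import Counter

def build_codec(symbols):
    """Build the 8-entry lookup table of the most frequent byte values."""
    top8 = [s for s, _ in Counter(symbols).most_common(8)]
    enc = {s: '0' + format(i, '03b') for i, s in enumerate(top8)}
    return top8, enc

def encode(symbols, enc):
    # In-table symbols use the 4-bit code; all others use '1' + raw 8 bits.
    return ''.join(enc.get(s, '1' + format(s, '08b')) for s in symbols)

def decode(bits, top8):
    out, i = [], 0
    while i < len(bits):
        if bits[i] == '0':                       # 4-bit code: index into the table
            out.append(top8[int(bits[i+1:i+4], 2)]); i += 4
        else:                                    # 9-bit code: raw byte follows
            out.append(int(bits[i+1:i+9], 2)); i += 9
    return out

data = [7, 7, 7, 200, 7, 13, 7]  # hypothetical bytes from a BFloat16 tensor
top8, enc = build_codec(data)
bits = encode(data, enc)
assert decode(bits, top8) == data
print(len(bits), "bits vs", 8 * len(data), "uncompressed")
```

Because every code is either exactly 4 or exactly 9 bits, a decoder only needs to inspect the prefix bit, which is what makes the hardware shallower than a Huffman tree walk.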

[LG-43] Two Calm Ends and the Wild Middle: A Geometric Picture of Memorization in Diffusion Models

链接: https://arxiv.org/abs/2602.17846
作者: Nick Dodson,Xinyu Gao,Qingsong Wang,Yusu Wang,Zhengchao Wan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models generate high-quality samples but can also memorize training data, raising serious privacy concerns. Understanding the mechanisms governing when memorization versus generalization occurs remains an active area of research. In particular, it is unclear where along the noise schedule memorization is induced, how data geometry influences it, and how phenomena at different noise scales interact. We introduce a geometric framework that partitions the noise schedule into three regimes based on the coverage properties of training data by Gaussian shells and the concentration behavior of the posterior, which we argue are two fundamental objects governing memorization and generalization in diffusion models. This perspective reveals that memorization risk is highly non-uniform across noise levels. We further identify a danger zone at medium noise levels where memorization is most pronounced. In contrast, both the small and large noise regimes resist memorization, but through fundamentally different mechanisms: small noise avoids memorization due to limited training coverage, while large noise exhibits low posterior concentration and admits a provably near linear Gaussian denoising behavior. For the medium noise regime, we identify geometric conditions through which we propose a geometry-informed targeted intervention that mitigates memorization.

[LG-44] Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-tuning

链接: https://arxiv.org/abs/2602.17835
作者: Sirui Chen,Yunzhe Qi,Mengting Ai,Yifan Sun,Ruizhong Qiu,Jiaru Zou,Jingrui He
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) relies critically on selecting training data that most benefits a model’s downstream performance. Gradient-based data selection methods such as TracIn and Influence Functions leverage influence to identify useful samples, but their computational cost scales poorly, making them impractical for multi-billion-parameter large language models (LLMs). A common alternative is to use off-the-shelf smaller models as proxies, but they remain suboptimal since their learning dynamics are unclear, their sizes cannot be flexibly adjusted, and they cannot be further aligned with the target model in terms of gradient-based influence estimation. To address these challenges, we introduce Iprox, a two-stage framework that derives influence-preserving proxies directly from the target model. It first applies a low-rank compression stage to preserve influence information of the target model, and then an aligning stage to align both model gradients and logits, thereby constructing proxies that flexibly control computational cost while retaining the target model’s influence. Experimental results across diverse LLM families and evaluation tasks show that Iprox consistently outperforms off-the-shelf proxies and baseline methods. On Qwen3-4B, a 1.5B proxy constructed with Iprox achieves stronger performance than the larger 1.7B off-the-shelf proxy. Notably, on Llama3.2, Iprox achieves better performance than baselines while reducing computational cost by more than half relative to the full 3B model. These results show that Iprox provides effective influence-preserving proxies, making gradient-based data selection more scalable for LLMs.

[LG-45] MePoly: Max Entropy Polynomial Policy Optimization

链接: https://arxiv.org/abs/2602.17832
作者: Hang Liu,Sangli Teng,Maani Ghaffari
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Stochastic Optimal Control provides a unified mathematical framework for solving complex decision-making problems, encompassing paradigms such as maximum entropy reinforcement learning (RL) and imitation learning (IL). However, conventional parametric policies often struggle to represent the multi-modality of the solutions. Though diffusion-based policies are aimed at recovering the multi-modality, they lack an explicit probability density, which complicates policy-gradient optimization. To bridge this gap, we propose MePoly, a novel policy parameterization based on polynomial energy-based models. MePoly provides an explicit, tractable probability density, enabling exact entropy maximization. Theoretically, we ground our method in the classical moment problem, leveraging the universal approximation capabilities for arbitrary distributions. Empirically, we demonstrate that MePoly effectively captures complex non-convex manifolds and outperforms baselines in performance across diverse benchmarks.

[LG-46] Causality by Abstraction: Symbolic Rule Learning in Multivariate Timeseries with Large Language Models

链接: https://arxiv.org/abs/2602.17829
作者: Preetom Biswas,Giulia Pedrielli,K. Selçuk Candan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inferring causal relations in timeseries data with delayed effects is a fundamental challenge, especially when the underlying system exhibits complex dynamics that cannot be captured by simple functional mappings. Traditional approaches often fail to produce generalized and interpretable explanations, as multiple distinct input trajectories may yield nearly indistinguishable outputs. In this work, we present ruleXplain, a framework that leverages Large Language Models (LLMs) to extract formal explanations for input-output relations in simulation-driven dynamical systems. Our method introduces a constrained symbolic rule language with temporal operators and delay semantics, enabling LLMs to generate verifiable causal rules through structured prompting. ruleXplain relies on the availability of a principled model (e.g., a simulator) that maps multivariate input time series to output time series. Within ruleXplain, the simulator is used to generate diverse counterfactual input trajectories that yield similar target output, serving as candidate explanations. Such counterfactual inputs are clustered and provided as context to the LLM, which is tasked with the generation of symbolic rules encoding the joint temporal trends responsible for the patterns observable in the output time series. A closed-loop refinement process ensures rule consistency and semantic validity. We validate the framework using the PySIRTEM epidemic simulator, mapping testing rate inputs to daily infection counts; and the EnergyPlus building energy simulator, mapping temperature and solar irradiance inputs to electricity needs. For validation, we perform three classes of experiments: (1) the efficacy of the ruleset through input reconstruction; (2) ablation studies evaluating the causal encoding of the ruleset; and (3) generalization tests of the extracted rules across unseen output trends with varying phase dynamics.

[LG-47] Avoid What You Know: Divergent Trajectory Balance for GFlowNets

链接: https://arxiv.org/abs/2602.17827
作者: Pedro Dall’Antonia,Tiago da Silva,Daniel Csillag,Salem Lahlou,Diego Mesquita
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 pages, under review

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets) are a flexible family of amortized samplers trained to generate discrete and compositional objects with probability proportional to a reward function. However, learning efficiency is constrained by the model’s ability to rapidly explore diverse high-probability regions during training. To mitigate this issue, recent works have focused on incentivizing the exploration of unvisited and valuable states via curiosity-driven search and self-supervised random network distillation, which tend to waste samples on already well-approximated regions of the state space. In this context, we propose Adaptive Complementary Exploration (ACE), a principled algorithm for the effective exploration of novel and high-probability regions when learning GFlowNets. To achieve this, ACE introduces an exploration GFlowNet explicitly trained to search for high-reward states in regions underexplored by the canonical GFlowNet, which learns to sample from the target distribution. Through extensive experiments, we show that ACE significantly improves upon prior work in terms of approximation accuracy to the target distribution and discovery rate of diverse high-reward states.

[LG-48] Calibrated Adaptation: Bayesian Stiefel Manifold Priors for Reliable Parameter-Efficient Fine-Tuning

链接: https://arxiv.org/abs/2602.17809
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning methods such as LoRA enable practical adaptation of large language models but provide no principled uncertainty estimates, leading to poorly calibrated predictions and unreliable behavior under domain shift. We introduce Stiefel-Bayes Adapters (SBA), a Bayesian PEFT framework that places a Matrix Langevin prior over orthonormal adapter factors on the Stiefel manifold and performs approximate posterior inference via tangent space Laplace approximation with geodesic retraction. Unlike Gaussian priors in flat space projected onto orthogonality constraints, our prior on the manifold naturally encodes the inductive bias that adapter subspaces should be well conditioned and orthogonal, while the posterior provides calibrated predictive uncertainty without recalibration. We prove formally that the tangent space approximation strictly avoids the structural variance inflation inherent in projecting from ambient space, establishing a rigorous theoretical advantage for intrinsic manifold inference. Across GLUE and SuperGLUE benchmarks on RoBERTa-large, LLaMA-2-7B, LLaMA-2-13B, Mistral-7B, and Qwen2.5-7B, domain shift evaluations, selective prediction protocols, and an abstractive summarization task, SBA achieves task performance comparable to LoRA and DoRA while reducing Expected Calibration Error by 18 to 34% over deterministic baselines, improving selective prediction AUROC by 12 to 25% under domain shift, and outperforming deep ensembles of five LoRA models on OOD detection at a fraction of the parameter cost. Our results demonstrate that where you place uncertainty, on the right geometric structure, matters more than simply adding any Bayesian treatment to adapters.

[LG-49] Grassmannian Mixture-of-Experts: Concentration-Controlled Routing on Subspace Manifolds

链接: https://arxiv.org/abs/2602.17798
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts models rely on learned routers to assign tokens to experts, yet standard softmax gating provides no principled mechanism to control the tradeoff between sparsity and utilization. We propose Grassmannian MoE (GrMoE), a routing framework that operates on the Grassmannian manifold of subspaces, where gating weights arise from the concentration parameters of Matrix Bingham distributions. This construction yields a single, interpretable knob – the concentration matrix \Lambda – that continuously controls routing entropy, replacing discrete top- k selection with a smooth, geometrically principled sparsity mechanism. We further develop an amortized variational inference procedure for posterior routing distributions, enabling uncertainty-aware expert assignment that naturally resists expert collapse. We formally prove tight bounds relating the Bingham concentration spectrum to routing entropy, expected top- k mass, and an exponential bound on expert collapse, establishing the first formal theory of concentration-controlled sparsity. On synthetic routing tasks, a 350M-parameter MoE language model with 8 experts, a 1.3B-parameter model with 16 experts, and a 2.7B-parameter model with 32 experts, GrMoE achieves 0% routing collapse across all seeds, comparable or better perplexity with 15–30% improved load balance, and a smooth monotonic relationship between concentration and effective sparsity that enables post-hoc sparsity tuning without retraining. Token-level analysis reveals that experts learn heterogeneous concentration values that correlate with linguistic specialization, providing interpretable routing behavior.
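
As rough intuition for "concentration-controlled sparsity", the sketch below shows a single scalar knob continuously trading off routing entropy. Note this is a plain temperature-scaled softmax used as a stand-in; the paper's actual construction uses Matrix Bingham concentration matrices on the Grassmannian:

```python
import numpy as np

def routing_weights(scores, lam):
    """Gating weights with one scalar concentration knob `lam`
    (a hypothetical stand-in for the paper's concentration matrix Lambda)."""
    z = lam * scores
    z = z - z.max()            # numerical stability
    w = np.exp(z)
    return w / w.sum()

def entropy(w):
    w = w[w > 0]
    return float(-(w * np.log(w)).sum())

scores = np.array([2.0, 1.0, 0.5, 0.1])   # toy expert affinities for one token
ents = [entropy(routing_weights(scores, lam)) for lam in (0.1, 1.0, 10.0)]
# higher concentration => lower routing entropy (smooth sparsity control)
assert ents[0] > ents[1] > ents[2]
```

The point mirrored here is that sparsity becomes a continuous, post-hoc tunable quantity rather than a discrete top-k choice.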

[LG-50] Market Games for Generative Models: Equilibria, Welfare, and Strategic Entry ICLR2026

链接: https://arxiv.org/abs/2602.17787
作者: Xiukun Wei,Min Shi,Xueru Zhang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2026

点击查看摘要

Abstract:Generative model ecosystems increasingly operate as competitive multi-platform markets, where platforms strategically select models from a shared pool and users with heterogeneous preferences choose among them. Understanding how platforms interact, when market equilibria exist, how outcomes are shaped by model-providers, platforms, and user behavior, and how social welfare is affected is critical for fostering a beneficial market environment. In this paper, we formalize a three-layer model-platform-user market game and identify conditions for the existence of pure Nash equilibrium. Our analysis shows that market structure, whether platforms converge on similar models or differentiate by selecting distinct ones, depends not only on models’ global average performance but also on their localized attraction to user groups. We further examine welfare outcomes and show that expanding the model pool does not necessarily increase user welfare or market diversity. Finally, we design novel best-response training schemes that allow model providers to strategically introduce new models into competitive markets.

[LG-51] Multi-material Multi-physics Topology Optimization with Physics-informed Gaussian Process Priors

链接: https://arxiv.org/abs/2602.17783
作者: Xiangyu Sun,Shirin Hosseinmardi,Amin Yousefpour,Ramin Bostanabad
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) has been increasingly used for topology optimization (TO). However, most existing ML-based approaches focus on simplified benchmark problems due to their high computational cost, spectral bias, and difficulty in handling complex physics. These limitations become more pronounced in multi-material, multi-physics problems whose objective or constraint functions are not self-adjoint. To address these challenges, we propose a framework based on physics-informed Gaussian processes (PIGPs). In our approach, the primary, adjoint, and design variables are represented by independent GP priors whose mean functions are parametrized via neural networks whose architectures are particularly beneficial for surrogate modeling of PDE solutions. We estimate all parameters of our model simultaneously by minimizing a loss that is based on the objective function, multi-physics potential energy functionals, and design-constraints. We demonstrate the capability of the proposed framework on benchmark TO problems such as compliance minimization, heat conduction optimization, and compliant mechanism design under single- and multi-material settings. Additionally, we leverage thermo-mechanical TO with single- and multi-material options as a representative multi-physics problem. We also introduce differentiation and integration schemes that dramatically accelerate the training process. Our results demonstrate that the proposed PIGP framework can effectively solve coupled multi-physics and design problems simultaneously – generating super-resolution topologies with sharp interfaces and physically interpretable material distributions. We validate these results using open-source codes and the commercial software package COMSOL.

[LG-52] Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs

链接: https://arxiv.org/abs/2602.17778
作者: Zachary Coalson,Bo Fang,Sanghyun Hong
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Pre-print

点击查看摘要

Abstract:Multi-turn interaction length is a dominant factor in the operational costs of conversational LLMs. In this work, we present a new failure mode in conversational LLMs: turn amplification, in which a model consistently prolongs multi-turn interactions without completing the underlying task. We show that an adversary can systematically exploit clarification-seeking behavior - commonly encouraged in multi-turn conversation settings - to scalably prolong interactions. Moving beyond prompt-level behaviors, we take a mechanistic perspective and identify a query-independent, universal activation subspace associated with clarification-seeking responses. Unlike prior cost-amplification attacks that rely on per-turn prompt optimization, our attack arises from conversational dynamics and persists across prompts and tasks. We show that this mechanism provides a scalable pathway to induce turn amplification: both supply-chain attacks via fine-tuning and runtime attacks through low-level parameter corruptions consistently shift models toward abstract, clarification-seeking behavior across prompts. Across multiple instruction-tuned LLMs and benchmarks, our attack substantially increases turn count while remaining compliant. We also show that existing defenses offer limited protection against this emerging class of failures.

[LG-53] Solving and learning advective multiscale Darcian dynamics with the Neural Basis Method

链接: https://arxiv.org/abs/2602.17776
作者: Yuhe Wang,Min Wang
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-governed models are increasingly paired with machine learning for accelerated predictions, yet most “physics–informed” formulations treat the governing equations as a penalty loss whose scale and meaning are set by heuristic balancing. This blurs operator structure, thereby confounding solution approximation error with governing-equation enforcement error and making the solving and learning progress hard to interpret and control. Here we introduce the Neural Basis Method, a projection-based formulation that couples a predefined, physics-conforming neural basis space with an operator-induced residual metric to obtain a well-conditioned deterministic minimization. Stability and reliability then hinge on this metric: the residual is not merely an optimization objective but a computable certificate tied to approximation and enforcement, remaining stable under basis enrichment and yielding reduced coordinates that are learnable across parametric instances. We use advective multiscale Darcian dynamics as a concrete demonstration of this broader point. Our method produces accurate and robust solutions in single solves and enables fast and effective parametric inference with operator learning.

[LG-54] Provable Adversarial Robustness in In-Context Learning

链接: https://arxiv.org/abs/2602.17743
作者: Di Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 pages

点击查看摘要

Abstract:Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption overlooks adversarial distribution shifts that threaten real-world reliability. To address this gap, we introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts. Focusing on linear self-attention Transformers, we derive a non-asymptotic bound linking adversarial perturbation strength ( \rho ), model capacity ( m ), and the number of in-context examples ( N ). The analysis reveals that model robustness scales with the square root of its capacity ( \rho_\textmax \propto \sqrtm ), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ( N_\rho - N_0 \propto \rho^2 ). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL’s limits under adversarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness.

[LG-55] Parallel Complex Diffusion for Scalable Time Series Generation

链接: https://arxiv.org/abs/2602.17706
作者: Rongyao Cai,Yuxi Wan,Kexin Zhang,Ming Jin,Zhiqiang Ge,Qingsong Wen,Yong Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modeling long-range dependencies in time series generation poses a fundamental trade-off between representational capacity and computational efficiency. Traditional temporal diffusion models suffer from local entanglement and the \mathcal{O}(L^2) cost of attention mechanisms. We address these limitations by introducing PaCoDi (Parallel Complex Diffusion), a spectral-native architecture that decouples generative modeling in the frequency domain. PaCoDi fundamentally alters the problem topology: the Fourier Transform acts as a diagonalizing operator, converting locally coupled temporal signals into globally decorrelated spectral components. Theoretically, we prove the Quadrature Forward Diffusion and Conditional Reverse Factorization theorem, demonstrating that the complex diffusion process can be split into independent real and imaginary branches. We bridge the gap between this decoupled theory and data reality using a Mean Field Theory (MFT) approximation reinforced by an interactive correction mechanism. Furthermore, we generalize this discrete DDPM to continuous-time Frequency SDEs, rigorously deriving the Spectral Wiener Process that describes the differential spectral Brownian motion limit. Crucially, PaCoDi exploits the Hermitian Symmetry of real-valued signals to compress the sequence length by half, achieving a 50% reduction in attention FLOPs without information loss. We further derive a rigorous Heteroscedastic Loss to handle the non-isotropic noise distribution on the compressed manifold. Extensive experiments show that PaCoDi outperforms existing baselines in both generation quality and inference speed, offering a theoretically grounded and computationally efficient solution for time series modeling.
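
The claimed 50% compression follows from a general fact: the spectrum of a real-valued signal is Hermitian-symmetric, so the one-sided FFT keeps roughly half the coefficients with no information loss. A quick NumPy check of that fact (independent of PaCoDi itself):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(128)          # a real-valued time series

spec = np.fft.rfft(x)                 # one-sided spectrum via Hermitian symmetry
assert spec.shape[0] == 128 // 2 + 1  # ~half the original length

x_rec = np.fft.irfft(spec, n=128)     # exact (lossless) reconstruction
assert np.allclose(x, x_rec)
```

Any sequence model operating on `spec` instead of `x` therefore sees roughly half the tokens, which is where the quadratic-attention savings come from.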

[LG-56] Certified Learning under Distribution Shift: Sound Verification and Identifiable Structure

链接: https://arxiv.org/abs/2602.17699
作者: Chandrasekhar Gokavarapu,Sudhakar Gadde,Y. Rajasekhar,S. R. Bhargava (Mathematics, Government College (Autonomous), Rajahmundry, Andhra Pradesh, India)
类目: Machine Learning (cs.LG); Rings and Algebras (math.RA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Proposition. Let f be a predictor trained on a distribution P and evaluated on a shifted distribution Q . Under verifiable regularity and complexity constraints, the excess risk under shift admits an explicit upper bound determined by a computable shift metric and model parameters. We develop a unified framework in which (i) risk under distribution shift is certified by explicit inequalities, (ii) verification of learned models is sound for nontrivial sizes, and (iii) interpretability is enforced through identifiability conditions rather than post hoc explanations. All claims are stated with explicit assumptions. Failure modes are isolated. Non-certifiable regimes are characterized.

[LG-57] Pimp My LLM: Leveraging Variability Modeling to Tune Inference Hyperparameters

链接: https://arxiv.org/abs/2602.17697
作者: Nada Zine,Clément Quinton,Romain Rouvoy
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are being increasingly used across a wide range of tasks. However, their substantial computational demands raise concerns about the energy efficiency and sustainability of both training and inference. Inference, in particular, dominates total compute usage, making its optimization crucial. Recent research has explored optimization techniques and analyzed how configuration choices influence energy consumption. Yet, the vast configuration space of inference servers makes exhaustive empirical evaluation infeasible due to combinatorial explosion. In this paper, we introduce a new perspective on this problem by treating LLMs as configurable systems and applying variability management techniques to systematically analyze inference-time configuration choices. We evaluate our approach on the Hugging Face Transformers library by representing generation hyperparameters and their constraints using a feature-based variability model, sampling representative configurations, measuring their energy consumption, latency, accuracy, and learning predictive models from the collected data. Our results show that variability modeling effectively manages the complexity of LLM inference configurations. It enables systematic analysis of hyperparameters effects and interactions, reveals trade-offs, and supports accurate prediction of inference behavior from a limited number of measurements. Overall, this work opens a new research direction that bridges software engineering and machine learning by leveraging variability modeling for the efficient and sustainable configuration of LLMs.
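
A feature model over generation hyperparameters can be sketched as plain enumeration plus cross-tree constraints. The features and the constraint below are hypothetical stand-ins for illustration, not the paper's actual model of the Hugging Face generation API:

```python
from itertools import product

# Hypothetical feature model over generation hyperparameters, with one
# cross-tree constraint: temperature only takes effect when do_sample=True.
features = {
    "do_sample": [False, True],
    "temperature": [0.7, 1.0],
    "num_beams": [1, 4],
}

def valid(cfg):
    # constraint: a non-default temperature requires sampling to be enabled
    return cfg["do_sample"] or cfg["temperature"] == 1.0

configs = [dict(zip(features, v)) for v in product(*features.values())]
configs = [c for c in configs if valid(c)]
assert len(configs) == 6   # 8 raw combinations, 2 pruned by the constraint
```

Sampling from the valid region of such a model (rather than the raw cross product) is what keeps the measurement campaign tractable.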

[LG-58] AnCoder: Anchored Code Generation via Discrete Diffusion Models

链接: https://arxiv.org/abs/2602.17688
作者: Anton Xue,Litu Rout,Constantine Caramanis,Sanjay Shakkottai
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Diffusion language models offer a compelling alternative to autoregressive code generation, enabling global planning and iterative refinement of complex program logic. However, existing approaches fail to respect the rigid structure of programming languages and, as a result, often produce broken programs that fail to execute. To address this, we introduce AnchorTree, a framework that explicitly anchors the diffusion process using structured, hierarchical priors native to code. Specifically, AnchorTree uses the abstract syntax tree to prioritize resolving syntactically and semantically salient tokens, such as keywords (e.g., if, while) and identifiers (e.g., variable names), thereby establishing a structural scaffold that guides the remaining generation. We validate this framework via AnCoder, a family of models showing that structurally anchored diffusion offers a parameter-efficient path to high-quality code generation.
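
The structural anchors AnchorTree prioritizes, keywords and identifiers, are exactly what an abstract syntax tree exposes. A small illustration with Python's `ast` module (the paper's own tokenization and AST handling may differ):

```python
import ast

src = "while x > 0:\n    x = x - 1\n"
tree = ast.parse(src)

# Identifier names appear as Name nodes; keywords are reflected in node types.
idents = sorted({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)})
node_kinds = {type(n).__name__ for n in ast.walk(tree)}

assert idents == ["x"]
assert "While" in node_kinds   # the `while` keyword's structural anchor
```

Resolving such anchor tokens first gives the diffusion process a syntactic scaffold before filling in the remaining tokens.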

[LG-59] Optimal Multi-Debris Mission Planning in LEO: A Deep Reinforcement Learning Approach with Co-Elliptic Transfers and Refueling

链接: https://arxiv.org/abs/2602.17685
作者: Agni Bandyopadhyay,Gunther Waxenegger-Wilfing
类目: Machine Learning (cs.LG); Robotics (cs.RO); Space Physics (physics.space-ph)
*备注: Presented at Conference: IFAC Workshop on Control Aspects of Multi-Satellite Systems (CAMSAT) 2025 At: Wuerzburg

点击查看摘要

Abstract:This paper addresses the challenge of multi-target active debris removal (ADR) in Low Earth Orbit (LEO) by introducing a unified co-elliptic maneuver framework that combines Hohmann transfers, safety-ellipse proximity operations, and explicit refueling logic. We benchmark three distinct planning algorithms, a Greedy heuristic, Monte Carlo Tree Search (MCTS), and deep reinforcement learning (RL) using Masked Proximal Policy Optimization (PPO), within a realistic orbital simulation environment featuring randomized debris fields, keep-out zones, and delta-V constraints. Experimental results over 100 test scenarios demonstrate that Masked PPO achieves superior mission efficiency and computational performance, visiting up to twice as many debris as Greedy and significantly outperforming MCTS in runtime. These findings underscore the promise of modern RL methods for scalable, safe, and resource-efficient space mission planning, paving the way for future advancements in ADR autonomy.
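
Masked PPO restricts the policy to valid actions by masking logits before the softmax, a standard construction. A minimal sketch (the mask semantics, e.g. "this debris was already removed", are illustrative):

```python
import numpy as np

def masked_softmax(logits, mask):
    """Invalid actions (mask == False) get -inf logits, hence zero probability."""
    z = np.where(mask, logits, -np.inf)
    z = z - z.max()            # stabilize; max is taken over valid entries
    p = np.exp(z)              # exp(-inf) -> 0 for masked actions
    return p / p.sum()

logits = np.array([2.0, 0.5, -1.0, 1.5])
mask = np.array([True, False, True, True])   # e.g. debris 1 already removed
p = masked_softmax(logits, mask)
assert p[1] == 0.0 and np.isclose(p.sum(), 1.0)
```

Because masked actions have exactly zero probability, the policy gradient never pushes mass toward infeasible transfers.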

[LG-60] Duality Models: An Embarrassingly Simple One-step Generation Paradigm

链接: https://arxiv.org/abs/2602.17682
作者: Peng Sun,Xinyi Shang,Tao Lin,Zhiqiang Shen
类目: Machine Learning (cs.LG)
*备注: this https URL

点击查看摘要

Abstract:Consistency-based generative models like Shortcut and MeanFlow achieve impressive results via a target-aware design for solving the Probability Flow ODE (PF-ODE). Typically, such methods introduce a target time r alongside the current time t to modulate outputs between a local multi-step derivative ( r = t ) and a global few-step integral ( r = 0 ). However, the conventional “one input, one output” paradigm enforces a partition of the training budget, often allocating a significant portion (e.g., 75% in MeanFlow) solely to the multi-step objective for stability. This separation forces a trade-off: allocating sufficient samples to the multi-step objective leaves the few-step generation undertrained, which harms convergence and limits scalability. To this end, we propose Duality Models (DuMo) via a “one input, dual output” paradigm. Using a shared backbone with dual heads, DuMo simultaneously predicts velocity v_t and flow-map u_t from a single input x_t . This applies geometric constraints from the multi-step objective to every sample, bounding the few-step estimation without separating training objectives, thereby significantly improving stability and efficiency. On ImageNet 256 \times 256, a 679M Diffusion Transformer with SD-VAE achieves a state-of-the-art (SOTA) FID of 1.79 in just 2 steps. Code is available at: this https URL
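
The "one input, dual output" idea amounts to a shared backbone with two heads reading the same features, so every sample contributes to both objectives. A toy NumPy forward pass (the shapes and the tanh backbone are illustrative stand-ins, not the paper's Diffusion Transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16
W = rng.standard_normal((d, h)) / np.sqrt(d)   # shared backbone weights
Wv = rng.standard_normal((h, d)) / np.sqrt(h)  # velocity head (v_t)
Wu = rng.standard_normal((h, d)) / np.sqrt(h)  # flow-map head (u_t)

def dumo_forward(x_t):
    """One input x_t, dual outputs: both heads read the same features."""
    feat = np.tanh(x_t @ W)
    return feat @ Wv, feat @ Wu

v, u = dumo_forward(rng.standard_normal(d))
assert v.shape == (d,) and u.shape == (d,)
```

In contrast, a "one input, one output" design would need to split the batch between the two objectives, which is the budget partition the paper argues against.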

[LG-61] BioBridge: Bridging Proteins and Language for Enhanced Biological Reasoning with LLMs

链接: https://arxiv.org/abs/2602.17680
作者: Yujia Wang,Jihong Guan,Wengen Li,Shuigeng Zhou,Xuhong Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing Protein Language Models (PLMs) often suffer from limited adaptability to multiple tasks and exhibit poor generalization across diverse biological contexts. In contrast, general-purpose Large Language Models (LLMs) lack the capability to interpret protein sequences and fall short in domain-specific knowledge, limiting their capacity for effective biosemantic reasoning. To combine the advantages of both, we propose BioBridge, a domain-adaptive continual pretraining framework for protein understanding. This framework employs Domain-Incremental Continual Pre-training (DICP) to infuse protein domain knowledge and a general reasoning corpus into an LLM simultaneously, effectively mitigating catastrophic forgetting. Cross-modal alignment is achieved via a PLM-Projector-LLM pipeline, which maps protein sequence embeddings into the semantic space of the language model. Ultimately, an end-to-end optimization is adopted to uniformly support various tasks, including protein property prediction and knowledge question-answering. Our proposed BioBridge demonstrates performance comparable to that of mainstream PLMs on multiple protein benchmarks, such as EC and BindingDB. It also achieves results on par with LLMs on general understanding tasks like MMLU and RACE. This showcases its innovative advantage of combining domain-specific adaptability with general-purpose language competency.

[LG-62] Joint Parameter and State-Space Bayesian Optimization: Using Process Expertise to Accelerate Manufacturing Optimization

链接: https://arxiv.org/abs/2602.17679
作者: Saksham Kiroriwal,Julius Pfrommer,Jürgen Beyerer
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: This paper is under review and has been submitted for CIRP CMS 2026

点击查看摘要

Abstract:Bayesian optimization (BO) is a powerful method for optimizing black-box manufacturing processes, but its performance is often limited when dealing with high-dimensional multi-stage systems, where we can observe intermediate outputs. Standard BO models the process as a black box and ignores the intermediate observations and the underlying process structure. Partially Observable Gaussian Process Networks (POGPN) model the process as a Directed Acyclic Graph (DAG). However, using intermediate observations is challenging when the observations are high-dimensional state-space time series. Process-expert knowledge can be used to extract low-dimensional latent features from the high-dimensional state-space data. We propose POGPN-JPSS, a framework that combines POGPN with Joint Parameter and State-Space (JPSS) modeling to use intermediate extracted information. We demonstrate the effectiveness of POGPN-JPSS on a challenging, high-dimensional simulation of a multi-stage bioethanol production process. Our results show that POGPN-JPSS significantly outperforms state-of-the-art methods by achieving the desired performance threshold twice as fast and with greater reliability. The fast optimization directly translates to substantial savings in time and resources. This highlights the importance of combining expert knowledge with structured probabilistic models for rapid process maturation.

[LG-63] Benchmarking Graph Neural Networks in Solving Hard Constraint Satisfaction Problems

链接: https://arxiv.org/abs/2602.18419
作者: Geri Skenderi,Lorenzo Buffoni,Francesco D’Amico,David Machado,Raffaele Marino,Matteo Negri,Federico Ricci-Tersenghi,Carlo Lucibello,Maria Chiara Angelini
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) are increasingly applied to hard optimization problems, often claiming superiority over classical heuristics. However, such claims risk being unsound due to a lack of standard benchmarks on truly hard instances. From a statistical physics perspective, we propose new hard benchmarks based on random problems. We provide these benchmarks, along with performance results from both classical heuristics and GNNs. Our fair comparison shows that classical algorithms still outperform GNNs. We discuss the challenges for neural networks in this domain. Future claims of superiority can be made more robust using our benchmarks, available at this https URL.

[LG-64] Theory and interpretability of Quantum Extreme Learning Machines: a Pauli-transfer matrix approach

链接: https://arxiv.org/abs/2602.18377
作者: Markus Gross,Hans-Martin Rieser
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 34 pages, 12 figures

点击查看摘要

Abstract:Quantum reservoir computers (QRCs) have emerged as a promising approach to quantum machine learning, since they utilize the natural dynamics of quantum systems for data processing and are simple to train. Here, we consider n-qubit quantum extreme learning machines (QELMs) with continuous-time reservoir dynamics. QELMs are memoryless QRCs capable of various ML tasks, including image classification and time series forecasting. We apply the Pauli transfer matrix (PTM) formalism to theoretically analyze the influence of encoding, reservoir dynamics, and measurement operations, including temporal multiplexing, on the QELM performance. This formalism makes explicit that the encoding determines the complete set of (nonlinear) features available to the QELM, while the quantum channels linearly transform these features before they are probed by the chosen measurement operators. Optimizing a QELM can therefore be cast as a decoding problem in which one shapes the channel-induced transformations such that task-relevant features become available to the regressor. The PTM formalism allows one to identify the classical representation of a QELM and thereby guide its design towards a given training objective. As a specific application, we focus on learning nonlinear dynamical systems and show that a QELM trained on such trajectories learns a surrogate-approximation to the underlying flow map.

[LG-65] Clapeyron Neural Networks for Single-Species Vapor-Liquid Equilibria

链接: https://arxiv.org/abs/2602.18313
作者: Jan Pavšek,Alexander Mitsos,Elvis J. Sim,Jan G. Rittig
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) approaches have shown promising results for predicting molecular properties relevant for chemical process design. However, they are often limited by scarce experimental property data and lack thermodynamic consistency. As such, thermodynamics-informed ML, i.e., incorporating thermodynamic relations into the loss function as regularization term for training, has been proposed. We herein transfer the concept of thermodynamics-informed graph neural networks (GNNs) from the Gibbs-Duhem to the Clapeyron equation, predicting several pure component properties in a multi-task manner, namely: vapor pressure, liquid molar volume, vapor molar volume and enthalpy of vaporization. We find improved prediction accuracy of the Clapeyron-GNN compared to the single-task learning setting, and improved approximation of the Clapeyron equation compared to the purely data-driven multi-task learning setting. In fact, we observe the largest improvement in prediction accuracy for the properties with the lowest availability of data, making our model promising for practical application in data scarce scenarios of chemical engineering practice.
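
For a sense of what a Clapeyron-based regularizer penalizes: for an ideal vapor the Clapeyron relation reduces to the Clausius-Clapeyron form d ln P / dT = ΔH_vap / (R T²), whose residual vanishes for a thermodynamically consistent vapor-pressure model. The constants below are hypothetical and the ideal-gas simplification is ours, not necessarily the paper's exact loss:

```python
import numpy as np

R = 8.314            # gas constant, J/(mol K)
dH = 40.7e3          # hypothetical enthalpy of vaporization, J/mol
A = 1.0e11           # hypothetical pre-exponential factor, Pa

def p_sat(T):
    """A vapor-pressure model that satisfies Clausius-Clapeyron exactly."""
    return A * np.exp(-dH / (R * T))

# residual of the consistency relation: d ln P / dT - dH / (R T^2)
T, eps = 350.0, 1e-3
dlnp_dT = (np.log(p_sat(T + eps)) - np.log(p_sat(T - eps))) / (2 * eps)
residual = dlnp_dT - dH / (R * T**2)
assert abs(residual) < 1e-8   # consistent model => (near-)zero residual
```

A thermodynamics-informed training loss adds a penalty on exactly this kind of residual, evaluated on the network's predicted properties.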

[LG-66] Machine-learning force-field models for dynamical simulations of metallic magnets

链接: https://arxiv.org/abs/2602.18213
作者: Gia-Wei Chern,Yunhao Fan,Sheng Zhang,Puhan Zhang
类目: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:We review recent advances in machine learning (ML) force-field methods for Landau-Lifshitz-Gilbert (LLG) simulations of itinerant electron magnets, focusing on scalability and transferability. Built on the principle of locality, a deep neural network model is developed to efficiently and accurately predict the electron-mediated forces governing spin dynamics. Symmetry-aware descriptors constructed through a group-theoretical approach ensure rigorous incorporation of both lattice and spin-rotation symmetries. The framework is demonstrated using the prototypical s-d exchange model widely employed in spintronics. ML-enabled large-scale simulations reveal novel nonequilibrium phenomena, including anomalous coarsening of tetrahedral spin order on the triangular lattice and the freezing of phase separation dynamics in lightly hole-doped, strong-coupling square-lattice systems. These results establish ML force-field frameworks as scalable, accurate, and versatile tools for modeling nonequilibrium spin dynamics in itinerant magnets.

[LG-67] Box Thirding: Anytime Best Arm Identification under Insufficient Sampling

链接: https://arxiv.org/abs/2602.18186
作者: Seohwa Hwang,Junyong Park
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 29 pages, 5 figures

点击查看摘要

Abstract:We introduce Box Thirding (B3), a flexible and efficient algorithm for Best Arm Identification (BAI) under fixed-budget constraints. It is designed for both anytime BAI and scenarios with large N, where the number of arms is too large for exhaustive evaluation within a limited budget T. The algorithm employs an iterative ternary comparison: in each iteration, three arms are compared–the best-performing arm is explored further, the median is deferred for future comparisons, and the weakest is discarded. Even without prior knowledge of T, B3 achieves an epsilon-best arm misidentification probability comparable to Successive Halving (SH), which requires T as a predefined parameter, applied to a randomly selected subset of c0 arms that fit within the budget. Empirical results show that B3 outperforms existing methods under limited-budget constraints in terms of simple regret, as demonstrated on the New Yorker Cartoon Caption Contest dataset.
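
The ternary comparison at the heart of B3 can be sketched in a few lines. This toy omits the paper's budget handling and anytime guarantees, and is not the authors' implementation:

```python
import random

def b3_toy(means, pulls, rng):
    """Toy ternary step: per round, compare three arms; the best is explored
    further, the median is deferred to the back, the weakest is discarded."""
    alive = list(range(len(means)))
    rng.shuffle(alive)
    while len(alive) > 1:
        trio, alive = alive[:3], alive[3:]
        est = {a: sum(rng.gauss(means[a], 1.0) for _ in range(pulls)) / pulls
               for a in trio}
        ranked = sorted(trio, key=est.get, reverse=True)
        alive.insert(0, ranked[0])      # best: keep exploring
        if len(ranked) == 3:
            alive.append(ranked[1])     # median: deferred for later rounds
        # weakest (ranked[-1]) is discarded
    return alive[0]

rng = random.Random(0)
means = [0.0, 0.1, 0.2, 0.3, 1.5]       # arm 4 is clearly best
assert b3_toy(means, pulls=400, rng=rng) == 4
```

Because each round retires at least one arm, the loop needs no predefined total budget T, which is the "anytime" flavor of the algorithm.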

[LG-68] BONNI: Gradient-Informed Bayesian and Interior Point Optimization for Efficient Inverse Design in Nanophotonics

链接: https://arxiv.org/abs/2602.18148
作者: Yannik Mahlau,Yannick Augenstein,Tyler W. Hughes,Marius Lindauer,Bodo Rosenhahn
类目: Optics (physics.optics); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inverse design, particularly geometric shape optimization, provides a systematic approach for developing high-performance nanophotonic devices. While numerous optimization algorithms exist, previous global approaches exhibit slow convergence and conversely local search strategies frequently become trapped in local optima. To address the limitations inherent to both local and global approaches, we introduce BONNI: Bayesian optimization through neural network ensemble surrogates with interior point optimization. It augments global optimization with an efficient incorporation of gradient information to determine optimal sampling points. This capability allows BONNI to circumvent the local optima found in many nanophotonic applications, while capitalizing on the efficiency of gradient-based optimization. We demonstrate BONNI’s capabilities in the design of a distributed Bragg reflector as well as a dual-layer grating coupler through an exhaustive comparison against other optimization algorithms commonly used in literature. Using BONNI, we were able to design a 10-layer distributed Bragg reflector with only 4.5% mean spectral error, compared to the previously reported results of 7.8% error with 16 layers. Further designs of a broadband waveguide taper and photonic crystal waveguide transition validate the capabilities of BONNI.

[LG-69] On the Generalization and Robustness in Conditional Value-at-Risk

链接: https://arxiv.org/abs/2602.18053
作者: Dinesh Karthik Mulumudi,Piyushi Manupriya,Gholamali Aminian,Anant Raj
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Conditional Value-at-Risk (CVaR) is a widely used risk-sensitive objective for learning under rare but high-impact losses, yet its statistical behavior under heavy-tailed data remains poorly understood. Unlike expectation-based risk, CVaR depends on an endogenous, data-dependent quantile, which couples tail averaging with threshold estimation and fundamentally alters both generalization and robustness properties. In this work, we develop a learning-theoretic analysis of CVaR-based empirical risk minimization under heavy-tailed and contaminated data. We establish sharp, high-probability generalization and excess risk bounds under minimal moment assumptions, covering fixed hypotheses, finite and infinite classes, and extending to \beta -mixing dependent data; we further show that these rates are minimax optimal. To capture the intrinsic quantile sensitivity of CVaR, we derive a uniform Bahadur-Kiefer type expansion that isolates a threshold-driven error term absent in mean-risk ERM and essential in heavy-tailed regimes. We complement these results with robustness guarantees by proposing a truncated median-of-means CVaR estimator that achieves optimal rates under adversarial contamination. Finally, we show that CVaR decisions themselves can be intrinsically unstable under heavy tails, establishing a fundamental limitation on decision robustness even when the population optimum is well separated. Together, our results provide a principled characterization of when CVaR learning generalizes and is robust, and when instability is unavoidable due to tail scarcity.
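
For reference, the empirical CVaR that this analysis studies is simply the average of the worst (1 − α) fraction of losses beyond the α-quantile; the paper's robust variant additionally applies truncation and median-of-means, omitted in this minimal sketch:

```python
import numpy as np

def cvar(losses, alpha=0.95):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of losses."""
    x = np.sort(np.asarray(losses, dtype=float))
    k = min(int(np.floor(alpha * len(x))), len(x) - 1)  # index of the VaR cutoff
    return x[k:].mean()

losses = np.arange(1.0, 101.0)            # losses 1..100
assert np.isclose(cvar(losses, alpha=0.9), 95.5)   # mean of the worst ten: 91..100
```

The endogenous cutoff index k is the data-dependent quantile whose estimation, per the abstract, couples with tail averaging and drives the extra Bahadur-Kiefer error term.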

[LG-70] Interactions that reshape the interfaces of the interacting parties

链接: https://arxiv.org/abs/2602.17917
作者: David I. Spivak
类目: Category Theory (math.CT); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Polynomial functors model systems with interfaces: each polynomial specifies the outputs a system can produce and, for each output, the inputs it accepts. The bicategory \mathbb{O}\mathbf{rg} of dynamic organizations \cite{spivak2021learners} gives a notion of state-driven interaction patterns that evolves over time, but each system’s interface remains fixed throughout the interaction. Yet in many systems, the outputs sent and inputs received can reshape the interface itself: a cell differentiating in response to chemical signals gains or loses receptors; a sensor damaged by its input loses a channel; a neural network may grow its output resolution during training. Here we introduce polynomial trees, elements of the terminal (u \triangleleft u) -coalgebra where u is the polynomial associated to a universe of sets, to model such systems: a polynomial tree is a coinductive tree whose nodes carry polynomials, and in which each round of interaction – an output chosen and an input received – determines a child tree, hence the next interface. We construct a monoidal closed category \mathbf{PolyTr} of polynomial trees, with coinductively-defined morphisms, tensor product, and internal hom. We then build a bicategory \mathbb{O}\mathbf{rgTr} generalizing \mathbb{O}\mathbf{rg} , whose hom-categories parametrize morphisms by state sets with coinductive action-and-update data. We provide a locally fully faithful functor \mathbb{O}\mathbf{rg} \to \mathbb{O}\mathbf{rgTr} via constant trees, those for which the interfaces do not change through time. We illustrate the generalization by suggesting a notion of progressive generative adversarial networks, where gradient feedback determines when the image-generation interface grows to a higher resolution.
MSC classes: 18D15, 18M30, 18M05, 92B99, 93A16

[LG-71] Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

链接: https://arxiv.org/abs/2602.17894
作者: Michael O. Harding,Vikas Singh,Kirthevasan Kandasamy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities (for example, health markers, demographics, or political affiliations) and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to “match” the target distribution) or relying on standard estimators (e.g. sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size: the total sample size divided by D_{\chi^2}(q \mid\mid \overline{p}) + 1 , where q is the target distribution, \overline{p} is the aggregated source distribution, and D_{\chi^2} is the \chi^2 -divergence. We pair this sampling plan with a classical post-stratification estimator and upper bound its risk. We provide matching lower bounds, establishing that our approach achieves the budgeted minimax optimal risk. Our techniques also extend to prediction problems when minimizing the excess risk, providing a principled approach to multi-source learning with costly and heterogeneous data sources.
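The effective-sample-size formula from the abstract is easy to evaluate directly. A minimal numpy sketch (the function names and the two-group example are mine, not the paper's):

```python
import numpy as np

def chi2_divergence(q, p):
    """D_chi2(q || p) = sum_g (q_g - p_g)^2 / p_g, over groups with p_g > 0."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return np.sum((q - p) ** 2 / p)

def effective_sample_size(n, q, p_bar):
    """n / (D_chi2(q || p_bar) + 1): shrinks as sources drift from the target."""
    return n / (chi2_divergence(q, p_bar) + 1.0)

q = np.array([0.5, 0.5])       # target group proportions
p_bar = np.array([0.8, 0.2])   # aggregated source proportions
print(effective_sample_size(1000, q, p_bar))  # well below the raw n = 1000
```

When the aggregated sources match the target exactly, the divergence vanishes and the effective sample size equals the raw budget; here the mismatch costs over a third of the nominal samples.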

[LG-72] Interactive Learning of Single-Index Models via Stochastic Gradient Descent

链接: https://arxiv.org/abs/2602.17876
作者: Nived Rajaraman,Yanjun Han
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 26 pages, 2 figures

点击查看摘要

Abstract:Stochastic gradient descent (SGD) is a cornerstone algorithm for high-dimensional optimization, renowned for its empirical successes. Recent theoretical advances have provided a deep understanding of how SGD enables feature learning in high-dimensional nonlinear models, most notably the \textit{single-index} model with i.i.d. data. In this work, we study the sequential learning problem for single-index models, also known as generalized linear bandits or ridge bandits, where SGD is a simple and natural solution, yet its learning dynamics remain largely unexplored. We show that, similar to the optimal interactive learner, SGD undergoes a distinct “burn-in” phase before entering the “learning” phase in this setting. Moreover, with an appropriately chosen learning rate schedule, a single SGD procedure simultaneously achieves near-optimal (or best-known) sample complexity and regret guarantees across both phases, for a broad class of link functions. Our results demonstrate that SGD remains highly competitive for learning single-index models under adaptive data.
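A toy version of the setting (illustrative only: the tanh link, squared loss, constant learning rate, and initialization are my choices, not the paper's schedule) shows SGD recovering the index direction of a single-index model y = \sigma(\langle \theta^\star, x \rangle) + noise:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_steps, lr = 20, 20000, 0.05
theta_star = np.zeros(d)
theta_star[0] = 1.0                # hypothetical unknown signal direction

theta = np.ones(d) / np.sqrt(d)    # init with a small positive overlap
for _ in range(n_steps):
    x = rng.normal(size=d)
    y = np.tanh(x @ theta_star) + 0.01 * rng.normal()
    pred = np.tanh(x @ theta)
    # gradient of (pred - y)^2 / 2, using tanh'(z) = 1 - tanh(z)^2
    theta -= lr * (pred - y) * (1 - pred**2) * x

alignment = theta @ theta_star / np.linalg.norm(theta)
print(alignment)                   # approaches 1 as theta aligns with theta_star
```

With a decaying learning-rate schedule, as the abstract describes, the same procedure also controls regret across the burn-in and learning phases; the constant rate here is chosen only for brevity.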

[LG-73] Drift Estimation for Stochastic Differential Equations with Denoising Diffusion Models

链接: https://arxiv.org/abs/2602.17830
作者: Marcos Tapia Costa,Nikolas Kantas,George Deligiannidis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the estimation of time-homogeneous drift functions in multivariate stochastic differential equations with known diffusion coefficient, from multiple trajectories observed at high frequency over a fixed time horizon. We formulate drift estimation as a denoising problem conditional on previous observations, and propose an estimator of the drift function which is a by-product of training a conditional diffusion model capable of simulating new trajectories dynamically. Across different drift classes, the proposed estimator was found to match classical methods in low dimensions and remained consistently competitive in higher dimensions, with gains that cannot be attributed to architectural design choices alone.
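As a point of contrast for the diffusion-based estimator (this sketch is emphatically not the paper's method), the classical route to drift estimation from high-frequency data is a local average of increments, since E[X_{t+dt} - X_t | X_t = x] \approx b(x)\,dt. A toy Ornstein-Uhlenbeck example in numpy, with hypothetical parameter choices:

```python
import numpy as np

# Euler-Maruyama simulation of the OU process dX = -theta * X dt + sigma dW,
# observed as many short paths at high frequency.
rng = np.random.default_rng(1)
theta, sigma, dt = 1.0, 0.5, 1e-3
n_steps, n_paths = 2000, 200
x = rng.normal(size=n_paths)
xs, dxs = [], []
for _ in range(n_steps):
    dx = -theta * x * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)
    xs.append(x.copy())
    dxs.append(dx)
    x = x + dx
xs, dxs = np.concatenate(xs), np.concatenate(dxs)

# Crude nonparametric estimate: local average of dX/dt near x0 = 0.5
# estimates the true drift b(0.5) = -theta * 0.5 = -0.5.
mask = np.abs(xs - 0.5) < 0.1
drift_hat = np.mean(dxs[mask]) / dt
print(drift_hat)
```

The high variance of this binned estimator in higher dimensions is exactly the gap that conditional denoising models, which share information across the whole state space, aim to close.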

[LG-74] Topological Exploration of High-Dimensional Empirical Risk Landscapes: general approach and applications to phase retrieval

链接: https://arxiv.org/abs/2602.17779
作者: Antoine Maillard,Tony Bonnaire,Giulio Biroli
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 43 pages, 14 figures

点击查看摘要

Abstract:We consider the landscape of empirical risk minimization for high-dimensional Gaussian single-index models (generalized linear models). The objective is to recover an unknown signal \boldsymbol{\theta}^\star \in \mathbb{R}^d (where d \gg 1 ) from a loss function \hat{R}(\boldsymbol{\theta}) that depends on pairs of labels (\mathbf{x}_i \cdot \boldsymbol{\theta}, \mathbf{x}_i \cdot \boldsymbol{\theta}^\star)_{i=1}^n , with \mathbf{x}_i \sim \mathcal{N}(0, I_d) , in the proportional asymptotic regime n \asymp d . Using the Kac-Rice formula, we analyze different complexities of the landscape – defined as the expected number of critical points – corresponding to various types of critical points, including local minima. We first show that some variational formulas previously established in the literature for these complexities can be drastically simplified, reducing to explicit variational problems over a finite number of scalar parameters that we can efficiently solve numerically. Our framework also provides detailed predictions for properties of the critical points, including the spectral properties of the Hessian and the joint distribution of labels. We apply our analysis to the real phase retrieval problem for which we derive complete topological phase diagrams of the loss landscape, characterizing notably BBP-type transitions where the Hessian at local minima (as predicted by the Kac-Rice formula) becomes unstable in the direction of the signal. We test the predictive power of our analysis to characterize gradient flow dynamics, finding excellent agreement with finite-size simulations of local optimization algorithms, and capturing fine-grained details such as the empirical distribution of labels. Overall, our results open new avenues for the asymptotic study of loss landscapes and topological trivialization phenomena in high-dimensional statistical models.

[LG-75] Learning Flow Distributions via Projection-Constrained Diffusion on Manifolds

链接: https://arxiv.org/abs/2602.17773
作者: Noah Trupin,Rahul Ghosh,Aadi Jangid
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a generative modeling framework for synthesizing physically feasible two-dimensional incompressible flows under arbitrary obstacle geometries and boundary conditions. Whereas existing diffusion-based flow generators either ignore physical constraints, impose soft penalties that do not guarantee feasibility, or specialize to fixed geometries, our approach integrates three complementary components: (1) a boundary-conditioned diffusion model operating on velocity fields; (2) a physics-informed training objective incorporating a divergence penalty; and (3) a projection-constrained reverse diffusion process that enforces exact incompressibility through a geometry-aware Helmholtz-Hodge operator. We derive the method as a discrete approximation to constrained Langevin sampling on the manifold of divergence-free vector fields, providing a connection between modern diffusion models and geometric constraint enforcement in incompressible flow spaces. Experiments on analytic Navier-Stokes data and obstacle-bounded flow configurations demonstrate significantly improved divergence, spectral accuracy, vorticity statistics, and boundary consistency relative to unconstrained, projection-only, and penalty-only baselines. Our formulation unifies soft and hard physical structure within diffusion models and provides a foundation for generative modeling of incompressible fields in robotics, graphics, and scientific computing.
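The projection step can be illustrated on a periodic domain (assumptions: FFT on a torus with a spectral Leray projector, not the paper's geometry-aware operator for obstacle-bounded flows; the helper name `leray_project` is mine):

```python
import numpy as np

def leray_project(u, v):
    """Project a 2D periodic velocity field (u, v) onto divergence-free fields
    via FFT: remove the gradient part of the Helmholtz-Hodge decomposition."""
    n = u.shape[0]
    k = np.fft.fftfreq(n) * n                 # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                            # avoid division by zero at the mean mode
    uh, vh = np.fft.fft2(u), np.fft.fft2(v)
    div_h = kx * uh + ky * vh                 # factor i omitted: it cancels below
    uh -= kx * div_h / k2
    vh -= ky * div_h / k2
    return np.real(np.fft.ifft2(uh)), np.real(np.fft.ifft2(vh))

# A pure-gradient field (the gradient of sin(x)sin(y)) projects to zero.
n = 64
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
u, v = np.cos(X) * np.sin(Y), np.sin(X) * np.cos(Y)
up, vp = leray_project(u, v)
div = np.gradient(up, x, axis=0) + np.gradient(vp, x, axis=1)
print(np.max(np.abs(div)))  # numerically zero divergence after projection
```

Applying such a projector after each reverse-diffusion step is what turns the soft divergence penalty into a hard incompressibility guarantee in the paper's framing.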

[LG-76] Sparse Bayesian Modeling of EEG Channel Interactions Improves P300 Brain-Computer Interface Performance

链接: https://arxiv.org/abs/2602.17772
作者: Guoxuan Ma,Yuan Zhong,Moyan Li,Yuxiao Nie,Jian Kang
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG)-based P300 brain-computer interfaces (BCIs) enable communication without physical movement by detecting stimulus-evoked neural responses. Accurate and efficient decoding remains challenging due to high dimensionality, temporal dependence, and complex interactions across EEG channels. Most existing approaches treat channels independently or rely on black-box machine learning models, limiting interpretability and personalization. We propose a sparse Bayesian time-varying regression framework that explicitly models pairwise EEG channel interactions while performing automatic temporal feature selection. The model employs a relaxed-thresholded Gaussian process prior to induce structured sparsity in both channel-specific and interaction effects, enabling interpretable identification of task-relevant channels and channel pairs. Applied to a publicly available P300 speller dataset of 55 participants, the proposed method achieves a median character-level accuracy of 100% using all stimulus sequences and attains the highest overall decoding performance among competing statistical and deep learning approaches. Incorporating channel interactions yields subgroup-specific gains of up to 7% in character-level accuracy, particularly among participants who abstained from alcohol (up to 18% improvement). Importantly, the proposed method improves median BCI-Utility by approximately 10% at its optimal operating point, achieving peak throughput after only seven stimulus sequences. These results demonstrate that explicitly modeling structured EEG channel interactions within a principled Bayesian framework enhances predictive accuracy, improves user-centric throughput, and supports personalization in P300 BCI systems.

[LG-77] AgriVariant: Variant Effect Prediction using DeepChem-Variant for Precision Breeding in Rice

链接: https://arxiv.org/abs/2602.17747
作者: Ankita Vaishnobi Bisoi,Bharath Ramsundar
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Predicting functional consequences of genetic variants in crop genes remains a critical bottleneck for precision breeding programs. We present AgriVariant, an end-to-end pipeline for variant-effect prediction in rice (Oryza sativa) that addresses the lack of crop-specific variant-interpretation tools and can be extended to any crop species with available reference genomes and gene annotations. Our approach integrates deep learning-based variant calling (DeepChem-Variant) with custom plant genomics annotation using RAP-DB gene models and database-independent deleteriousness scoring that combines the Grantham distance and the BLOSUM62 substitution matrix. We validate the pipeline through targeted mutations in stress-response genes (OsDREB2a, OsDREB1F, SKC1), demonstrating correct classification of stop-gained, missense, and synonymous variants with appropriate HIGH / MODERATE / LOW impact assignments. An exhaustive mutagenesis study of OsMT-3a analyzed all 1,509 possible single-nucleotide variants in 10 days, identifying 353 high-impact, 447 medium-impact, and 709 low-impact variants - an analysis that would have required 2-4 years using traditional wet-lab approaches. This computational framework enables breeders to prioritize variants for experimental validation across diverse crop species, reducing screening costs and accelerating development of climate-resilient crop varieties.

[LG-78] Clever Materials: When Models Identify Good Materials for the Wrong Reasons

链接: https://arxiv.org/abs/2602.17730
作者: Kevin Maik Jablonka
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning can accelerate materials discovery. Models perform impressively on many benchmarks. However, strong benchmark performance does not imply that a model learned chemistry. I test a concrete alternative hypothesis: that property prediction can be driven by bibliographic confounding. Across five tasks spanning MOFs (thermal and solvent stability), perovskite solar cells (efficiency), batteries (capacity), and TADF emitters (emission wavelength), models trained on standard chemical descriptors predict author, journal, and publication year well above chance. When these predicted metadata (“bibliographic fingerprints”) are used as the sole input to a second model, performance is sometimes competitive with conventional descriptor-based predictors. These results show that many datasets do not rule out non-chemical explanations of success. Progress requires routine falsification tests (e.g., group/time splits and metadata ablations), datasets designed to resist spurious correlations, and explicit separation of two goals: predictive utility versus evidence of chemical understanding.

[LG-79] Spectral Homogenization of the Radiative Transfer Equation via Low-Rank Tensor Train Decomposition

链接: https://arxiv.org/abs/2602.17708
作者: Y. Sungtaek Ju
类目: Chemical Physics (physics.chem-ph); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Computational Physics (physics.comp-ph); Plasma Physics (physics.plasm-ph)
*备注: 30 pages; submitted for publication

点击查看摘要

Abstract:Radiative transfer in absorbing-scattering media requires solving a transport equation across a spectral domain with 10^5 - 10^6 molecular absorption lines. Line-by-line (LBL) computation is prohibitively expensive, while existing approximations sacrifice spectral fidelity. We show that the Young-measure homogenization framework produces solution tensors I that admit low-rank tensor-train (TT) decompositions whose bond dimensions remain bounded as the spectral resolution Ns increases. Using molecular line parameters from the HITRAN database for H2O and CO2, we demonstrate that: (i) the TT rank saturates at r = 8 (at tolerance e = 10^-6) from Ns = 16 to 4096, independent of single-scattering albedo, Henyey-Greenstein asymmetry, temperature, and pressure; (ii) quantized tensor-train (QTT) representations achieve sub-linear storage scaling; (iii) in a controlled comparison using identical opacity data and transport solver, the homogenized approach achieves over an order of magnitude lower L2 error than the correlated-k distribution at equal cost; and (iv) for atomic plasma opacity (aluminum at 60 eV, TOPS database), the TT rank saturates at r = 15 with fundamentally different spectral structure (bound-bound and bound-free transitions spanning 12 decades of dynamic range), confirming that rank boundedness is a property of the transport equation rather than any particular opacity source. These results establish that the spectral complexity of radiative transfer has a finite effective rank exploitable by tensor decomposition, complementing the spatial-angular compression achieved by existing TT and dynamical low-rank approaches.
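The bond dimensions being probed can be illustrated with a plain TT-SVD sketch (numpy only; the paper's QTT machinery and HITRAN/TOPS opacity data are far beyond this toy, and the function name `tt_ranks` is mine):

```python
import numpy as np

def tt_ranks(t, tol=1e-10):
    """TT-SVD bond dimensions: rank of each sequential unfolding, truncated at
    relative tolerance tol."""
    ranks, r = [], 1
    rest = t.reshape(1, -1)
    for n in t.shape[:-1]:
        m = rest.reshape(r * n, -1)           # next unfolding matrix
        u, s, vt = np.linalg.svd(m, full_matrices=False)
        r = int(np.sum(s > tol * s[0]))       # truncated bond dimension
        ranks.append(r)
        rest = s[:r, None] * vt[:r]           # carry the remainder forward
    return ranks

# A rank-2 separable tensor t[i,j,k] = a_i b_j c_k + d_i e_j f_k:
# the bond dimension saturates at the separation rank, not the mode size.
rng = np.random.default_rng(0)
a, b, c, d, e, f = (rng.normal(size=8) for _ in range(6))
t = np.einsum("i,j,k->ijk", a, b, c) + np.einsum("i,j,k->ijk", d, e, f)
print(tt_ranks(t))  # expect [2, 2]
```

Here the bond dimension is set by the tensor's separation rank and does not grow with the mode sizes, mirroring the rank saturation the authors report as spectral resolution increases.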

[LG-80] Deep Neural Network Architectures for Electrocardiogram Classification: A Comprehensive Evaluation

链接: https://arxiv.org/abs/2602.17701
作者: Yun Song,Wenjia Zheng,Tiedan Chen,Ziyu Wang,Jiazhao Shi,Yisong Chen
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rising prevalence of cardiovascular diseases, electrocardiograms (ECG) remain essential for the non-invasive detection of cardiac abnormalities. This study presents a comprehensive evaluation of deep neural network architectures for automated arrhythmia classification, integrating temporal modeling, attention mechanisms, and ensemble strategies. To address data scarcity in minority classes, the MIT-BIH Arrhythmia dataset was augmented using a Generative Adversarial Network (GAN). We developed and compared four distinct architectures, including Convolutional Neural Networks (CNN), CNN combined with Long Short-Term Memory (CNN-LSTM), CNN-LSTM with Attention, and 1D Residual Networks (ResNet-1D), to capture both local morphological features and long-term temporal dependencies. Performance was rigorously evaluated using accuracy, F1-score, and Area Under the Curve (AUC) with 95% confidence intervals to ensure statistical robustness, while Gradient-weighted Class Activation Mapping (Grad-CAM) was employed to validate model interpretability. Experimental results indicate that the CNN-LSTM model achieved the optimal stand-alone balance between sensitivity and specificity, yielding an F1-score of 0.951. Conversely, the CNN-LSTM-Attention and ResNet-1D models exhibited higher sensitivity to class imbalance. To mitigate this, a dynamic ensemble fusion strategy was introduced; specifically, the Top2-Weighted ensemble achieved the highest overall performance with an F1-score of 0.958. These findings demonstrate that leveraging complementary deep architectures significantly enhances classification reliability, providing a robust and interpretable foundation for intelligent arrhythmia detection systems.

附件下载

点击下载今日全部论文列表