本篇博文主要内容为 2026-02-12 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。

提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。

目录

概览 (2026-02-12)

今日共更新617篇论文,其中:

  • 自然语言处理86篇(Computation and Language (cs.CL))
  • 人工智能174篇(Artificial Intelligence (cs.AI))
  • 计算机视觉97篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习222篇(Machine Learning (cs.LG))
  • 多智能体系统11篇(Multiagent Systems (cs.MA))
  • 信息检索22篇(Information Retrieval (cs.IR))
  • 人机交互27篇(Human-Computer Interaction (cs.HC))

多智能体系统

[MA-0] Learning to Compose for Cross-domain Agent ic Workflow Generation

【速读】:该论文旨在解决当前生成式 AI(Generative AI)系统在跨域任务中依赖迭代工作流优化(workflow refinement)所带来的高计算成本与不稳定性问题。现有方法在面对领域偏移(domain shift)时,需通过多次迭代从庞大的工作流空间中搜索可行方案,导致延迟高且结果不稳定。其解决方案的关键在于将“分解-重组-决策”机制内化到开源大语言模型(LLM)中:首先学习一组跨域可复用的工作流能力(workflow capabilities),其次基于输入任务将其映射为这些基础能力的稀疏组合以实现单次生成(1-pass generation),最后通过反事实贡献分析(counterfactual attribution)量化各能力对成功与否的边际效应,从而精准驱动高效、稳定的跨域工作流生成。

链接: https://arxiv.org/abs/2602.11114
作者: Jialiang Wang,Shengxiang Xu,Hanmo Liu,Jiachuan Wang,Yuyu Luo,Shimin Di,Min-Ling Zhang,Lei Chen
机构: Hong Kong University of Science and Technology (香港科技大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Southeast University (东南大学); University of Tsukuba (筑波大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Automatically generating agentic workflows – executable operator graphs or codes that orchestrate reasoning, verification, and repair – has become a practical way to solve complex tasks beyond what single-pass LLM generation can reliably handle. Yet what constitutes a good workflow depends heavily on the task distribution and the available operators. Under domain shift, current systems typically rely on iterative workflow refinement to discover a feasible workflow from a large workflow space, incurring high iteration costs and yielding unstable, domain-specific behavior. In response, we internalize a decompose-recompose-decide mechanism into an open-source LLM for cross-domain workflow generation. To decompose, we learn a compact set of reusable workflow capabilities across diverse domains. To recompose, we map each input task to a sparse composition over these bases to generate a task-specific workflow in a single pass. To decide, we attribute the success or failure of workflow generation to counterfactual contributions from learned capabilities, thereby capturing which capabilities actually drive success by their marginal effects. Across stringent multi-domain, cross-domain, and unseen-domain evaluations, our 1-pass generator surpasses SOTA refinement baselines that consume 20 iterations, while substantially reducing generation latency and cost.

[MA-1] he emergence of numerical representations in communicating artificial agents

【速读】:该论文试图解决的问题是:在缺乏预定义数值概念的情况下,仅通过交流压力是否足以促使人工代理(artificial agents)自发形成对数量的精确表示,并且这些 emergent code(涌现代码)是否与人类数词相似。解决方案的关键在于设计了一种参照游戏(referential game),让两个基于神经网络的代理通过离散符号或连续草图两种通信渠道进行数量信息传递。实验表明,尽管没有先验的数值概念,代理在训练分布内能够实现高精度的符号-意义映射,但其涌现代码是非组合性的(non-compositional),无法泛化到未见过的数量,表现为重复使用最高训练数值的符号或将外推值压缩为单一草图。因此,研究结论指出,单纯依靠交流压力可实现对已知数量的精准传输,但要获得组合性编码和泛化能力,仍需额外的机制或约束。

链接: https://arxiv.org/abs/2602.10996
作者: Daniela Mihai,Lucas Weber,Francesca Franzon
机构: Universitat Pompeu Fabra (庞佩乌法布拉大学)
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: In the Sixteenth International Conference on the Evolution of Language

点击查看摘要

Abstract:Human languages provide efficient systems for expressing numerosities, but whether the sheer pressure to communicate is enough for numerical representations to arise in artificial agents, and whether the emergent codes resemble human numerals at all, remains an open question. We study two neural network-based agents that must communicate numerosities in a referential game using either discrete tokens or continuous sketches, thus exploring both symbolic and iconic representations. Without any pre-defined numeric concepts, the agents achieve high in-distribution communication accuracy in both communication channels and converge on high-precision symbol-meaning mappings. However, the emergent code is non-compositional: the agents fail to derive systematic messages for unseen numerosities, typically reusing the symbol of the highest trained numerosity (discrete), or collapsing extrapolated values onto a single sketch (continuous). We conclude that the communication pressure alone suffices for precise transmission of learned numerosities, but additional pressures are needed to yield compositional codes and generalisation abilities.

[MA-2] owards Probabilistic Strategic Timed CTL

【速读】:该论文旨在解决在连续时间、异步执行语义下的多智能体随机系统中进行形式化验证的问题,特别是如何有效支持带有策略约束的时序逻辑性质验证。其解决方案的关键在于提出了一种概率化的战略时序逻辑——PSTCTL(Probabilistic Strategic Timed CTL),该逻辑在STCTL基础上扩展了概率特性,并引入irP-策略(irP-strategies)以实现对随机系统的可行性验证,从而为复杂动态环境中智能体间的协作与竞争行为提供形式化分析框架。

链接: https://arxiv.org/abs/2602.10824
作者: Wojciech Jamroga,Marta Kwiatkowska,Wojciech Penczek,Laure Petrucci,Teofil Sidoruk
机构: Institute of Computer Science, Polish Academy of Sciences (波兰科学院计算机科学研究所); University of Oxford (牛津大学); Paris 13 University, LIPN (巴黎第十三大学,LIPN实验室)
类目: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We define PSTCTL, a probabilistic variant of Strategic Timed CTL (STCTL), interpreted over stochastic multi-agent systems with continuous time and asynchronous execution semantics. STCTL extends TCTL with strategic operators in the style of ATL. Moreover, we demonstrate the feasibility of verification with irP-strategies.

[MA-3] IMITATOR4AMAS: Strategy Synthesis for STCTL

【速读】:该论文旨在解决在参数化定时自动机(Parametric Timed Automata, PTA)网络中,针对具有非完美信息和无记忆策略的时空计算逻辑(STCTL)进行模型检测与策略合成的问题。其解决方案的关键在于提出并实现了一个名为IMITATOR4AMAS的新工具,该工具基于现有验证器IMITATOR扩展而来,首次支持在异步执行环境下对无记忆非完美信息策略进行合成,实验结果表明相较于以往方法具有显著的速度提升。

链接: https://arxiv.org/abs/2602.10810
作者: Davide Catta,Adrien Lacroix,Wojciech Penczek,Laure Petrucci,Teofil Sidoruk
机构: LIPN, Université Paris 13 (巴黎十三大学LIPN实验室); École des Mines de Nantes (南特国立高等矿业学院); IPiPan, Polish Academy of Sciences (波兰科学院信息处理研究所)
类目: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:IMITATOR4AMAS supports model checking and synthesis of memoryless imperfect information strategies for STCTL, interpreted over networks of parametric timed automata with asynchronous execution. While extending the verifier IMITATOR, IMITATOR4AMAS is the first tool for strategy synthesis in this setting. Our experimental results show a substantial speedup over previous approaches.

[MA-4] Beyond Task Performance: A Metric-Based Analysis of Sequential Cooperation in Heterogeneous Multi-Agent Destructive Forag ing

【速读】:该论文旨在解决异构多智能体系统在部分可观测性和时间角色依赖条件下协作行为的分析问题,特别是在破坏性多智能体觅食场景中。其核心挑战在于如何超越传统算法性能指标(如任务完成效率),全面刻画协作中的协调性、依赖关系、公平性及敏感性等多维特性。解决方案的关键是提出一套系统性的通用协作度量体系,分为三大类:基础度量、团队间度量和团队内度量,能够从多层次对协作行为进行量化表征,并在受动态水面清洁启发的现实场景中验证其有效性,适用于多种具有类似顺序依赖关系的多智能体序列决策领域。

链接: https://arxiv.org/abs/2602.10685
作者: Alejandro Mendoza Barrionuevo,Samuel Yanes Luis,Daniel Gutiérrez Reina,Sergio L. Toral Marín
机构: University of Sevilla (塞维利亚大学)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This work addresses the problem of analyzing cooperation in heterogeneous multi-agent systems which operate under partial observability and temporal role dependency, framed within a destructive multi-agent foraging setting. Unlike most previous studies, which focus primarily on algorithmic performance with respect to task completion, this article proposes a systematic set of general-purpose cooperation metrics aimed at characterizing not only efficiency, but also coordination and dependency between teams and agents, fairness, and sensitivity. These metrics are designed to be transferable to different multi-agent sequential domains similar to foraging. The proposed suite of metrics is structured into three main categories that jointly provide a multilevel characterization of cooperation: primary metrics, inter-team metrics, and intra-team metrics. They have been validated in a realistic destructive foraging scenario inspired by dynamic aquatic surface cleaning using heterogeneous autonomous vehicles. It involves two specialized teams with sequential dependencies: one focused on the search of resources, and another on their destruction. Several representative approaches have been evaluated, covering both learning-based algorithms and classical heuristic paradigms.

[MA-5] An Ontology-driven Dynamic Knowledge Base for Uninhabited Ground Vehicles

【速读】:该论文旨在解决无人地面车辆(Uninhabited Ground Vehicles, UGVs)在战术边缘环境中因依赖预先加载的先验信息而导致的识别模糊性问题,尤其在任务执行过程中遭遇意外情况时,易引发决策延迟并增加人工干预需求。解决方案的关键在于引入动态情境任务数据(Dynamic Contextual Mission Data, DCMD),构建基于本体(ontology)驱动的动态知识库,通过近实时的信息获取与分析实现任务中平台本地的DCMD更新,从而提升态势感知能力、自主决策水平和环境适应性。实验验证表明,该方法可生成机器可操作的情境信息,有效支持任务成功执行。

链接: https://arxiv.org/abs/2602.10555
作者: Hsan Sandar Win,Andrew Walters,Cheng-Chew Lim,Daniel Webber,Seth Leslie,Tan Doan
机构: The University of Adelaide (阿德莱德大学); Defence Science and Technology Group (国防科学与技术集团)
类目: Multiagent Systems (cs.MA); Databases (cs.DB); Robotics (cs.RO)
备注: 10 pages, 11 figures, 2025 Australasian Conference on Robotics and Automation (ACRA 2025)

点击查看摘要

Abstract:In this paper, the concept of Dynamic Contextual Mission Data (DCMD) is introduced to develop an ontology-driven dynamic knowledge base for Uninhabited Ground Vehicles (UGVs) at the tactical edge. The dynamic knowledge base with DCMD is added to the UGVs to: support enhanced situation awareness; improve autonomous decision making; and facilitate agility within complex and dynamic environments. As UGVs are heavily reliant on the a priori information added pre-mission, unexpected occurrences during a mission can cause identification ambiguities and require increased levels of user input. Updating this a priori information with contextual information can help UGVs realise their full potential. To address this, the dynamic knowledge base was designed using an ontology-driven representation, supported by near real-time information acquisition and analysis, to provide in-mission on-platform DCMD updates. This was implemented on a team of four UGVs that executed a laboratory based surveillance mission. The results showed that the ontology-driven dynamic representation of the UGV operational environment was machine actionable, producing contextual information to support a successful and timely mission, and contributed directly to the situation awareness.

[MA-6] Protecting Context and Prompts: Deterministic Security for Non-Deterministic AI

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)应用在面对提示注入(prompt injection)和上下文操纵(context manipulation)攻击时,传统安全模型无法有效防护的问题。其解决方案的关键在于提出两个新型密码学原语:认证提示(authenticated prompts)和认证上下文(authenticated context),前者实现自包含的溯源验证,后者通过抗篡改哈希链保障动态输入的完整性;在此基础上构建了一个具有四条定理支持的策略代数(policy algebra),实现了协议级别的拜占庭容错能力,确保即使存在恶意代理也无法违反组织策略。该方案首次结合了密码学强制的提示溯源、抗篡改上下文与可证明的策略推理,将LLM安全从被动检测转向主动防御与形式化保障。

链接: https://arxiv.org/abs/2602.10481
作者: Mohan Rajagopalan,Vinay Rao
机构: MACAW Security, Inc.(MACAW安全公司); ROOST.tools
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) applications are vulnerable to prompt injection and context manipulation attacks that traditional security models cannot prevent. We introduce two novel primitives–authenticated prompts and authenticated context–that provide cryptographically verifiable provenance across LLM workflows. Authenticated prompts enable self-contained lineage verification, while authenticated context uses tamper-evident hash chains to ensure integrity of dynamic inputs. Building on these primitives, we formalize a policy algebra with four proven theorems providing protocol-level Byzantine resistance–even adversarial agents cannot violate organizational policies. Five complementary defenses–from lightweight resource controls to LLM-based semantic validation–deliver layered, preventative security with formal guarantees. Evaluation against representative attacks spanning 6 exhaustive categories achieves 100% detection with zero false positives and nominal overhead. We demonstrate the first approach combining cryptographically enforced prompt lineage, tamper-evident context, and provable policy reasoning–shifting LLM security from reactive detection to preventative guarantees.

[MA-7] Authenticated Workflows: A Systems Approach to Protecting Agent ic AI

【速读】:该论文旨在解决企业级生成式 AI(Generative AI)系统在自动化工作流中面临的安全挑战,现有防御机制如护栏(guardrails)和语义过滤器具有概率性且易被绕过。其核心解决方案是引入认证工作流(authenticated workflows),构建首个面向企业生成式 AI 的完整信任层,将安全问题归结为四个基本边界——提示词(prompts)、工具(tools)、数据(data)和上下文(context)的保护。关键在于在每个边界跨越时强制执行意图(intent,操作符合组织策略)和完整性(integrity,操作具备密码学真实性),通过密码学手段消除攻击类别并结合运行时策略执行,实现确定性安全:操作要么携带有效的密码学证明,要么被拒绝。此外,提出 AI 原生策略语言 MAPL,以 O(log M + N) 复杂度动态表达代理约束,替代传统 O(M × N) 规则组合,并通过轻量适配器集成九种主流框架,经形式化证明与实证验证,实现了 100% 召回率、零误报以及对 OWASP Top 10 风险中的 9 类和两个高影响 CVE 的全面防护。

链接: https://arxiv.org/abs/2602.10465
作者: Mohan Rajagopalan,Vinay Rao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Agentic AI systems automate enterprise workflows but existing defenses–guardrails, semantic filters–are probabilistic and routinely bypassed. We introduce authenticated workflows, the first complete trust layer for enterprise agentic AI. Security reduces to protecting four fundamental boundaries: prompts, tools, data, and context. We enforce intent (operations satisfy organizational policies) and integrity (operations are cryptographically authentic) at every boundary crossing, combining cryptographic elimination of attack classes with runtime policy enforcement. This delivers deterministic security–operations either carry valid cryptographic proof or are rejected. We introduce MAPL, an AI-native policy language that expresses agentic constraints dynamically as agents evolve and invocation context changes, scaling as O(log M + N) policies versus O(M x N) rules through hierarchical composition with cryptographic attestations for workflow dependencies. We prove practicality through a universal security runtime integrating nine leading frameworks (MCP, A2A, OpenAI, Claude, LangChain, CrewAI, AutoGen, LlamaIndex, Haystack) through thin adapters requiring zero protocol modifications. Formal proofs establish completeness and soundness. Empirical validation shows 100% recall with zero false positives across 174 test cases, protection against 9 of 10 OWASP Top 10 risks, and complete mitigation of two high impact production CVEs.

[MA-8] AIvilization v0: Toward Large-Scale Artificial Social Simulation with a Unified Agent Architecture and Adaptive Agent Profiles

【速读】:该论文旨在解决大规模人工社会中长期自主性与环境快速变化之间的矛盾,即如何在资源受限的环境下维持目标稳定性与反应正确性的平衡。其核心解决方案包括三个关键组件:(i) 分层分支思维规划器(hierarchical branch-thinking planner),通过将生活目标分解为并行目标分支,并结合仿真引导验证与分层重规划机制确保可行性;(ii) 自适应代理档案(adaptive agent profile),采用双过程记忆结构分离短期执行轨迹与长期语义固化,实现身份的持续演化;(iii) 人机协同引导接口(human-in-the-loop steering interface),以适当抽象层级注入长期目标和短指令,且影响通过记忆传播而非脆弱的提示覆盖方式实现。这些设计共同支撑了复杂经济行为的稳定演化与多目标长期探索能力。

链接: https://arxiv.org/abs/2602.10429
作者: Wenkai Fan,Shurui Zhang,Xiaolong Wang,Haowei Yang,Tsz Wai Chan,Xingyan Chen,Junquan Bi,Zirui Zhou,Jia Liu,Kani Chen
机构: The Hong Kong University of Science and Technology (香港科技大学); Bauhinia AI
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AIvilization v0 is a publicly deployed large-scale artificial society that couples a resource-constrained sandbox economy with a unified LLM-agent architecture, aiming to sustain long-horizon autonomy while remaining executable under rapidly changing environment. To mitigate the tension between goal stability and reactive correctness, we introduce (i) a hierarchical branch-thinking planner that decomposes life goals into parallel objective branches and uses simulation-guided validation plus tiered re-planning to ensure feasibility; (ii) an adaptive agent profile with dual-process memory that separates short-term execution traces from long-term semantic consolidation, enabling persistent yet evolving identity; and (iii) a human-in-the-loop steering interface that injects long-horizon objectives and short commands at appropriate abstraction levels, with effects propagated through memory rather than brittle prompt overrides. The environment integrates physiological survival costs, non-substitutable multi-tier production, an AMM-based price mechanism, and a gated education-occupation system. Using high-frequency transactions from the platforms mature phase, we find stable markets that reproduce key stylized facts (heavy-tailed returns and volatility clustering) and produce structured wealth stratification driven by education and access constraints. Ablations show simplified planners can match performance on narrow tasks, while the full architecture is more robust under multi-objective, long-horizon settings, supporting delayed investment and sustained exploration.

[MA-9] Can Large Language Models Implement Agent -Based Models? An ODD-based Replication Study

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否能够从标准化规范中可靠地实现基于代理的模型(Agent-Based Models, ABMs),从而支持可重复性、验证与确认(replication, verification, and validation)。其解决方案的关键在于设计了一个受控的“描述到代码”(ODD-to-code)翻译任务,以PPHP捕食者-猎物模型作为完整规范参考,通过分阶段的可执行性检查、与经验证的NetLogo基线模型的独立统计比较,以及运行时效率和可维护性的量化指标,系统评估17个主流LLM生成的Python实现质量。结果表明,行为忠实的实现虽可能但非必然,且仅具备可执行性不足以满足科学应用需求,其中GPT-4.1表现最优,而Claude 3.7 Sonnet则次优且不稳定。

链接: https://arxiv.org/abs/2602.10140
作者: Nuno Fachada,Daniel Fernandes,Carlos M. Fernandes,João P. Matos-Carvalho
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can now synthesize non-trivial executable code from textual descriptions, raising an important question: can LLMs reliably implement agent-based models from standardized specifications in a way that supports replication, verification, and validation? We address this question by evaluating 17 contemporary LLMs on a controlled ODD-to-code translation task, using the PPHPC predator-prey model as a fully specified reference. Generated Python implementations are assessed through staged executability checks, model-independent statistical comparison against a validated NetLogo baseline, and quantitative measures of runtime efficiency and maintainability. Results show that behaviorally faithful implementations are achievable but not guaranteed, and that executability alone is insufficient for scientific use. GPT-4.1 consistently produces statistically valid and efficient implementations, with Claude 3.7 Sonnet performing well but less reliably. Overall, the findings clarify both the promise and current limitations of LLMs as model engineering tools, with implications for reproducible agent-based and environmental modelling.

[MA-10] Generalized Langevin Models of Linear Agent -Based Systems: Strategic Influence Through Environmental Coupling

【速读】:该论文旨在解决传统基于代理(agent-based)模型在模拟复杂系统时忽略环境耦合的问题,这一忽略可能导致对系统动力学本质的误解。研究表明,未被观测到的环境自由度会引入记忆效应,从而显著改变系统的演化行为。解决方案的关键在于将线性更新规则系统性地转化为精确的广义朗之万方程(generalized Langevin equations),其中环境代理表现为由环境相互作用谱决定的时间尺度和耦合强度的记忆核(memory kernel)。该方法揭示了网络拓扑如何塑造记忆结构:小世界重连促使动力学趋于单一主导弛豫模式,而碎片化环境则维持多个持久模式,对应于孤立子群体。此框架进一步应用于隐蔽影响力操作场景,表明即使极端者(zealot)不直接接触目标人群,其观点也能通过环境中介扩散,最终改变目标群体的稳态响应。

链接: https://arxiv.org/abs/2602.11037
作者: Semra Gunduc,David J. Butts,Michael S. Murillo
机构: Michigan State University (密歇根州立大学); Ankara University (安卡拉大学); Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室)
类目: Physics and Society (physics.soc-ph); Multiagent Systems (cs.MA)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:Agent-based models typically treat systems in isolation, discarding environmental coupling as either computationally prohibitive or dynamically irrelevant. We demonstrate that this neglect misses essential physics: environmental degrees of freedom create memory effects that fundamentally alter system dynamics. By systematically transforming linear update rules into exact generalized Langevin equations, we show that unobserved environmental agents manifest as memory kernels whose timescales and coupling strengths are determined by the environmental interaction spectrum. Network topology shapes this memory structure in distinct ways: small-world rewiring drives dynamics toward a single dominant relaxation mode, while fragmented environments sustain multiple persistent modes corresponding to isolated subpopulations. We apply this framework to covert influence operations where adversaries manipulate target populations exclusively via environmental intermediaries. The steady-state response admits a random-walk interpretation through hitting probabilities, revealing how zealot opinions diffuse through the environment to shift system agent opinions toward the zealot mean - even when zealots never directly contact targets.

自然语言处理

[NLP-0] Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)中数据规模与训练策略之间的权衡问题,特别是针对推理型语言模型的后训练阶段。传统机器学习直觉认为增加唯一训练样本数量有助于提升泛化性能,但本文发现,在固定更新预算下,对较小数据集进行多轮迭代训练(高重复度)反而优于单轮遍历大规模数据集的训练方式。其关键解决方案在于:利用训练过程中的token准确率作为重复饱和的信号,当模型达到完全记忆(full memorization)时,继续增加训练轮次不再带来性能提升,此时应停止训练;这一机制可替代昂贵且低效的数据扩展策略,实现更高效的SFT优化。该发现揭示了“重复优势”现象——即在充分记忆状态下,模型不仅未发生灾难性遗忘,反而实现了更强的泛化能力,为理解大语言模型训练动态提供了新的研究方向。

链接: https://arxiv.org/abs/2602.11149
作者: Dawid J. Kopiczko,Sagar Vaze,Tijmen Blankevoort,Yuki M. Asano
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME’24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.

[NLP-1] Weight Decay Improves Language Model Plasticity

【速读】: 该论文试图解决当前大语言模型(Large Language Model, LLM)预训练过程中,超参数优化主要基于基模型验证损失(validation loss),而忽视了模型在下游任务中通过微调(fine-tuning)展现的适应能力(即模型可塑性,plasticity)的问题。解决方案的关键在于重新审视权重衰减(weight decay)这一关键正则化参数的作用:通过系统实验发现,采用较大权重衰减值训练的模型具有更高的可塑性,能在下游任务微调中获得更显著的性能提升;进一步机制分析表明,权重衰减有助于生成线性可分表示、规范注意力矩阵并降低训练数据过拟合,从而揭示了单一优化超参数对模型行为的多维影响。

链接: https://arxiv.org/abs/2602.11137
作者: Tessa Han,Sebastian Bordt,Hanlin Zhang,Sham Kakade
机构: Broad Institute (博德研究所); University of Tübingen (图宾根大学); Harvard University (哈佛大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model’s validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay’s mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and casts light on the multifaceted role of that a single optimization hyperparameter plays in shaping model behavior.

[NLP-2] Just on Time: Token-Level Early Stopping for Diffusion Language Models

【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models)在文本生成过程中因迭代去噪步骤冗余而导致的计算效率低下问题,尤其针对许多token在达到最终去噪步之前已提前收敛的现象。解决方案的关键在于提出一种无需训练的、基于token级别的早期停止策略,通过利用模型预测结果和局部上下文信息提取轻量级信号,动态判断每个位置的token是否已达到稳定状态,从而实现自适应的逐token冻结机制,显著减少整体扩散步数,且无需任务特定微调即可保持生成质量。

链接: https://arxiv.org/abs/2602.11133
作者: Zahar Kohut,Severyn Shykula,Dmytro Khamula,Mykola Vysotskyi,Taras Rumezhak,Volodymyr Karpiv
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. Our method leverages lightweight signals derived from the model’s predictions and local context to dynamically determine when individual tokens can be finalized. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required. Across diverse benchmarks, spanning mathematical reasoning, general question answering, and scientific understanding, our approach achieves state-of-the-art efficiency gains while preserving generation quality.

[NLP-3] EGRA: Text Encoding With Graph and Retrieval Augmentation for Misinformation Detection

【速读】: 该论文旨在解决虚假信息检测(misinformation detection)任务中因缺乏外部知识支持而导致的准确性不足问题,其核心挑战在于如何有效融合外部知识库中的结构化信息以增强文本表征能力。解决方案的关键在于提出一种名为Text Encoding with Graph (TEG) 的新方法,该方法通过从文本中提取结构化信息构建图结构(graph),并将原始文本与图结构联合编码用于分类任务,从而实现语言模型与外部知识的深度融合;进一步地,作者还提出了TEGRA框架,引入领域特定知识,显著提升了多数场景下的分类准确率。

链接: https://arxiv.org/abs/2602.11106
作者: Géraud Faye,Wassila Ouerdane,Guillaume Gadek,Céline Hudelot
机构: Airbus Defence and Space; Université Paris-Saclay, CentraleSupélec, MICS
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Misinformation detection is a critical task that can benefit significantly from the integration of external knowledge, much like manual fact-checking. In this work, we propose a novel method for representing textual documents that facilitates the incorporation of information from a knowledge base. Our approach, Text Encoding with Graph (TEG), processes documents by extracting structured information in the form of a graph and encoding both the text and the graph for classification purposes. Through extensive experiments, we demonstrate that this hybrid representation enhances misinformation detection performance compared to using language models alone. Furthermore, we introduce TEGRA, an extension of our framework that integrates domain-specific knowledge, further enhancing classification accuracy in most cases.

[NLP-4] GameDevBench: Evaluating Agent ic Capabilities Through Game Development

【速读】: 该论文旨在解决当前多模态智能体(Multimodal Agents)在游戏开发任务中评估缺乏有效测试基准的问题。现有软件开发基准难以体现游戏开发所需的复杂代码理解和多模态内容处理能力,如对着色器、精灵图和动画等视觉资产的交互。为此,作者提出GameDevBench——首个面向游戏开发任务的多模态智能体评测基准,包含132个源自网络教程的真实任务,其平均解决方案所需代码量和文件变更量远超以往软件开发基准。关键创新在于引入两种基于图像与视频的反馈机制,显著提升智能体性能(如Claude Sonnet 4.5准确率从33.3%提升至47.7%),验证了简单多模态反馈对增强智能体理解与执行能力的有效性。

链接: https://arxiv.org/abs/2602.11103
作者: Wayne Chi,Yixiong Fang,Arnav Yayavaram,Siddharth Yayavaram,Seth Karten,Qiuhong Anna Wei,Runkun Chen,Alexander Wang,Valerie Chen,Ameet Talwalkar,Chris Donahue
机构: Carnegie Mellon University (卡内基梅隆大学); Princeton University (普林斯顿大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex – the average solution requires over three times the amount of lines of code and file changes compared to prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5’s performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.

[NLP-5] Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

【速读】: 该论文旨在解决基于强化学习(Reinforcement Learning, RL)的后训练方法在提升多模态大模型(Multimodal Large-scale Reasoning Models, MLRMs)推理能力的同时,导致安全对齐性能下降及越狱攻击成功率上升的问题。其解决方案的关键在于提出一种轻量级推理时防御机制 SafeThink,该机制将安全恢复视为满足性约束(satisficing constraint)而非最大化目标,通过安全奖励模型监控推理轨迹,并在安全阈值被违反时条件性注入一个优化过的简短纠正前缀(如“Wait, think safely”),从而在早期(通常前1-3步)干预以引导生成内容回归安全路径,实现在显著降低越狱攻击成功率(30%-60%)的同时保持原始推理性能不变。

链接: https://arxiv.org/abs/2602.11096
作者: Soumya Suvra Ghosal,Souradip Chakraborty,Vaibhav Singh,Furong Huang,Dinesh Manocha,Amrit Singh Bedi
机构: University of Maryland, College Park (马里兰大学学院公园分校); IIT, Bombay (印度理工学院孟买分校); University of Central Florida (中佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix (“Wait, think safely”) only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.

[NLP-6] Can Large Language Models Make Everyone Happy?

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在安全(safety)、价值(value)和文化(culture)维度上存在的协同失调问题,即模型行为难以同时满足这三个维度的要求,导致在现实场景中出现与人类预期偏离的现象。现有基准测试如SAFETUNEBED、VALUEBENCH和WORLDVIEW-BENCH主要孤立评估各维度,无法揭示其交互关系与权衡机制;而基于机制可解释性的MIB等方法虽提供局部洞察,仍不足以系统刻画跨维度的权衡特征。为此,作者提出MisAlign-Profile统一基准,其核心创新在于构建了MISALIGNTRADE数据集——一个涵盖112个规范领域分类(含14个安全、56个价值与42个文化领域)的英文对齐-非对齐配对数据集,通过Gemma-2-9B-it生成并利用Qwen3-30B-A3B-Instruct-2507扩展,结合SimHash指纹去重技术保证多样性,并采用两阶段拒绝采样确保响应质量。实验表明,该基准能有效量化通用、微调及开源权重LLMs中的跨维度误对齐比例(12%-34%),从而为系统性分析多维对齐冲突提供了标准化工具。

链接: https://arxiv.org/abs/2602.11091
作者: Usman Naseem,Gautam Siddharth Kashyap,Ebad Shabbir,Sushant Kumar Ray,Abdullah Mohammad,Rafiq Ali
机构: Macquarie University (麦考瑞大学); DSEU-Okhla (德里科学与工程大学-奥克拉校区); University of Delhi (德里大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions, leading to behaviors that diverge from human expectations in real-world settings where these dimensions must co-occur. Existing benchmarks, such as SAFETUNEBED (safety-centric), VALUEBENCH (value-centric), and WORLDVIEW-BENCH (culture-centric), primarily evaluate these dimensions in isolation and therefore provide limited insight into their interactions and trade-offs. More recent efforts, including MIB and INTERPRETABILITY BENCHMARK-based on mechanistic interpretability, offer valuable perspectives on model failures; however, they remain insufficient for systematically characterizing cross-dimensional trade-offs. To address these gaps, we introduce MisAlign-Profile, a unified benchmark for measuring misalignment trade-offs inspired by mechanistic profiling. First, we construct MISALIGNTRADE, an English misaligned-aligned dataset across 112 normative domains taxonomies, including 14 safety, 56 value, and 42 cultural domains. In addition to domain labels, each prompt is classified with one of three orthogonal semantic types-object, attribute, or relations misalignment-using Gemma-2-9B-it and expanded via Qwen3-30B-A3B-Instruct-2507 with SimHash-based fingerprinting to avoid deduplication. Each prompt is paired with misaligned and aligned responses through two-stage rejection sampling to ensure quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs on MISALIGNTRADE-revealing 12%-34% misalignment trade-offs across dimensions.

[NLP-7] DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)训练中高质量数据配方(data recipe)设计依赖人工、耗时且需要专家经验的问题。当前虽然部分数据处理步骤(如数据合成与过滤)已可通过LLM自动化,但整体数据配方的设计仍为手动流程,难以高效适配不同目标任务。解决方案的关键在于提出DataChef-32B,一种基于代理奖励(proxy reward)的在线强化学习框架,能够自动输出完整数据配方,从而将基础模型适配至目标任务。实验表明,其生成的配方在六个独立任务上达到与人类专家相当的下游性能,例如成功将Qwen3-1.7B-Base模型适配至数学领域,在AIME’25基准上取得66.7分,超越原模型。

链接: https://arxiv.org/abs/2602.11089
作者: Yicheng Chen,Zerun Ma,Xinchen Xie,Yining Li,Kai Chen
机构: Fudan University (复旦大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the \emphdata recipe, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate \emphend-to-end data recipe generation for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME’25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.

[NLP-8] SteuerLLM : Local specialized large language model for German tax law analysis

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在严格规则驱动、术语精确且具有法律约束力的领域(如税法)中性能显著下降的问题。传统LLM虽具备强大的通用推理与语言理解能力,但在需精准引用法规条文、结构化法律论证及数值准确性要求高的税务考试场景下表现不足。解决方案的关键在于:首先构建了首个基于真实德国高校税法考试题目的开源基准数据集SteuerEx,包含115道专家验证题目,覆盖六个核心税法领域并采用逐语句部分赋分的评估机制;其次提出SteuerLLM——一个基于大规模合成训练数据(由受控检索增强管道生成)微调的领域适配模型,其280亿参数规模虽未超越某些更大模型,却在多项指标上优于同类通用指令微调模型,证明了领域特定数据与架构适配比单纯扩大参数规模对提升法律推理任务效果更为关键。

链接: https://arxiv.org/abs/2602.11081
作者: Sebastian Wind,Jeta Sopa,Laurin Schmid,Quirin Jackl,Sebastian Kiefer,Fei Wu,Martin Mayr,Harald Köstler,Gerhard Wellein,Andreas Maier,Soroosh Tayebi Arasteh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at this https URL.

[NLP-9] Chatting with Images for Introspective Visual Thinking

【速读】: 该论文旨在解决当前大规模视觉语言模型(Large Vision-Language Models, LVLMs)在单次视觉编码下易丢失细粒度视觉信息的问题,以及现有“以图像思考”方法因视觉状态与语言语义对齐不足而导致跨模态推理效果不佳的局限性,尤其在涉及远距离区域或多图场景下的空间关系推理中表现薄弱。其解决方案的关键在于提出“与图像对话”(chatting with images)的新范式,将视觉操作重构为语言引导的特征调制机制:通过表达性强的语言提示动态地对多个图像区域进行联合重编码,从而实现语言推理与视觉状态更新之间的紧密耦合。该方法在ViLaVT模型中得以实现,该模型配备专为交互式视觉推理设计的动态视觉编码器,并采用监督微调与强化学习相结合的两阶段训练策略,显著提升了复杂多图像和视频空间推理任务的表现。

链接: https://arxiv.org/abs/2602.11073
作者: Junfei Wu,Jian Guan,Qiang Liu,Shu Wu,Liang Wang,Wei Wu,Tienie Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. Recently the proposal of ‘‘thinking with images’’ attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment - particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose ‘‘chatting with images’’, a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and trained it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.

[NLP-10] Simultaneous Speech-to-Speech Translation Without Aligned Data

【速读】: 该论文旨在解决端到端语音翻译(end-to-end speech translation)中因依赖词级对齐数据而导致的训练瓶颈问题。传统方法需通过语言特定的启发式规则生成词级对齐,这不仅难以规模化收集,还限制了模型在语法结构差异大的多语言场景下的泛化能力。其解决方案的关键在于提出Hibiki-Zero,该模型完全摒弃词级对齐需求,先在句级对齐数据上训练高延迟模型,再引入一种基于GRPO(Generalized Reward Policy Optimization)的强化学习策略,在优化推理延迟的同时保持翻译质量。这一设计显著简化了训练流程,并实现了跨语言的无缝扩展,且仅需不到1000小时语音即可适配新输入语言。

链接: https://arxiv.org/abs/2602.11072
作者: Tom Labiausse,Romain Fabre,Yannick Estève,Alexandre Défossez,Neil Zeghidour
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: See inference code at: this https URL

点击查看摘要

Abstract:Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide examples, model weights, inference code and we release a benchmark containing 45h of multilingual data for speech translation evaluation.

[NLP-11] Conversational Behavior Modeling Foundation Model With Multi-Level Perception

【速读】: 该论文旨在解决全双工交互系统中自然对话建模的问题,即如何捕捉人类对话中隐含的思维链(thought chain)并将其转化为可预测、可解释的言语行为(speech act)。其核心挑战在于建立高阶交际意图与低阶言语行为之间的因果和时序依赖关系。解决方案的关键是提出一种多层级感知框架,通过分层标注方案对意图与言语行为进行建模,并引入图式推理结构(Graph-of-Thoughts, GoT),将流式预测组织为动态演化的图结构,使Transformer模型能够预测下一言语行为、生成决策依据,并持续优化推理过程。该方法在合成与真实全双工对话数据上的实验验证了其在行为检测鲁棒性、推理可解释性方面的优势,为全双工语音对话系统的对话推理能力提供了基准评估基础。

链接: https://arxiv.org/abs/2602.11065
作者: Dingkun Zhou,Shuchang Pan,Jiachen Lian,Siddharth Banerjee,Sarika Pasumarthy,Dhruv Hebbar,Siddhant Patel,Zeyi Austin Li,Kan Jen Cheng,Sanay Bordia,Krish Patel,Akshaj Gupta,Tingle Li,Gopala Anumanchipalli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.

[NLP-12] Embedding Inversion via Conditional Masked Diffusion Language Models

【速读】: 该论文旨在解决嵌入空间到文本序列的逆向映射问题,即从给定的嵌入向量(embedding)中恢复出对应的文本 token 序列。传统方法通常依赖于自回归生成(autoregressive generation),存在推理效率低、难以并行化等局限性。本文提出将嵌入反演(embedding inversion)建模为条件掩码扩散过程(conditional masked diffusion),通过迭代去噪机制在单次前向传播中并行恢复全部 token,而非逐个生成。其关键创新在于利用自适应层归一化(adaptive layer normalization)将目标嵌入作为条件信息注入扩散语言模型,从而无需访问原始编码器即可实现高效重建;实验表明,在32-token序列上使用仅78M参数模型即可达到81.3%的token准确率和0.87的余弦相似度。

链接: https://arxiv.org/abs/2602.11047
作者: Han Xiao
机构: Jina AI by Elastic (Jina AI 由 Elastic 提供支持)
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures, 7 tables. Code and demo: this https URL

点击查看摘要

Abstract:We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation. A masked diffusion language model is conditioned on the target embedding via adaptive layer normalization, requiring only 8 forward passes through a 78M parameter model with no access to the target encoder. On 32-token sequences across three embedding models, the method achieves 81.3% token accuracy and 0.87 cosine similarity.

[NLP-13] Language Model Inversion through End-to-End Differentiation

【速读】: 该论文试图解决语言模型(Language Model, LM)的可逆性问题,即在给定一个冻结的LM和期望的目标输出序列时,如何确定能够生成该目标输出的输入提示(prompt)。这一问题在现有研究中尚未得到充分探索。解决方案的关键在于将LM视为作用于token分布序列的函数(而非传统上视为作用于token序列的函数),从而实现端到端的可微分性,并基于梯度下降优化提示。作者提出了一种简单算法,使预训练LM具备可微特性,并通过优化过程高效地生成与目标输出匹配的输入提示,实验表明该方法在不同长度的提示和目标序列下均具有可靠性与效率。

链接: https://arxiv.org/abs/2602.11044
作者: Kevin Yandoka Denamganaï,Kartic Subr
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 5 figures, under review

点击查看摘要

Abstract:Despite emerging research on Language Models (LM), few approaches analyse the invertibility of LMs. That is, given a LM and a desirable target output sequence of tokens, determining what input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation. First, we propose a simple algorithm to achieve end-to-end differentiability of a given (frozen) LM and then find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths 10 and 80 for targets of length 20 , for several white-box LMs (out-of-the-box).

[NLP-14] Learning Page Order in Shuffled WOO Releases

【速读】: 该论文旨在解决文档页面排序(document page ordering)问题,尤其针对由多种异构内容(如电子邮件、法律文本和电子表格)组成的PDF文档集合,其中语义顺序信号不可靠。其关键解决方案是使用页面嵌入(page embeddings)并比较五种方法,包括指针网络(pointer networks)、序列到序列Transformer(seq2seq transformers)和专用成对排序模型(pairwise ranking models)。实验表明,最优方法在15页以内的文档上表现优异(Kendall’s tau达0.72),而模型专业化(model specialization)在长文档上带来显著提升(+0.21 tau),揭示了短文档与长文档需采用根本不同的排序策略,从而解释了课程学习(curriculum learning)失败的原因。

链接: https://arxiv.org/abs/2602.11040
作者: Efe Kahraman,Giulio Tosato
机构: utf.ai
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We investigate document page ordering on 5,461 shuffled WOO documents (Dutch freedom of information releases) using page embeddings. These documents are heterogeneous collections such as emails, legal texts, and spreadsheets compiled into single PDFs, where semantic ordering signals are unreliable. We compare five methods, including pointer networks, seq2seq transformers, and specialized pairwise ranking models. The best performing approach successfully reorders documents up to 15 pages, with Kendall’s tau ranging from 0.95 for short documents (2-5 pages) to 0.72 for 15 page documents. We observe two unexpected failures: seq2seq transformers fail to generalize on long documents (Kendall’s tau drops from 0.918 on 2-5 pages to 0.014 on 21-25 pages), and curriculum learning underperforms direct training by 39% on long documents. Ablation studies suggest learned positional encodings are one contributing factor to seq2seq failure, though the degradation persists across all encoding variants, indicating multiple interacting causes. Attention pattern analysis reveals that short and long documents require fundamentally different ordering strategies, explaining why curriculum learning fails. Model specialization achieves substantial improvements on longer documents (+0.21 tau).

[NLP-15] Linguistic Indicators of Early Cognitive Decline in the DementiaBank Pitt Corpus: A Statistical and Machine Learning Study

【速读】: 该论文旨在解决如何通过自发语言产出中的可解释语言特征来早期识别认知衰退的问题,从而支持透明且临床基础扎实的认知筛查方法。其解决方案的关键在于采用三种不同的语言表征方式(原始清洗文本、词性标注增强表示和仅词性标注的句法表示),结合机器学习模型(逻辑回归与随机森林)在不同评估协议下进行分析,并通过全局特征重要性与非参数统计检验(Mann-Whitney U测试及Cliff’s delta效应量)验证结果的稳健性和可解释性。研究发现,即使在缺乏词汇内容的情况下,句法和语法特征仍具有强大的判别能力,表明抽象语言特征是早期认知衰退的可靠标志。

链接: https://arxiv.org/abs/2602.11028
作者: Artsvik Avetisyan,Sachin Kumar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background: Subtle changes in spontaneous language production are among the earliest indicators of cognitive decline. Identifying linguistically interpretable markers of dementia can support transparent and clinically grounded screening approaches. Methods: This study analyzes spontaneous speech transcripts from the DementiaBank Pitt Corpus using three linguistic representations: raw cleaned text, a part-of-speech (POS)-enhanced representation combining lexical and grammatical information, and a POS-only syntactic representation. Logistic regression and random forest models were evaluated under two protocols: transcript-level train-test splits and subject-level five-fold cross-validation to prevent speaker overlap. Model interpretability was examined using global feature importance, and statistical validation was conducted using Mann-Whitney U tests with Cliff’s delta effect sizes. Results: Across representations, models achieved stable performance, with syntactic and grammatical features retaining strong discriminative power even in the absence of lexical content. Subject-level evaluation yielded more conservative but consistent results, particularly for POS-enhanced and POS-only representations. Statistical analysis revealed significant group differences in functional word usage, lexical diversity, sentence structure, and discourse coherence, aligning closely with machine learning feature importance findings. Conclusion: The results demonstrate that abstract linguistic features capture robust markers of early cognitive decline under clinically realistic evaluation. By combining interpretable machine learning with non-parametric statistical validation, this study supports the use of linguistically grounded features for transparent and reliable language-based cognitive screening. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.11028 [cs.CL] (or arXiv:2602.11028v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.11028 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Sachin Kumar [view email] [v1] Wed, 11 Feb 2026 16:53:57 UTC (537 KB)

[NLP-16] ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression

【速读】: 该论文旨在解决模型压缩中如何在不依赖训练(training-free)的前提下实现高效且高保真度压缩的问题,尤其针对传统方法如因子分解(factorization)、结构化稀疏(structured-sparsification)和动态压缩(dynamic compression)在压缩率与性能保持之间难以平衡的挑战。其解决方案的关键在于两个核心创新:一是将层级压缩分配建模为多选择背包问题(multi-choice knapsack problem),在全局压缩预算约束下优化各层压缩比例以最小化总重构误差;二是提出一种单步稀疏矩阵分解方法,受字典学习(dictionary learning)启发,仅用少量校准数据即可基于激活-权重敏感性进行权值稀疏化,并通过最小二乘法闭式更新字典,完全避免迭代优化、稀疏编码或反向传播(backpropagation)。该方法在20–50%压缩率下显著优于现有技术,30%压缩时无需微调即可保留90%以上原始性能,轻量微调后恢复效果更佳。

链接: https://arxiv.org/abs/2602.11008
作者: Ammar Ali,Baher Mohammad,Denis Makhov,Dmitriy Shopkhoev,Magauiya Zhussip,Stamatios Lefkimmiatis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present ROCKET, a training-free model compression method that achieves state-of-the-art performance in comparison with factorization, structured-sparsification and dynamic compression baselines. Operating under a global compression budget, ROCKET comprises two key innovations: First, it formulates layer-wise compression allocation as a multi-choice knapsack problem, selecting the optimal compression level for each layer to minimize total reconstruction error while adhering to a target model size. Second, it introduces a single-step sparse matrix factorization inspired by dictionary learning: using only a small calibration set, it sparsifies weight coefficients based on activation-weights sensitivity and then updates the dictionary in closed form via least squares bypassing iterative optimization, sparse coding, or backpropagation entirely. ROCKET consistently outperforms existing compression approaches across different model architectures at 20-50% compression rates. Notably, it retains over 90% of the original model’s performance at 30% compression without any fine-tuning. Moreover, when applying a light fine-tuning phase, recovery is substantially enhanced: for instance, compressing Qwen3-14B to an 8B-parameter model and healing it with just 30 million tokens yields performance nearly on par with the original Qwen3-8B. The code for ROCKET is at this http URL.

[NLP-17] LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules

【速读】: 该论文旨在解决标准低秩适应(Low-Rank Adaptation, LoRA)在参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)中面临的两大核心问题:一是预选最优秩(rank)及其特定超参数的困难,二是异构秩模块部署复杂以及更高级LoRA变体的实现难度。其解决方案的关键在于提出LoRA-Squeeze方法,该方法通过先以较高源秩进行微调学习一个表达能力强的解,再利用随机奇异值分解(Randomized Singular Value Decomposition, RSVD)对全权重更新矩阵进行压缩重构,从而生成目标低秩的适配器模块。实验表明,后处理压缩策略通常优于直接在目标秩下训练的LoRA模块,尤其当允许少量目标秩微调步骤时性能更优;进一步引入训练过程中逐步降低秩的“秩退火”(rank annealing)机制,可实现最佳的模型规模与性能权衡。

链接: https://arxiv.org/abs/2602.10993
作者: Ivan Vulić,Adam Grycner,Quentin de Laroussilhe,Jonas Pfeiffer
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Despite its huge number of variants, standard Low-Rank Adaptation (LoRA) is still a dominant technique for parameter-efficient fine-tuning (PEFT). Nonetheless, it faces persistent challenges, including the pre-selection of an optimal rank and rank-specific hyper-parameters, as well as the deployment complexity of heterogeneous-rank modules and more sophisticated LoRA derivatives. In this work, we introduce LoRA-Squeeze, a simple and efficient methodology that aims to improve standard LoRA learning by changing LoRA module ranks either post-hoc or dynamically during training. Our approach posits that it is better to first learn an expressive, higher-rank solution and then compress it, rather than learning a constrained, low-rank solution directly. The method involves fine-tuning with a deliberately high(er) source rank, reconstructing or efficiently approximating the reconstruction of the full weight update matrix, and then using Randomized Singular Value Decomposition (RSVD) to create a new, compressed LoRA module at a lower target rank. Extensive experiments across 13 text and 10 vision-language tasks show that post-hoc compression often produces lower-rank adapters that outperform those trained directly at the target rank, especially if a small number of fine-tuning steps at the target rank is allowed. Moreover, a gradual, in-tuning rank annealing variant of LoRA-Squeeze consistently achieves the best LoRA size-performance trade-off.

[NLP-18] Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers

【速读】: 该论文旨在解决旋转位置编码(Rotary Positional Embeddings, RoPE)在长上下文长度下行为不明确的问题,尤其是其在深度Transformer模型中如何保持位置信息的稳定性与可区分性。解决方案的关键在于将RoPE重新诠释为复振荡器组上的相位调制,从而借助经典信号处理理论进行分析:首先推导出保障位置相干性的下界——包括类奈奎斯特的混叠限制和低频模式相位漂移约束;其次揭示了深层网络中重复旋转变换会累积角度失配,导致基参数需求随层数增加而提高;此外还提出了由浮点精度限制引发的上界,超出此限则相位更新无法区分,造成位置信息湮灭。这些边界共同定义了一个依赖于精度和深度的“黄金区域”,用于指导长上下文Transformer的设计与调试。

链接: https://arxiv.org/abs/2602.10959
作者: Feilong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Rotary positional embeddings (RoPE) are widely used in large language models to encode token positions through multiplicative rotations, yet their behavior at long context lengths remains poorly characterized. In this work, we reinterpret RoPE as phase modulation applied to a bank of complex oscillators, enabling analysis through classical signal processing theory. Under this formulation, we derive principled lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. These include a fundamental aliasing bound, analogous to a Nyquist limit, and a DC-component stability bound that constrains phase drift in low-frequency positional modes. We further extend this analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment, tightening the base requirement as depth increases. Complementing these results, we derive a precision-dependent upper bound on the RoPE base arising from finite floating-point resolution. Beyond this limit, incremental phase updates become numerically indistinguishable, leading to positional erasure even in the absence of aliasing. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region a Goldilocks zone for long-context transformers. We validate the framework through a comprehensive case study of state-of-the-art models, including LLaMA, Mistral, and DeepSeek variants, showing that observed successes, failures, and community retrofits align closely with the predicted bounds. Notably, models that violate the stability bound exhibit attention collapse and long-range degradation, while attempts to scale beyond one million tokens encounter a hard precision wall independent of architecture or training. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2602.10959 [cs.LG] (or arXiv:2602.10959v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.10959 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Feilong Liu [view email] [v1] Wed, 11 Feb 2026 15:50:07 UTC (320 KB)

[NLP-19] Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models

【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在生成文本时因采用贪婪解码策略而导致的次优解码顺序问题,尤其是在需要复杂推理的任务中,局部最优决策可能引发全局性能下降。解决方案的关键在于提出一种无需训练的解码算法SOAR,其核心机制是根据模型置信度动态调整搜索范围:当不确定性较高时,短暂扩大对替代解码决策的探索空间以避免过早锁定;当置信度较高时,则收缩搜索并行解码多个位置,从而减少去噪迭代次数。这一自适应策略在数学推理和代码生成任务(如GSM8K、MBPP、HumanEval)上显著提升了生成质量,同时保持了高效的推理速度。

链接: https://arxiv.org/abs/2602.10953
作者: Mingyu Cao,Alvaro Correia,Christos Louizos,Shiwei Liu,Lu Yin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule: unmask the most confident positions, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model’s uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.

[NLP-20] Computational Phenomenology of Temporal Experience in Autism: Quantifying the Emotional and Narrative Characteristics of Lived Unpredictability

【速读】: 该论文旨在解决当前关于自闭症个体时间感知(temporality)研究中存在的三大局限:一是以缺陷为基础的医学模型主导,二是定性研究样本量不足,三是计算研究缺乏现象学锚定。为弥合现象学与计算方法之间的鸿沟并克服样本规模限制,研究整合了三种方法:基于跨诊断时间体验评估工具(Transdiagnostic Assessment of Temporal Experience)的结构化现象学访谈、针对自闭症叙事构建的自传体语料库的计算分析,以及通过叙事流畅度指标对自闭症自传的真实性进行复现性计算评估。关键解决方案在于将主观体验与客观计算分析相结合,揭示出自闭症个体的时间挑战主要源于生活经验中的不可预测性(unpredictability),而非叙事建构本身,从而提供了一个更具生态效度和机制解释力的理解框架。

链接: https://arxiv.org/abs/2602.10947
作者: Kacper Dudzic,Karolina Drożdż,Maciej Wodziński,Anastazja Szuła,Marcin Moskalewicz
机构: Maria Curie-Skłodowska University (玛丽亚·居里-斯克沃多夫斯卡大学); AMU Center for Artificial Intelligence (AMU人工智能中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Disturbances in temporality, such as desynchronization with the social environment and its unpredictability, are considered core features of autism with a deep impact on relationships. However, limitations regarding research on this issue include: 1) the dominance of deficit-based medical models of autism, 2) sample size in qualitative research, and 3) the lack of phenomenological anchoring in computational research. To bridge the gap between phenomenological and computational approaches and overcome sample-size limitations, our research integrated three methodologies. Study A: structured phenomenological interviews with autistic individuals using the Transdiagnostic Assessment of Temporal Experience. Study B: computational analysis of an autobiographical corpus of autistic narratives built for this purpose. Study C: a replication of a computational study using narrative flow measures to assess the perceived phenomenological authenticity of autistic autobiographies. Interviews revealed that the most significant differences between the autistic and control groups concerned unpredictability of experience. Computational results mirrored these findings: the temporal lexicon in autistic narratives was significantly more negatively valenced - particularly the “Immediacy Suddenness” category. Outlier analysis identified terms associated with perceived discontinuity (unpredictably, precipitously, and abruptly) as highly negative. The computational analysis of narrative flow found that the autistic narratives contained within the corpus quantifiably resemble autobiographical stories more than imaginary ones. Overall, the temporal challenges experienced by autistic individuals were shown to primarily concern lived unpredictability and stem from the contents of lived experience, and not from autistic narrative construction.

[NLP-21] SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora ATC

【速读】: 该论文旨在解决在万亿级自然语言语料库中实现超快速且灵活的搜索问题,尤其是在处理语义变体(如替换、插入和删除)时仍能保持低延迟。解决方案的关键在于结合基于后缀数组(suffix array)的字符串匹配技术与两项核心算法思想:一是通过磁盘感知设计实现快速精确查找,二是采用动态语料库感知剪枝策略来抑制因查询语义松弛导致的组合爆炸。理论分析表明,该方法利用自然语言的统计特性有效遏制了搜索空间随查询长度呈指数增长的趋势,从而在FineWeb-Edu语料库(1.4T tokens)上实现了显著低于现有方法(如infini-gram、infini-gram mini和SoftMatcha)的搜索延迟,并成功识别出传统方法未能发现的训练语料库中的基准污染问题。

链接: https://arxiv.org/abs/2602.10908
作者: Masataka Yoneda,Yusuke Matsushita,Go Kamoda,Kohei Suenaga,Takuya Akiba,Masaki Waga,Sho Yokoi
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Project Page Web Interface: this https URL , Source Code: this https URL

点击查看摘要

Abstract:We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.

[NLP-22] he CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems

【速读】: 该论文旨在解决当前金融领域大语言模型(Large Language Models, LLMs)评估体系中存在的局限性问题,即现有基准测试大多局限于单一语言、纯文本模态,并且仅覆盖狭窄的子任务,难以全面衡量模型在多语言和多模态环境下的理解、推理与决策能力。解决方案的关键在于提出首个面向金融领域的多语言、多模态评估框架——FinMMEval Lab,通过三个相互关联的任务实现系统性测评:金融考试问答(Financial Exam Question Answering)、多语言金融问答(PolyFiQA)以及金融决策制定(Financial Decision Making),从而从多个维度综合评估模型在跨语言、跨模态场景中的泛化能力和实际应用潜力。

链接: https://arxiv.org/abs/2602.10886
作者: Zhuohan Xie,Rania Elbadry,Fan Zhang,Georgi Georgiev,Xueqing Peng,Lingfei Qian,Jimin Huang,Dimitar Dimitrov,Vanshikaa Jani,Yuyang Dai,Jiahui Geng,Yuxia Wang,Ivan Koychev,Veselin Stoyanov,Preslav Nakov
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 7 pages

点击查看摘要

Abstract:We present the setup and the tasks of the FinMMEval Lab at CLEF 2026, which introduces the first multilingual and multimodal evaluation framework for financial Large Language Models (LLMs). While recent advances in financial natural language processing have enabled automated analysis of market reports, regulatory documents, and investor communications, existing benchmarks remain largely monolingual, text-only, and limited to narrow subtasks. FinMMEval 2026 addresses this gap by offering three interconnected tasks that span financial understanding, reasoning, and decision-making: Financial Exam Question Answering, Multilingual Financial Question Answering (PolyFiQA), and Financial Decision Making. Together, these tasks provide a comprehensive evaluation suite that measures models’ ability to reason, generalize, and act across diverse languages and modalities. The lab aims to promote the development of robust, transparent, and globally inclusive financial AI systems, with datasets and evaluation resources publicly released to support reproducible research.

[NLP-23] Diagnosing Structural Failures in LLM -Based Evidence Extraction for Meta-Analysis

【速读】: 该论文旨在解决生成式 AI(Generative AI)在自动化系统综述与元分析(systematic reviews and meta-analyses)中结构化证据提取能力不足的问题,其核心挑战在于如何准确捕捉研究文档中变量角色、统计方法与效应量之间的稳定绑定关系,而非仅识别孤立实体。解决方案的关键在于提出一种结构化的诊断框架,将证据提取过程建模为一系列具有递增关系复杂度和数值精度要求的模式约束查询(schema-constrained queries),从而精准定位模型失败点——包括角色颠倒、跨分析绑定漂移、密集结果段落中的实例压缩以及数值误分配等结构性缺陷,而非单纯依赖实体识别准确性。实验证明,当前主流大语言模型(LLMs)在处理多文档长上下文时无法维持必要的结构一致性,导致元分析关联元组几乎不可靠,进而使聚合统计结果失真。

链接: https://arxiv.org/abs/2602.10881
作者: Zhiyin Tan,Jennifer D’Souza
机构: L3S Research Center, Leibniz University Hannover, Hannover, Germany; TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the 22nd Conference on Information and Research Science Connecting to Digital and Library Science (IRCDL 2026)

点击查看摘要

Abstract:Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document input regimes. Across domains and models, performance remains moderate for single-property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta-analytic association tuples are extracted with near-zero reliability, and long-context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus-level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross-analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis. The code and data are publicly available at GitHub (this https URL).

[NLP-24] C-MOP: Integrating Momentum and Boundary-Aware Clustering for Enhanced Prompt Evolution

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动提示优化(Automatic Prompt Optimization)过程中因更新信号噪声大和语义冲突导致的性能不稳定问题。其解决方案的关键在于提出C-MOP框架,通过两个核心机制实现稳定优化:一是边界感知对比采样(Boundary-Aware Contrastive Sampling, BACS),利用批次级信息挖掘硬负样本、锚点和边界对,精确刻画正负样本的典型表征与决策边界;二是基于动量引导的语义聚类(Momentum-Guided Semantic Clustering, MGSC),引入具有时间衰减特性的文本动量机制,从迭代中波动的梯度中提炼出持久的语义共识,从而缓解语义冲突。实验表明,该方法显著优于当前最优基线,如PromptWizard和ProTeGi,在多个任务上平均提升1.58%和3.35%,并使一个3B参数的通用LLM超越70B参数的专业领域密集模型。

链接: https://arxiv.org/abs/2602.10874
作者: Binwei Yan,Yifei Fu,Mingjian Zhu,Hanting Chen,Mingxuan Yuan,Yunhe Wang,Hailin Hu
机构: 未知
类目: Computation and Language (cs.CL)
备注: The code is available at this https URL

点击查看摘要

Abstract:Automatic prompt optimization is a promising direction to boost the performance of Large Language Models (LLMs). However, existing methods often suffer from noisy and conflicting update signals. In this research, we propose C-MOP (Cluster-based Momentum Optimized Prompting), a framework that stabilizes optimization via Boundary-Aware Contrastive Sampling (BACS) and Momentum-Guided Semantic Clustering (MGSC). Specifically, BACS utilizes batch-level information to mine tripartite features–Hard Negatives, Anchors, and Boundary Pairs–to precisely characterize the typical representation and decision boundaries of positive and negative prompt samples. To resolve semantic conflicts, MGSC introduces a textual momentum mechanism with temporal decay that distills persistent consensus from fluctuating gradients across iterations. Extensive experiments demonstrate that C-MOP consistently outperforms SOTA baselines like PromptWizard and ProTeGi, yielding average gains of 1.58% and 3.35%. Notably, C-MOP enables a general LLM with 3B activated parameters to surpass a 70B domain-specific dense LLM, highlighting its effectiveness in driving precise prompt evolution. The code is available at this https URL.

[NLP-25] I can tell whether you are a Native Hawlêri Speaker! How ANN CNN and RNN perform in NLI-Native Language Identification

【速读】: 该论文旨在解决低资源语言中子方言(subdialect)的母语识别(Native Language Identification, NLI)问题,具体聚焦于伊拉克库尔德斯坦地区首府埃尔比勒(Hewlêr)使用的Sorani库尔德语子方言——Hewlêri的语音NLI任务。研究的关键在于构建并评估三种基于神经网络的模型(人工神经网络ANN、卷积神经网络CNN和循环神经网络RNN),通过66组实验系统比较不同音频片段长度(1–60秒)、数据不平衡处理策略(欠采样与过采样)及交叉验证方法下的性能表现,最终发现RNN模型在5秒音频分段下达到95.92%的最高准确率,同时首次建立了针对Hewlêri子方言的语音数据集,为相关领域的研究提供了重要基础资源。

链接: https://arxiv.org/abs/2602.10832
作者: Hardi Garari,Hossein Hassani
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 12 figures, 7 tables

点击查看摘要

Abstract:Native Language Identification (NLI) is a task in Natural Language Processing (NLP) that typically determines the native language of an author through their writing or a speaker through their speaking. It has various applications in different areas, such as forensic linguistics and general linguistics studies. Although considerable research has been conducted on NLI regarding two different languages, such as English and German, the literature indicates a significant gap regarding NLI for dialects and subdialects. The gap becomes wider in less-resourced languages such as Kurdish. This research focuses on NLI within the context of a subdialect of Sorani (Central) Kurdish. It aims to investigate the NLI for Hewlêri, a subdialect spoken in Hewlêr (Erbil), the Capital of the Kurdistan Region of Iraq. We collected about 24 hours of speech by recording interviews with 40 native or non-native Hewlêri speakers, 17 female and 23 male. We created three Neural Network-based models: Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), which were evaluated through 66 experiments, covering various time-frames from 1 to 60 seconds, undersampling, oversampling, and cross-validation. The RNN model showed the highest accuracy of 95.92% for 5-second audio segmentation, using an 80:10:10 data splitting scheme. The created dataset is the first speech dataset for NLI on the Hewlêri subdialect in the Sorani Kurdish dialect, which can be of benefit to various research areas.

[NLP-26] Beyond Confidence: The Rhythms of Reasoning in Generative Models ICLR2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)对输入上下文微小变化敏感的问题,这种敏感性削弱了模型预测的可靠性。传统评估指标如准确率和困惑度(perplexity)无法有效衡量局部预测鲁棒性,因为归一化的输出概率会掩盖模型内部状态对扰动的响应能力。论文提出了一种新的度量指标——Token Constraint Bound(δ_TC B),其关键在于量化LLM在主导下一个词预测发生显著变化前可承受的最大内部状态扰动,该指标与输出嵌入空间的几何结构密切相关,从而揭示模型内部预测承诺的稳定性。实验表明,δ_TC B能有效关联于提示工程的效果,并识别出困惑度在上下文学习和文本生成中遗漏的关键预测不稳定性,为分析和提升LLM的上下文稳定性提供了理论严谨且互补的新途径。

链接: https://arxiv.org/abs/2602.10816
作者: Deyuan Liu,Zecheng Wang,Zhanyue Qin,Zhiying Tu,Dianhui Chu,Dianbo Sui
机构: Harbin Institute of Technology (哈尔滨工业大学); Wechat AI (微信AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit impressive capabilities yet suffer from sensitivity to slight input context variations, hampering reliability. Conventional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can obscure the underlying resilience of an LLM’s internal state to perturbations. We introduce the Token Constraint Bound ( \delta_\mathrmTCB ), a novel metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. Intrinsically linked to output embedding space geometry, \delta_\mathrmTCB provides insights into the stability of the model’s internal predictive commitment. Our experiments show \delta_\mathrmTCB correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation. \delta_\mathrmTCB offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.

[NLP-27] Deep Learning-based Method for Expressing Knowledge Boundary of Black-Box LLM

【速读】: 该论文旨在解决黑盒大语言模型(Large Language Models, LLMs)在生成内容时出现的幻觉(hallucination)问题,其核心在于模型缺乏对其内部知识边界的意识,无法像人类一样在面对超出其知识范围的问题时明确表达自身的不确定性。针对现有研究主要聚焦于白盒模型、难以直接应用于仅提供API访问的黑盒LLM这一局限,论文提出LSCL(LLM-Supervised Confidence Learning)方法,其关键创新在于基于知识蒸馏框架设计了一个深度学习模型,以黑盒LLM输入的问题、输出的答案及token概率作为输入,构建从外部输入到模型内部知识状态的映射关系,从而实现对黑盒LLM知识边界的量化与表达。实验表明,LSCL在多个公共数据集和主流黑盒LLM上显著优于基线模型,在准确率和召回率等指标上表现优异。此外,论文还提出一种无需token概率的自适应替代方案,性能接近LSCL且优于传统方法。

链接: https://arxiv.org/abs/2602.10801
作者: Haotian Sheng,Heyong Wang,Ming Hong,Hongman He,Junqiu Liu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success, however, the emergence of content generation distortion (hallucination) limits their practical applications. The core cause of hallucination lies in LLMs’ lack of awareness regarding their stored internal knowledge, preventing them from expressing their knowledge state on questions beyond their internal knowledge boundaries, as humans do. However, existing research on knowledge boundary expression primarily focuses on white-box LLMs, leaving methods suitable for black-box LLMs which offer only API access without revealing internal parameters-largely unexplored. Against this backdrop, this paper proposes LSCL (LLM-Supervised Confidence Learning), a deep learning-based method for expressing the knowledge boundaries of black-box LLMs. Based on the knowledge distillation framework, this method designs a deep learning model. Taking the input question, output answer, and token probability from a black-box LLM as inputs, it constructs a mapping between the inputs and the model’ internal knowledge state, enabling the quantification and expression of the black-box LLM’ knowledge boundaries. Experiments conducted on diverse public datasets and with multiple prominent black-box LLMs demonstrate that LSCL effectively assists black-box LLMs in accurately expressing their knowledge boundaries. It significantly outperforms existing baseline models on metrics such as accuracy and recall rate. Furthermore, considering scenarios where some black-box LLMs do not support access to token probability, an adaptive alternative method is proposed. The performance of this alternative approach is close to that of LSCL and surpasses baseline models.

[NLP-28] Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在特定领域(如医学影像或几何问题求解)适应过程中面临的“灾难性遗忘”问题,即在提升领域性能的同时难以保持其通用多模态能力。解决方案的关键在于提出一种名为“强化课程预对齐”(Reinforced Curriculum Pre-Alignment, RCPA)的新型后训练范式,其核心是引入一种课程感知的渐进调制机制:在早期阶段通过部分输出约束安全地引导模型接触新领域概念,随着模型对领域的熟悉度提升,逐步过渡到完整的生成优化,从而在保留通用多模态能力的前提下实现高效、稳定的领域适应。

链接: https://arxiv.org/abs/2602.10740
作者: Yuming Yan,Shuo Yang,Kai Tang,Sihong Chen,Yang Zhang,Ke Xu,Dan Hu,Qun Yu,Pengfei Hu,Edith C.H. Ngai
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can enhance performance within a target domain, but it typically causes catastrophic forgetting, limiting its generalization. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs due to prohibitive computational costs and the unavailability of pretraining data for most open-source models. This necessitates efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, leading to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model’s domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing and domain-adaptive VLMs.

[NLP-29] Calliope: A TTS-based Narrated E-book Creator Ensuring Exact Synchronization Privacy and Layout Fidelity

【速读】: 该论文旨在解决将标准文本电子书(e-book)转换为高质量同步音频与文字的有声电子书(narrated e-book)这一任务缺乏开源解决方案的问题。其关键创新在于提出了一种名为Calliope的开源框架,通过直接在神经网络文本转语音(Text-to-Speech, TTS)过程中捕获音频时间戳,实现语音与文本高精度同步;同时严格保留原始出版物的排版、样式及嵌入媒体,并支持离线运行,从而避免云端服务带来的费用、隐私和版权合规风险。实验表明,相较于依赖强制对齐(forced alignment)的替代方法,该方案能有效消除音文不同步导致的漂移问题,显著提升阅读体验。

链接: https://arxiv.org/abs/2602.10735
作者: Hugo L. Hammer,Vajira Thambawita,Pål Halvorsen
机构: Oslo Metropolitan University (奥斯陆城市大学); SimulaMet (SimulaMet)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A narrated e-book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also allowing general readers to seamlessly switch between reading and listening. With the emergence of natural-sounding neural Text-to-Speech (TTS) technology, several commercial services have been developed to leverage these technology for converting standard text e-books into high-quality narrated e-books. However, no open-source solutions currently exist to perform this task. In this paper, we present Calliope, an open-source framework designed to fill this gap. Our method leverages state-of-the-art open-source TTS to convert a text e-book into a narrated e-book in the EPUB 3 Media Overlay format. The method offers several innovative steps: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher’s original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. This offline capability eliminates recurring API costs, mitigates privacy concerns, and avoids copyright compliance issues associated with cloud-based services. The framework currently supports the state-of-the-art open-source TTS systems XTTS-v2 and Chatterbox. A potential alternative approach involves first generating narration via TTS and subsequently synchronizing it with the text using forced alignment. However, while our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and text highlighting significant enough to degrade the reading experience. Source code and usage instructions are available at this https URL.

[NLP-30] Macaron: Controlled Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

【速读】: 该论文旨在解决多语言基准测试中缺乏对文化根基前提推理能力的评估问题:现有翻译数据集往往保留英语中心场景,而以文化为先的数据集又难以控制所需的推理类型。其解决方案的关键在于提出Macaron——一个基于模板优先(template-first)的基准,通过将推理类型与文化要素解耦,在多种语言中系统性地构建问题。该方法使用100个语言无关的模板覆盖7种推理类型和22种文化维度,由母语标注者生成符合情境的英文及本地语言多项选择题,并衍生出真/假判断题,从而实现对跨文化推理能力的可控测量。

链接: https://arxiv.org/abs/2602.10732
作者: Alaa Elsetohy,Sama Hadhoud,Haryo Akbarianto Wibowo,Chenxi Whitehouse,Genta Indra Winata,Fajri Koto,Alham Fikri Aji
机构: MBZUAI; Meta; Capital One
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types, 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed here this https URL.

[NLP-31] SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

【速读】: 该论文旨在解决将FP8(8-bit浮点数)注意力机制集成到DeepSeek多头潜在注意力(Multi-head Latent Attention, MLA)架构解码阶段时所面临的三大核心挑战:一是由于位置编码(RoPE, Rotary Positional Embedding)与键值(KV)缓存解耦导致的数值异质性;二是FP8 PV GEMM(矩阵乘法)中量化尺度不匹配问题,源于MLA KV缓存共享结构;三是缺乏系统级优化支持以提升长上下文效率。解决方案的关键在于硬件感知的算法-内核协同优化:(i) RoPE感知的逐token KV量化策略,保留RoPE部分高精度并采用逐token粒度以适配自回归解码过程;(ii) 重构量化PV计算流水线,消除因共享KV结构引发的量化尺度错位;(iii) 端到端数据流优化,通过专用内核实现高效读写调度,从而显著提升吞吐量(最高达1.91倍),且在数学推理和代码生成等复杂长上下文任务中保持性能稳定。

链接: https://arxiv.org/abs/2602.10718
作者: Yifan Zhang,Zunhai Su,Shuhao Hu,Rui Yang,Wei Wu,Yulei Qian,Yuchen Xie,Xunliang Cai
机构: Meituan(美团); Tsinghua University(清华大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While FP8 attention has shown substantial promise in innovations like FlashAttention-3, its integration into the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture presents notable challenges. These challenges include numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in FP8 PV GEMM, and the need for optimized system-level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework optimized to improve long-context efficiency through the following hardware-aware algorithm-kernel co-optimization techniques: (i) RoPE-Aware Per-Token KV Quantization, where the RoPE part is maintained in high precision, motivated by our comprehensive analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache. Furthermore, per-token granularity is employed to align with the autoregressive decoding process and maintain quantization accuracy. (ii) Quantized PV Computation Pipeline Reconstruction, which resolves the misalignment of quantization scale in FP8 PV computation stemming from the shared KV structure of the MLA KV cache. (iii) End-to-End Dataflow Optimization, where we establish an efficient data read-and-write workflow using specialized kernels, ensuring efficient data flow and performance gains. Extensive experiments on state-of-the-art MLA LLMs show that SnapMLA achieves up to a 1.91x improvement in throughput, with negligible risk of performance degradation in challenging long-context tasks, including mathematical reasoning and code generation benchmarks. Code is available at this https URL.

[NLP-32] Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长期对话中对隐式约束(implicit constraints)的保持与应用能力不足的问题,这类约束如用户状态、目标或价值观通常不被显式提问,但在真实交互中对生成恰当响应至关重要。现有评估基准主要关注表面事实回忆,难以衡量模型在语义线索与触发条件不一致(cue–trigger semantic disconnect)情境下的认知记忆能力。解决方案的关键在于提出LoCoMo-Plus基准和基于约束一致性(constraint consistency)的统一评估框架,该框架能更准确地识别模型是否在长程对话中正确保留并应用潜在语义约束,从而揭示传统字符串匹配指标和显式任务提示无法捕捉的认知失败。

链接: https://arxiv.org/abs/2602.10715
作者: Yifei Li,Weidong Guo,Lingling Zhang,Rongman Xu,Muye Huang,Hui Liu,Lijiao Xu,Yu Xu,Jun Liu
机构: Xi’an Jiaotong University (西安交通大学); Tencent (腾讯)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 8 figures

点击查看摘要

Abstract:Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce \textbfLoCoMo-Plus, a benchmark for assessing cognitive memory under cue–trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveals failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: this https URL.

[NLP-33] argeted Syntactic Evaluation of Language Models on Georgian Case Alignment EACL2026

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在处理格鲁吉亚语中罕见的分裂作通格(split-ergative case alignment)系统时的表现评估问题,特别是针对主语和宾语标记通过名词的主格(nominative)、与格(dative)和作格(ergative)形式变化所体现的句法任务。其解决方案的关键在于构建一个基于树库(treebank)的最小对(minimal pairs)数据集,利用 Grew 查询语言自动化生成370个语法测试样本,涵盖七个任务,每个任务包含50–70个样本,每条样本测试三种名词形式的正确分配;同时对比五种编码器和两种解码器-only 模型在词级和句级准确率上的表现,揭示了模型在作格标记识别上的显著劣势,归因于该格在句法角色中的高度特定性及训练数据稀缺。

链接: https://arxiv.org/abs/2602.10661
作者: Daniel Gallagher,Gerhard Heyer
机构: Institute for Applied Informatics (InfAI), Leipzig (莱比锡应用信息研究所)
类目: Computation and Language (cs.CL)
备注: To appear in Proceedings of The Second Workshop on Language Models for Low-Resource Languages (LoResLM), EACL 2026

点击查看摘要

Abstract:This paper evaluates the performance of transformer-based language models on split-ergative case alignment in Georgian, a particularly rare system for assigning grammatical cases to mark argument roles. We focus on subject and object marking determined through various permutations of nominative, ergative, and dative noun forms. A treebank-based approach for the generation of minimal pairs using the Grew query language is implemented. We create a dataset of 370 syntactic tests made up of seven tasks containing 50-70 samples each, where three noun forms are tested in any given sample. Five encoder- and two decoder-only models are evaluated with word- and/or sentence-level accuracy metrics. Regardless of the specific syntactic makeup, models performed worst in assigning the ergative case correctly and strongest in assigning the nominative case correctly. Performance correlated with the overall frequency distribution of the three forms (NOM DAT ERG). Though data scarcity is a known issue for low-resource languages, we show that the highly specific role of the ergative along with a lack of available training data likely contributes to poor performance on this case. The dataset is made publicly available and the methodology provides an interesting avenue for future syntactic evaluations of languages where benchmarks are limited.

[NLP-34] Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance

【速读】: 该论文试图解决的核心问题是:在语言模型预训练过程中,高质量预训练数据的定义尚不明确,尤其是基准测试(benchmark)性能是否主要由预训练语料与评估数据之间的统计模式重叠程度决定。其解决方案的关键在于通过量化词级一元语法交叉熵(word-level unigram cross-entropy)和词频统计来衡量这种重叠,并在10个零样本基准测试、4个不同规模的预训练数据集(8.5B至60B tokens)以及多个模型尺寸(400M至3B参数)下进行受控实验。结果表明,词级一元语法交叉熵与基准性能呈显著负相关,说明当前广泛使用的基准测试对预训练语料具有较强的分布内特性,即“弱分布外”(weakly out-of-distribution),因此简单地基于词重叠统计即可预测模型在这些基准上的表现。

链接: https://arxiv.org/abs/2602.10657
作者: Woojin Chung,Jeonghoon Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding what constitutes high-quality pre-training data remains a central question in language model training. In this work, we investigate whether benchmark performance is primarily driven by the degree of statistical pattern overlap between pre-training corpora and evaluation datasets. We measure this overlap using word-level unigram cross-entropy and word frequency statistics, and perform controlled experiments across 10 zero-shot benchmarks, 4 pre-training datasets spanning 8.5\mathrmB to 60\mathrmB tokens, and model sizes ranging from 400\mathrmM to 3\mathrmB parameters. Our results demonstrate a robust inverse relationship between word-level unigram cross-entropy and benchmark performance, suggesting that widely used benchmarks are strongly influenced by word overlap between training and evaluation data. Thus, larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results, indicating that word frequency statistics play an additional role in shaping benchmark scores. Taken together, these results suggest that many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, so that simple word-overlap statistics predict benchmark performance.

[NLP-35] UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)代理在自演化记忆(self-evolving memory)过程中存在的泛化能力差的问题,即现有方法通常将记忆提取视为静态过程,仅优化记忆管理,导致代理积累的是针对特定实例的噪声而非具有鲁棒性的通用记忆。解决方案的关键在于提出统一的记忆提取与管理框架(Unified Memory Extraction and Management, UMEM),通过联合优化LLM实现记忆的同步提取与更新,并引入语义邻域建模(Semantic Neighborhood Modeling)机制,结合基于GRPO(Generalized Reward Policy Optimization)的邻域级边际效用奖励策略,从语义相关查询簇的角度评估记忆效用,从而有效抑制过拟合,提升记忆的泛化性能。

链接: https://arxiv.org/abs/2602.10652
作者: Yongshi Ye,Hui Jiang,Feihu Jiang,Tian Lan,Yichao Du,Biao Fu,Xiaodong Shi,Qianghuai Jia,Longyue Wang,Weihua Luo
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Self-evolving memory serves as the trainable parameters for Large Language Models (LLMs)-based agents, where extraction (distilling insights from experience) and management (updating the memory bank) must be tightly coordinated. Existing methods predominately optimize memory management while treating memory extraction as a static process, resulting in poor generalization, where agents accumulate instance-specific noise rather than robust memories. To address this, we propose Unified Memory Extraction and Management (UMEM), a self-evolving agent framework that jointly optimizes a Large Language Model to simultaneous extract and manage memories. To mitigate overfitting to specific instances, we introduce Semantic Neighborhood Modeling and optimize the model with a neighborhood-level marginal utility reward via GRPO. This approach ensures memory generalizability by evaluating memory utility across clusters of semantically related queries. Extensive experiments across five benchmarks demonstrate that UMEM significantly outperforms highly competitive baselines, achieving up to a 10.67% improvement in multi-turn interactive tasks. Futhermore, UMEM maintains a monotonic growth curve during continuous evolution. Codes and models will be publicly released.

[NLP-36] o Think or Not To Think That is The Question for Large Reasoning Models in Theory of Mind Tasks

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在社会认知能力——特别是理论心理(Theory of Mind, ToM)任务中的表现是否优于非推理模型的问题。研究表明,尽管LRMs在数学和代码等形式推理任务中表现出色,但其在ToM任务中并未展现出一致优势,甚至有时表现更差。解决方案的关键在于识别出三个核心问题:一是“慢思考崩溃”现象,即随着推理长度增加,准确率显著下降;二是“适度且自适应的推理”能提升性能,说明动态调整推理策略的重要性;三是“选项匹配捷径”,表明模型倾向于依赖选项匹配而非真正推理。基于此,作者提出了两种干预方法:Slow-to-Fast(S2F)自适应推理机制和Think-to-Match(T2M)捷径预防策略,以验证并缓解上述问题,最终指出实现鲁棒的ToM能力需发展超越现有推理方法的独特能力。

链接: https://arxiv.org/abs/2602.10625
作者: Nanxu Gong,Haotian Li,Sixun Dong,Jianxun Lian,Yanjie Fu,Xing Xie
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions, which is essential for natural social interaction. Although recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding, it is still underexplored whether this benefit transfers to socio-cognitive skills. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models on three representative ToM benchmarks. The results show that reasoning models do not consistently outperform non-reasoning models and sometimes perform worse. A fine-grained analysis reveals three insights. First, slow thinking collapses: accuracy significantly drops as responses grow longer, and larger reasoning budgets hurt performance. Second, moderate and adaptive reasoning benefits performance: constraining reasoning length mitigates failure, while distinct success patterns demonstrate the necessity of dynamic adaptation. Third, option matching shortcut: when multiple choice options are removed, reasoning models improve markedly, indicating reliance on option matching rather than genuine deduction. We also design two intervention approaches: Slow-to-Fast (S2F) adaptive reasoning and Think-to-Match (T2M) shortcut prevention to further verify and mitigate the problems. With all results, our study highlights the advancement of LRMs in formal reasoning (e.g., math, code) cannot be fully transferred to ToM, a typical task in social reasoning. We conclude that achieving robust ToM requires developing unique capabilities beyond existing reasoning methods.

[NLP-37] How Do Decoder-Only LLM s Perceive Users? Rethinking Attention Masking for User Representation Learning

【速读】: 该论文旨在解决Decoder-only大语言模型(Large Language Models, LLMs)在用户表征学习中,注意力掩码(attention masking)设计对用户嵌入质量影响不明确的问题。现有方法多采用因果掩码(causal mask),但其限制了模型利用未来行为信息的能力,而完全双向掩码虽理论上更优,却易导致训练不稳定。为此,作者提出Gradient-Guided Soft Masking(梯度引导软掩码),一种基于梯度的预热策略,在线性调度器之前逐步开放未来注意力,从而平滑从因果到双向掩码的过渡过程,提升训练稳定性与表征质量。该方案在9个工业级用户认知基准任务上均优于因果、混合及仅使用调度器的基线方法,且兼容解码器预训练流程,验证了掩码设计与训练过渡策略对有效用户表征学习的关键作用。

链接: https://arxiv.org/abs/2602.10622
作者: Jiahao Yuan,Yike Xu,Jinyong Wen,Baokun Wang,Yang Chen,Xiaotong Lin,Wuliang Huang,Ziyi Gao,Xing Fu,Yu Cheng,Weiqiang Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at this https URL.

[NLP-38] ISD-Agent -Bench: A Comprehensive Benchmark for Evaluating LLM -based Instructional Design Agents

【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)代理在教学系统设计(Instructional Systems Design, ISD)自动化中缺乏标准化评估基准及LLM作为评判者可能引入偏差的问题。解决方案的关键在于构建了一个名为ISD-Agent-Bench的综合性评估基准,包含通过上下文矩阵框架生成的25,795个场景,该框架整合了51个跨5类的上下文变量与33个源自ADDIE模型的ISD子步骤;同时采用多评判者协议,利用来自不同提供商的多样化LLM进行评估以提升可靠性,并验证了将经典ISD理论(如ADDIE、Dick & Carey和快速原型法)与现代ReAct式推理相结合的方法在性能上优于纯理论驱动或纯技术导向的代理,从而为系统性LLM赋能的ISD研究提供了坚实基础。

链接: https://arxiv.org/abs/2602.10620
作者: YoungHoon Jeon,Suwan Kim,Haein Son,Sookbun Lee,Yeil Jeong,Unggi Lee
机构: Upstage; Opentutorials; Indiana University Bloomington; Korea University Sejong Campus
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents have shown promising potential in automating Instructional Systems Design (ISD), a systematic approach to developing educational programs. However, evaluating these agents remains challenging due to the lack of standardized benchmarks and the risk of LLM-as-judge bias. We present ISD-Agent-Bench, a comprehensive benchmark comprising 25,795 scenarios generated via a Context Matrix framework that combines 51 contextual variables across 5 categories with 33 ISD sub-steps derived from the ADDIE model. To ensure evaluation reliability, we employ a multi-judge protocol using diverse LLMs from different providers, achieving high inter-judge reliability. We compare existing ISD agents with novel agents grounded in classical ISD theories such as ADDIE, Dick \ Carey, and Rapid Prototyping ISD. Experiments on 1,017 test scenarios demonstrate that integrating classical ISD frameworks with modern ReAct-style reasoning achieves the highest performance, outperforming both pure theory-based agents and technique-only approaches. Further analysis reveals that theoretical quality strongly correlates with benchmark performance, with theory-based agents showing significant advantages in problem-centered design and objective-assessment alignment. Our work provides a foundation for systematic LLM-based ISD research.

[NLP-39] Online Causal Kalman Filtering for Stable and Effective Policy Optimization

【速读】: 该论文旨在解决大语言模型在强化学习中因高方差的token-level重要性采样(Importance Sampling, IS)比率导致策略优化不稳定的问题。现有方法通常采用固定序列级IS比率或独立调整每个token的IS比率,忽略了序列内token间的时间依赖性off-policy偏差,从而可能引起相邻token间策略梯度更新的扭曲,甚至导致训练崩溃。其解决方案的关键在于提出在线因果卡尔曼滤波(Online Causal Kalman Filtering)用于策略优化(KPO),将理想的IS比率建模为随token演化而变化的隐状态,并基于历史token状态在线、自回归地更新该状态,而不依赖未来token信息;由此得到的滤波后IS比率在保留token级局部结构感知变化的同时显著抑制噪声尖峰,从而实现更稳定且有效的策略更新。

链接: https://arxiv.org/abs/2602.10609
作者: Shuo He,Lang Feng,Xin Cheng,Lei Feng,Bo An
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which would destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token’s IS ratio separately, thereby neglecting temporal off-policy derivation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, regardless of future tokens. The resulting filtered IS ratios preserve token-wise local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.

[NLP-40] Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

【速读】: 该论文旨在解决当前生成式 AI(Generative AI)代理(agent)在实现前沿智能水平的同时,如何兼顾计算效率与稳定自我提升能力的问题。其核心挑战在于:如何在保持高推理精度和多轮交互可靠性的同时,显著降低延迟与成本,并支持大规模离策略(off-policy)训练下的持续优化。解决方案的关键在于提出 Step 3.5 Flash——一个稀疏的专家混合模型(sparse Mixture-of-Experts, MoE),通过将196B参数基础模型与仅激活11B参数的架构设计实现高效推理;结合交错的3:1滑动窗口/全注意力机制与多标记预测(Multi-Token Prediction, MTP-3)以减少多轮代理交互的延迟与开销;并构建一种可扩展的强化学习框架,融合可验证信号与偏好反馈,在大规模离策略训练中保持稳定性,从而在数学、代码和工具使用等任务上实现一致的自增强性能。

链接: https://arxiv.org/abs/2602.10604
作者: Ailin Huang,Ang Li,Aobo Kong,Bin Wang,Binxing Jiao,Bo Dong,Bojun Wang,Boyu Chen,Brian Li,Buyun Ma,Chang Su,Changxin Miao,Changyi Wan,Chao Lou,Chen Hu,Chen Xu,Chenfeng Yu,Chengting Feng,Chengyuan Yao,Chunrui Han,Dan Ma,Dapeng Shi,Daxin Jiang,Dehua Ma,Deshan Sun,Di Qi,Enle Liu,Fajie Zhang,Fanqi Wan,Guanzhe Huang,Gulin Yan,Guoliang Cao,Guopeng Li,Han Cheng,Hangyu Guo,Hanshan Zhang,Hao Nie,Haonan Jia,Haoran Lv,Hebin Zhou,Hekun Lv,Heng Wang,Heung-Yeung Shum,Hongbo Huang,Hongbo Peng,Hongyu Zhou,Hongyuan Wang,Houyong Chen,Huangxi Zhu,Huimin Wu,Huiyong Guo,Jia Wang,Jian Zhou,Jianjian Sun,Jiaoren Wu,Jiaran Zhang,Jiashu Lv,Jiashuo Liu,Jiayi Fu,Jiayu Liu,Jie Cheng,Jie Luo,Jie Yang,Jie Zhou,Jieyi Hou,Jing Bai,Jingcheng Hu,Jingjing Xie,Jingwei Wu,Jingyang Zhang,Jishi Zhou,Junfeng Liu,Junzhe Lin,Ka Man Lo,Kai Liang,Kaibo Liu,Kaijun Tan,Kaiwen Yan,Kaixiang Li,Kang An,Kangheng Lin,Lei Yang,Liang Lv,Liang Zhao,Liangyu Chen,Lieyu Shi,Liguo Tan,Lin Lin,Lina Chen,Luck Ma,Mengqiang Ren,Michael Li,Ming Li,Mingliang Li,Mingming Zhang,Mingrui Chen,Mitt Huang,Na Wang,Peng Liu,Qi Han
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Technical report for Step 3.5 Flash

点击查看摘要

Abstract:We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.

[NLP-41] When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长上下文时因上下文长度增加而导致性能下降的问题,特别是针对现有方法MemAgent在长文本推理中存在记忆爆炸和冗余计算的缺陷。其解决方案的关键在于提出GRU-Mem框架,通过引入两个受文本控制的门机制——更新门(update gate)和退出门(exit gate),实现更稳定且高效的长上下文推理:仅当更新门打开时才更新记忆,且一旦退出门开启即刻终止循环,从而避免无意义的迭代计算。为训练这两个门机制,作者在端到端强化学习(end-to-end reinforcement learning)中设计了两个奖励信号 $ r^{\text{update}} $ 和 $ r^{\text{exit}} $,分别鼓励正确的记忆更新与适时退出行为。实验表明,GRU-Mem在多个长上下文推理任务中显著优于基线MemAgent,推理速度最高提升400%。

链接: https://arxiv.org/abs/2602.10560
作者: Leheng Sheng,Yongtao Zhang,Wenchang Ma,Yaorui Shi,Ting Huang,Xiang Wang,An Zhang,Ke Shen,Tat-Seng Chua
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 26 pages

点击查看摘要

Abstract:While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs) as they suffer from performance degradation as the context length grows. Recent work MemAgent has tried to tackle this by processing context chunk-by-chunk in an RNN-like loop and updating a textual memory for final answering. However, this naive recurrent memory update faces two crucial drawbacks: (i) memory can quickly explode because it can update indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, leading to unnecessary computation after even sufficient evidence is collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. Specifically, in GRU-Mem, the memory only updates when the update gate is open and the recurrent loop will exit immediately once the exit gate is open. To endow the model with such capabilities, we introduce two reward signals r^\textupdate and r^\textexit within end-to-end RL, rewarding the correct updating and exiting behaviors respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent with up to 400% times inference speed acceleration.

[NLP-42] LHAW: Controllable Underspecification for Long-Horizon Tasks

【速读】: 该论文旨在解决长时程工作流代理(long-horizon workflow agents)在复杂、模糊场景下因缺乏系统性评估框架而导致的可靠性不足问题,尤其体现在代理对任务不明确性(underspecification)的识别、推理与澄清能力难以量化。其解决方案的关键在于提出LHAW(Long-Horizon Augmented Workflows)框架,该框架通过可控地从目标(Goals)、约束(Constraints)、输入(Inputs)和上下文(Context)四个维度移除信息,生成可配置严重程度的不完整任务变体,并基于实际代理执行结果(终端状态差异)将变体分类为结果关键型(outcome-critical)、发散型(divergent)或良性(benign),从而实现对代理澄清行为的成本敏感评估,填补了当前缺乏任务无关、可扩展的模糊性度量体系的空白。

链接: https://arxiv.org/abs/2602.10525
作者: George Pu,Michael S. Lee,Udari Madhushani Sehwag,David J. Lee,Bryan Zhu,Yash Maurya,Mohit Raghavendra,Yuan Xue,Samuel Marc Denton
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions - Goals, Constraints, Inputs, and Context - at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to our taxonomy alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.

[NLP-43] On the Robustness of Knowledge Editing for Detoxification

【速读】: 该论文旨在解决基于知识编辑(Knowledge-Editing-based, KE-based)的去毒方法在实际应用中存在可靠性不足的问题,尤其指出当前依赖自动毒性分类器的评估方式可能无法真实反映模型行为的改善。其解决方案的关键在于提出一个面向鲁棒性的评估框架,从优化鲁棒性(optimisation robustness)、组合鲁棒性(compositional robustness)和跨语言鲁棒性(cross-lingual robustness)三个维度系统检验KE-based去毒的有效性,并识别出“伪去毒”(pseudo-detoxification)这一常见失效模式——即表面毒性降低实则源于生成行为退化而非真正抑制有害内容。研究进一步揭示了多目标联合去毒效果下降及语言适配敏感性问题,表明KE-based去毒仅在特定模型、有限去毒目标数和部分语言下具有鲁棒性。

链接: https://arxiv.org/abs/2602.10504
作者: Ming Dong,Shiyi Tang,Ziyan Peng,Guanyi Chen,Tingting He
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.

[NLP-44] Canvas-of-Thought: Grounding Reasoning via Mutable Structured States

【速读】: 该论文旨在解决当前Chain-of-Thought (CoT) 提示在多模态大语言模型(Multimodal Large Language Models, MLLMs)中因依赖线性文本序列而导致的复杂任务推理瓶颈问题。具体而言,现有方法将辅助视觉元素视为静态快照,无法有效支持动态状态更新,导致局部错误修正需冗长的下游重写或全量上下文重生成,显著增加token消耗和认知负担。解决方案的关键在于提出Canvas-of-Thought (Canvas-CoT),其核心是利用HTML Canvas作为外部推理基底,使模型能够执行基于DOM的原子CRUD操作,实现无需干扰上下文的原地状态修改,并通过渲染反馈机制构建硬约束验证回路,从而显式维护“真实状态”并提升高维领域(如几何与SVG设计)中的推理精度。

链接: https://arxiv.org/abs/2602.10494
作者: Lingzhuang Sun,Yuxia Zhu,Ruitong Liu,Hao Liang,Zheng Sun,Caijun Jia,Honghao He,Yuchen Wu,Siyuan Li,Jingxuan Wei,Xiangxiang Zhang,Bihui Yu,Wentao Zhang
机构: 中国科学院大学(University of Chinese Academy of Sciences); 北京大学(Peking University); 中国科学院自动化研究所(Institute of Automation, Chinese Academy of Sciences)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat reasoning history as an immutable stream: correcting a local error necessitates either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. This limitation is particularly acute in high-dimensional domains, such as geometry and SVG design, where the textual expression of CoT lacks explicit visual guidance, further constraining the model’s reasoning precision. To bridge this gap, we introduce \textbfCanvas-of-Thought (Canvas-CoT). By leveraging a HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the “ground truth”. Furthermore, we integrate a rendering-based critique loop that serves as a hard constraint validator, providing explicit visual feedback to resolve complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.

[NLP-45] Neuro-Symbolic Synergy for Interactive World Modeling

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在作为世界模型(World Models, WMs)使用时频繁产生幻觉(hallucination)的问题,尤其是在需要严格遵守确定性转换规则的边缘场景中;同时,传统符号式世界模型(Symbolic World Models)虽具备逻辑一致性,但缺乏语义表达能力。解决方案的关键在于提出神经符号协同框架(Neuro-Symbolic Synergy, NeSyS),通过交替训练机制将LLM的概率语义先验与可执行符号规则相结合,使两者相互约束:符号WM直接调整LLM输出的概率分布以保证逻辑合规性,而神经WM仅在符号规则无法覆盖的轨迹上进行微调,从而在不损失准确性的前提下减少50%的训练数据需求,并在ScienceWorld、Webshop和Plancraft三个交互环境中显著提升世界模型预测精度与数据效率。

链接: https://arxiv.org/abs/2602.10480
作者: Hongyu Zhao,Siyu Zhou,Haolin Yang,Zengyi Qin,Tianyi Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong general-purpose reasoning capabilities, yet they frequently hallucinate when used as world models (WMs), where strict compliance with deterministic transition rules–particularly in corner cases–is essential. In contrast, Symbolic WMs provide logical consistency but lack semantic expressivity. To bridge this gap, we propose Neuro-Symbolic Synergy (NeSyS), a framework that integrates the probabilistic semantic priors of LLMs with executable symbolic rules to achieve both expressivity and robustness. NeSyS alternates training between the two models using trajectories inadequately explained by the other. Unlike rule-based prompting, the symbolic WM directly constrains the LLM by modifying its output probability distribution. The neural WM is fine-tuned only on trajectories not covered by symbolic rules, reducing training data by 50% without loss of accuracy. Extensive experiments on three distinct interactive environments, i.e., ScienceWorld, Webshop, and Plancraft, demonstrate NeSyS’s consistent advantages over baselines in both WM prediction accuracy and data efficiency.

[NLP-46] stExplora: Benchmarking LLM s for Proactive Bug Discovery via Repository-Level Test Generation

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在软件保障评估中普遍忽视“主动发现缺陷”这一目标的问题。现有方法要么将已有代码视为基准(合规陷阱),要么依赖故障后产生的数据(如问题报告)进行缺陷复现,导致难以在实际故障发生前识别潜在缺陷。为填补这一空白,作者提出TestExplora基准,用于评估LLMs作为主动测试者的性能——其核心在于隐藏所有缺陷相关信号,要求模型通过比对实现与文档推导出的意图来主动发现错误,并以文档作为验证标准。关键创新在于引入持续、时间感知的数据收集机制以维持评估可持续性并减少信息泄露;实证表明,导航复杂模块间交互及采用智能体式探索(agentic exploration)是提升LLMs实现自主软件质量保障能力的核心因素。

链接: https://arxiv.org/abs/2602.10471
作者: Steven Liu,Jane Luo,Xin Zhang,Aofan Liu,Hao Liu,Jie Wu,Ziyang Huang,Yangyu Huang,Yu Kang,Scarlett Li
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Given that Large Language Models (LLMs) are increasingly applied to automate software development, comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either treat existing code as ground truth (a compliance trap) for regression prevention, or depend on post-failure artifacts (e.g., issue reports) for bug reproduction-so they rarely surface defects before failures. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Models must proactively find bugs by comparing implementations against documentation-derived intent, using documentation as the oracle. Furthermore, to keep evaluation sustainable and reduce leakage, we propose continuous, time-aware data collection. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of only 16.06%. Further analysis indicates that navigating complex cross-module interactions and leveraging agentic exploration are critical to advancing LLMs toward autonomous software quality assurance. Consistent with this, SWEAgent instantiated with GPT-5-mini achieves an F2P of 17.27% and an F2P@5 of 29.7%, highlighting the effectiveness and promise of agentic exploration in proactive bug discovery tasks.

[NLP-47] LATA: A Tool for LLM -Assisted Translation Annotation

【速读】: 该论文旨在解决构建高质量平行语料库时,针对结构差异显著的语言对(如阿拉伯语-英语)中自动化工具难以捕捉深层语言转换和语义细微差别的问题。其解决方案的关键在于设计了一种基于大语言模型(Large Language Models, LLMs)的交互式工具,通过模板化的Prompt Manager实现受约束的JSON输出下的句子分割与对齐,并将自动化预处理嵌入“人在回路”(human-in-the-loop)的工作流中,支持研究人员在stand-off架构下对对齐结果进行精细化修正及自定义翻译技术标注,从而在效率与语言学精度之间取得平衡。

链接: https://arxiv.org/abs/2602.10454
作者: Baorong Huang,Ali Asiri
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The construction of high-quality parallel corpora for translation research has increasingly evolved from simple sentence alignment to complex, multi-layered annotation tasks. This methodological shift presents significant challenges for structurally divergent language pairs, such as Arabic–English, where standard automated tools frequently fail to capture deep linguistic shifts or semantic nuances. This paper introduces a novel, LLM-assisted interactive tool designed to reduce the gap between scalable automation and the rigorous precision required for expert human judgment. Unlike traditional statistical aligners, our system employs a template-based Prompt Manager that leverages large language models (LLMs) for sentence segmentation and alignment under strict JSON output constraints. In this tool, automated preprocessing integrates into a human-in-the-loop workflow, allowing researchers to refine alignments and apply custom translation technique annotations through a stand-off architecture. By leveraging LLM-assisted processing, the tool balances annotation efficiency with the linguistic precision required to analyze complex translation phenomena in specialized domains.

[NLP-48] he Landscape of Prompt Injection Threats in LLM Agents : From Taxonomy to Analysis

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)驱动的自主智能体(autonomous agents)在面对提示注入(Prompt Injection, PI)攻击时的安全性问题,尤其关注现有防御方法和评估基准普遍忽视上下文依赖任务(context-dependent tasks)这一关键缺陷。其解决方案的核心在于提出一个新的基准测试工具——AgentPI,用于系统性地评估智能体在依赖运行时环境信息进行决策的情境下的行为表现;通过该基准发现,当前主流防御策略无法同时实现高可信度、高效用与低延迟,并揭示了多数现有防御在简化场景中有效但难以泛化至真实环境中,从而为未来安全LLM代理的研究与部署提供了结构化指导。

链接: https://arxiv.org/abs/2602.10453
作者: Peiran Wang,Xinfeng Li,Chong Xiang,Jinghuai Zhang,Ying Li,Lixia Zhang,Xiaofeng Wang,Yuan Tian
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The evolution of Large Language Models (LLMs) has resulted in a paradigm shift towards autonomous agents, necessitating robust security against Prompt Injection (PI) vulnerabilities where untrusted inputs hijack agent behaviors. This SoK presents a comprehensive overview of the PI landscape, covering attacks, defenses, and their evaluation practices. Through a systematic literature review and quantitative analysis, we establish taxonomies that categorize PI attacks by payload generation strategies (heuristic vs. optimization) and defenses by intervention stages (text, model, and execution levels). Our analysis reveals a key limitation shared by many existing defenses and benchmarks: they largely overlook context-dependent tasks, in which agents are authorized to rely on runtime environmental observations to determine actions. To address this gap, we introduce AgentPI, a new benchmark designed to systematically evaluate agent behavior under context-dependent interaction settings. Using AgentPI, we empirically evaluate representative defenses and show that no single approach can simultaneously achieve high trustworthiness, high utility, and low latency. Moreover, we show that many defenses appear effective under existing benchmarks by suppressing contextual inputs, yet fail to generalize to realistic agent settings where context-dependent reasoning is essential. This SoK distills key takeaways and open research problems, offering structured guidance for future research and practical deployment of secure LLM agents.

[NLP-49] Control Reinforcement Learning: Token-Level Mechanistic Analysis via Learned SAE Feature Steering

【速读】: 该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAEs)在语言模型机制解释中的局限性问题,即现有方法仅能识别哪些特征被激活,而无法确定哪些特征在被放大时真正改变模型输出。为此,作者提出控制强化学习(Control Reinforcement Learning, CRL),其核心在于训练一个策略网络(policy),在每个token处选择SAE特征进行干预,从而生成可解释的干预日志;该策略能够识别出在被放大时确实影响模型输出的特征。CRL通过自适应特征掩码(Adaptive Feature Masking)促进多样化特征发现,同时保持单特征的可解释性,实现了从静态特征分析到动态干预探测的跃迁,为机制解释提供了新的分析能力,如分支点追踪、批评者轨迹分析和层间特征比较等。

链接: https://arxiv.org/abs/2602.10437
作者: Seonglae Cho,Zekun Wu,Adriano Koshiyama
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Reinforcement Learning (CRL), which trains a policy to select SAE features for steering at each token, producing interpretable intervention logs: the learned policy identifies features that change model outputs when amplified. Adaptive Feature Masking encourages diverse feature discovery while preserving singlefeature interpretability. The framework yields new analysis capabilities: branch point tracking locates tokens where feature choice determines output correctness; critic trajectory analysis separates policy limitations from value estimation errors; layer-wise comparison reveals syntactic features in early layers and semantic features in later layers. On Gemma-2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs. These results establish learned feature steering as a mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes

[NLP-50] AI-rithmetic

【速读】: 该论文试图解决当前前沿生成式 AI(Generative AI)模型在基础算术运算——特别是整数加法任务中表现不佳的问题。尽管这些模型在高级数学推理和研究辅助方面取得显著进展,但它们在简单加法任务上的准确率随数字位数增加而急剧下降,且错误模式具有高度可解释性。解决方案的关键在于识别并量化两类主要错误类型:一是由于分词(tokenization)导致的运算数位对齐错误(operand misalignment),二是进位处理失败(failure to correctly carry)。研究发现,这两类错误分别解释了不同模型(Claude Opus 4.1、GPT-5 和 Gemini 2.5 Pro)高达 62.9% 至 92.4% 的错误,其中对齐错误与分词机制密切相关,而进位错误则表现为独立的随机性失败,为改进模型结构与训练策略提供了明确方向。

链接: https://arxiv.org/abs/2602.10416
作者: Alex Bie,Travis Dick,Alex Kulesza,Prabhakar Raghavan,Vinod Raman,Sergei Vassilvitskii
机构: Google(谷歌)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern AI systems have been successfully deployed to win medals at international math competitions, assist with research workflows, and prove novel technical lemmas. However, despite their progress at advanced levels of mathematics, they remain stubbornly bad at basic arithmetic, consistently failing on the simple task of adding two numbers. We present a systematic investigation of this phenomenon. We demonstrate empirically that all frontier models suffer significantly degraded accuracy for integer addition as the number of digits increases. Furthermore, we show that most errors made by these models are highly interpretable and can be attributed to either operand misalignment or a failure to correctly carry; these two error classes explain 87.9%, 62.9%, and 92.4% of Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro errors, respectively. Finally, we show that misalignment errors are frequently related to tokenization, and that carrying errors appear largely as independent random failures.

[NLP-51] EVOKE: Emotion Vocabulary Of Korean and English

【速读】: 该论文旨在解决跨语言情绪词汇资源匮乏的问题,尤其针对英语与韩语之间情绪词的系统性映射与语义差异缺乏全面标注的现状。其解决方案的关键在于构建了一个名为EVOKE的平行情绪词汇数据集,涵盖1,427个韩语词和1,399个英语词,并对819个韩语和924个英语形容词及动词进行系统标注,包括多义性、词义关系以及语言特有情绪词识别,同时提供多对多翻译信息。该数据集具有高度的理论中立性(theory-agnostic),可支持情绪科学、心理语言学、计算语言学及自然语言处理等多个领域的研究需求。

链接: https://arxiv.org/abs/2602.10414
作者: Yoonwon Jung,Hagyeong Shin,Benjamin K. Bergen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces EVOKE, a parallel dataset of emotion vocabulary in English and Korean. The dataset offers comprehensive coverage of emotion words in each language, in addition to many-to-many translations between words in the two languages and identification of language-specific emotion words. The dataset contains 1,427 Korean words and 1,399 English words, and we systematically annotate 819 Korean and 924 English adjectives and verbs. We also annotate multiple meanings of each word and their relationships, identifying polysemous emotion words and emotion-related metaphors. The dataset is, to our knowledge, the most comprehensive, systematic, and theory-agnostic dataset of emotion words in both Korean and English to date. It can serve as a practical tool for emotion science, psycholinguistics, computational linguistics, and natural language processing, allowing researchers to adopt different views on the resource reflecting their needs and theoretical perspectives. The dataset is publicly available at this https URL.

[NLP-52] Gated Removal of Normalization in Transformers Enables Stable Training and Efficient Inference

【速读】: 该论文旨在解决预归一化(pre-norm)Transformer中内部归一化层(如RMSNorm或LayerNorm)依赖于每个样本的统计信息所带来的计算开销与推理效率瓶颈问题。其核心挑战在于:尽管归一化被广泛认为对训练稳定性至关重要,但是否所有归一化操作都必须基于逐token的统计量尚不明确。解决方案的关键是提出TaperNorm——一种可直接替换现有归一化模块的新型机制,它在训练初期保持标准归一化行为,随后通过余弦衰减逐渐过渡到一个全局学习的、与样本无关的线性/仿射映射。该设计使得归一化层可在推理阶段完全折叠进相邻的线性投影中,从而提升效率;同时理论分析表明,输出归一化的真正作用在于提供“尺度锚定”(scale anchoring),即通过近零次齐次性消除输出层的径向梯度,防止交叉熵损失引发logit无界增长(logit chasing)。这一发现揭示了归一化并非必需的逐token统计过程,而是特定结构下对模型尺度稳定性的关键约束。

链接: https://arxiv.org/abs/2602.10408
作者: Andrei Kanavalau,Carmen Amo Alonso,Sanjay Lall
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Normalization is widely viewed as essential for stabilizing Transformer training. We revisit this assumption for pre-norm Transformers and ask to what extent sample-dependent normalization is needed inside Transformer blocks. We introduce TaperNorm, a drop-in replacement for RMSNorm/LayerNorm that behaves exactly like the standard normalizer early in training and then smoothly tapers to a learned sample-independent linear/affine map. A single global gate is held at g=1 during gate warmup, used to calibrate the scaling branch via EMAs, and then cosine-decayed to g=0 , at which point per-token statistics vanish and the resulting fixed scalings can be folded into adjacent linear projections. Our theoretical and empirical results isolate scale anchoring as the key role played by output normalization: as a (near) 0 -homogeneous map it removes radial gradients at the output, whereas without such an anchor cross-entropy encourages unbounded logit growth (``logit chasing’'). We further show that a simple fixed-target auxiliary loss on the pre-logit residual-stream scale provides an explicit alternative anchor and can aid removal of the final normalization layer. Empirically, TaperNorm matches normalized baselines under identical setups while eliminating per-token statistics and enabling these layers to be folded into adjacent linear projections at inference. On an efficiency microbenchmark, folding internal scalings yields up to 1.22\times higher throughput in last-token logits mode. These results take a step towards norm-free Transformers while identifying the special role output normalization plays.

[NLP-53] Modular Multi-Task Learning for Chemical Reaction Prediction

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在化学领域中,尤其是有机反应预测任务上,如何在有限且复杂的特定领域数据集上进行高效微调的问题。其核心挑战在于:在学习新反应知识的同时,需保留模型对通用化学知识的掌握,并避免灾难性遗忘(catastrophic forgetting),从而实现多任务性能的稳定保持。解决方案的关键在于采用低秩适应(Low-Rank Adaptation, LoRA)这一参数高效微调方法,相较于全量微调(full fine-tuning),LoRA在保持预测准确率相当的前提下,显著缓解了灾难性遗忘问题,并更好地维持了多任务泛化能力,尤其在C–H官能团化等复杂反应场景下,展现出更精细的反应特异性适应潜力。

链接: https://arxiv.org/abs/2602.10404
作者: Jiayun Pang,Ahmed M. Zaitoun,Xacobe Couso Cambeiro,Ivan Vulić
机构: University of Greenwich (格林威治大学); University of Cambridge (剑桥大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:Adapting large language models (LLMs) trained on broad organic chemistry to smaller, domain-specific reaction datasets is a key challenge in chemical and pharmaceutical RD. Effective specialisation requires learning new reaction knowledge while preserving general chemical understanding across related tasks. Here, we evaluate Low-Rank Adaptation (LoRA) as a parameter-efficient alternative to full fine-tuning for organic reaction prediction on limited, complex datasets. Using USPTO reaction classes and challenging C-H functionalisation reactions, we benchmark forward reaction prediction, retrosynthesis and reagent prediction. LoRA achieves accuracy comparable to full fine-tuning while effectively mitigating catastrophic forgetting and better preserving multi-task performance. Both fine-tuning approaches generalise beyond training distributions, producing plausible alternative solvent predictions. Notably, C-H functionalisation fine-tuning reveals that LoRA and full fine-tuning encode subtly different reactivity patterns, suggesting more effective reaction-specific adaptation with LoRA. As LLMs continue to scale, our results highlight the practicality of modular, parameter-efficient fine-tuning strategies for their flexible deployment for chemistry applications.

[NLP-54] When are We Worried? Temporal Trends of Anxiety and What They Reveal about Us

【速读】: 该论文旨在解决如何通过分析社交媒体文本数据来揭示人类焦虑情绪的时间分布特征及其与语言使用模式之间的关联问题。其解决方案的关键在于利用新构建的词-焦虑关联词典(word-anxiety associations lexicon),对大量美国和加拿大用户的推文进行量化分析,识别出焦虑水平在一天中的波动规律(如早晨8点达峰、中午最低)、一周内变化趋势(周末最低、工作日最高),并进一步考察不同时态(过去、现在、未来)及代词类型(第一/第二人称 vs. 第三人称,主语 vs. 宾语)下焦虑表达的差异,从而揭示焦虑情绪与时间感知、自我关注和外向关注等认知焦点之间的系统性关系。

链接: https://arxiv.org/abs/2602.10400
作者: Saif M. Mohammad
机构: National Research Council Canada (加拿大国家研究委员会)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this short paper, we make use of a recently created lexicon of word-anxiety associations to analyze large amounts of US and Canadian social media data (tweets) to explore when we are anxious and what insights that reveals about us. We show that our levels of anxiety on social media exhibit systematic patterns of rise and fall during the day – highest at 8am (in-line with when we have high cortisol levels in the body) and lowest around noon. Anxiety is lowest on weekends and highest mid-week. We also examine anxiety in past, present, and future tense sentences to show that anxiety is highest in past tense and lowest in future tense. Finally, we examine the use of anxiety and calmness words in posts that contain pronouns to show: more anxiety in 3rd person pronouns (he, they) posts than 1st and 2nd person pronouns and higher anxiety in posts with subject pronouns (I, he, she, they) than object pronouns (me, him, her, them). Overall, these trends provide valuable insights on not just when we are anxious, but also how different types of focus (future, past, self, outward, etc.) are related to anxiety.

[NLP-55] Less is Enough: Synthesizing Diverse Data in Feature Space of LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)后训练数据多样性不足导致下游任务性能受限的问题。现有方法多依赖文本层面的指标衡量多样性,但这些指标对决定下游性能的任务相关特征捕捉能力较弱。其解决方案的关键在于提出一种可解释的特征空间度量方法——特征激活覆盖率(Feature Activation Coverage, FAC),用于量化数据在神经网络内部激活特征上的覆盖程度;并在此基础上构建FAC Synthesis框架,通过稀疏自编码器识别种子数据集中缺失的特征,并生成显式体现这些特征的合成样本,从而提升数据多样性与下游任务表现。

链接: https://arxiv.org/abs/2602.10388
作者: Zhongzhi Li,Xuansheng Wu,Yijiang Li,Lijie Hu,Ninghao Liu
机构: UGA; UCSD; MBZUAI; PolyU
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.

[NLP-56] When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在非英语专业领域、特别是法语金融文档理解任务中的可靠性不足问题。金融文档通常包含密集的法规文本、数值表格和图表,且错误提取可能导致严重后果,而现有VLMs在此类多模态复杂场景下的表现尚不明确。解决方案的关键在于构建首个针对法语金融文档理解的多模态基准测试——Multimodal Finance Eval,其包含1,204个专家验证的问题,涵盖文本提取、表格理解、图表解读及多轮对话推理等任务,并基于大语言模型作为评判者(LLM-as-judge)协议评估六种不同规模(8B–124B参数)的开源VLMs。实验表明,尽管模型在文本和表格任务中表现良好(准确率85–90%),但在图表理解上显著落后(34–62%),且多轮对话中早期错误会持续传播,导致整体准确率降至约50%,凸显了当前VLMs在交互式、多步骤金融分析中的脆弱性。该基准为高风险场景下VLM性能的量化评估与改进提供了重要工具。

链接: https://arxiv.org/abs/2602.10384
作者: Virginie Mouilleron,Théo Lasnier,Djamé Seddah
机构: Inria Paris (法国国家信息与自动化研究院巴黎分部)
类目: Computation and Language (cs.CL)
备注: 14 pages, 17 figures

点击查看摘要

Abstract:Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting. Comments: 14 pages, 17 figures Subjects: Computation and Language (cs.CL) Cite as: arXiv:2602.10384 [cs.CL] (or arXiv:2602.10384v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.10384 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-57] riggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中后门攻击的机制不明确问题,特别是针对在预训练阶段注入的语言切换型后门触发器(trigger)如何影响模型行为的理解缺失。其关键解决方案是通过激活修补(activation patching)技术,定位触发器形成于模型早期层(占模型深度的7.5%-25%),并识别出处理触发信息的注意力头(attention heads)。研究发现,这些被触发激活的注意力头与模型自然编码输出语言的功能头高度重叠(Jaccard指数介于0.18至0.66之间),表明后门触发器并非构建独立电路,而是劫持了模型已有的语言处理组件。这一发现为防御策略提供了新思路:检测方法可聚焦于监控已知功能模块而非寻找隐藏回路,而缓解措施则可能利用这种注入行为与自然行为之间的纠缠特性。

链接: https://arxiv.org/abs/2602.10382
作者: Théo Lasnier,Wissam Antoun,Francis Kulumba,Djamé Seddah
机构: Inria Paris (法国国家信息与自动化研究院巴黎分部)
类目: Computation and Language (cs.CL)
备注: 13 pages, 35 figures

点击查看摘要

Abstract:Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model’s existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.

[NLP-58] he Alignment Bottleneck in Decomposition-Based Claim Verification

【速读】: 该论文旨在解决复杂多维度事实核查中结构化子 claim 分解方法效果不稳定的问题。其核心发现表明,现有方法性能不一致的根本原因在于两个被忽视的瓶颈:证据对齐(evidence alignment)与子 claim 错误分布(sub-claim error profiles)。解决方案的关键在于:首先,必须采用细粒度且严格对齐的子 claim 证据(Sub-claim Aligned Evidence, SAE),而非重复使用原始主张级别的证据(Repeated Claim-level Evidence, SRE);其次,在子 claim 标签存在噪声时,应通过保守的“弃权”策略(abstention)来降低错误传播,而非追求高置信度但错误的预测。这为未来 claim 分解框架的设计指明了方向:需优先实现精确的证据合成机制,并校准子 claim 验证模型的标签偏差。

链接: https://arxiv.org/abs/2602.10380
作者: Mahmud Elahi Akhter,Federico Ruggeri,Iman Munire Bilal,Rob Procter,Maria Liakata
机构: Queen Mary University of London (伦敦玛丽女王大学); University of Bologna (博洛尼亚大学); University of Warwick (华威大学); The Alan Turing Institute (艾伦图灵研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Structured claim decomposition is often proposed as a solution for verifying complex, multi-faceted claims, yet empirical results have been inconsistent. We argue that these inconsistencies stem from two overlooked bottlenecks: evidence alignment and sub-claim error profiles. To better understand these factors, we introduce a new dataset of real-world complex claims, featuring temporally bounded evidence and human-annotated sub-claim evidence spans. We evaluate decomposition under two evidence alignment setups: Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE). Our results reveal that decomposition brings significant performance improvement only when evidence is granular and strictly aligned. By contrast, standard setups that rely on repeated claim-level evidence (SRE) fail to improve and often degrade performance as shown across different datasets and domains (PHEMEPlus, MMM-Fact, COVID-Fact). Furthermore, we demonstrate that in the presence of noisy sub-claim labels, the nature of the error ends up determining downstream robustness. We find that conservative “abstention” significantly reduces error propagation compared to aggressive but incorrect predictions. These findings suggest that future claim decomposition frameworks must prioritize precise evidence synthesis and calibrate the label bias of sub-claim verification models.

[NLP-59] Hardware Co-Design Scaling Laws via Roofline Modelling for On-Device LLM s

【速读】: 该论文旨在解决在资源受限的设备端部署大语言模型(Large Language Model, LLM)时,如何在模型准确率与推理延迟、硬件效率之间实现最优平衡的问题。其核心挑战在于,不同硬件平台对LLM架构的适配需求差异显著,传统独立设计软硬件的方法难以满足实时性与性能的双重约束。解决方案的关键在于提出一种硬件协同设计规律(hardware co-design law),通过将训练损失建模为架构超参数的显式函数,并结合屋顶线模型(roofline modelling)刻画推理延迟,从而建立准确率与延迟之间的直接映射关系。在此基础上,作者通过对1942个候选架构进行大规模实验和训练,拟合出架构到训练损失的缩放律,并联合延迟模型识别出帕累托前沿(Pareto frontier),最终实现基于精度与性能联合优化的架构搜索,显著缩短设计周期(从数月降至数天),并在相同延迟下实现比Qwen2.5-0.5B更低的困惑度(19.42%)。

链接: https://arxiv.org/abs/2602.10377
作者: Luoyang Sun,Jiwen Jiang,Yifeng Ding,Fengfa Li,Yan Song,Haifeng Zhang,Jian Ying,Lei Ren,Kun Zhan,Wei Chen,Yan Xie,Cheng Deng
机构: AI Lab, The Yangtze River Delta; Institution of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Li Auto; University College London; The University of Edinburgh
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision-Language-Action Models (VLAs) have emerged as a key paradigm of Physical AI and are increasingly deployed in autonomous vehicles, robots, and smart spaces. In these resource-constrained on-device settings, selecting an appropriate large language model (LLM) backbone is a critical challenge: models must balance accuracy with strict inference latency and hardware efficiency constraints. This makes hardware-software co-design a game-changing requirement for on-device LLM deployment, where each hardware platform demands a tailored architectural solution. We propose a hardware co-design law that jointly captures model accuracy and inference performance. Specifically, we model training loss as an explicit function of architectural hyperparameters and characterise inference latency via roofline modelling. We empirically evaluate 1,942 candidate architectures on NVIDIA Jetson Orin, training 170 selected models for 10B tokens each to fit a scaling law relating architecture to training loss. By coupling this scaling law with latency modelling, we establish a direct accuracy-latency correspondence and identify the Pareto frontier for hardware co-designed LLMs. We further formulate architecture search as a joint optimisation over precision and performance, deriving feasible design regions under industrial hardware and application budgets. Our approach reduces architecture selection from months to days. At the same latency as Qwen2.5-0.5B on the target hardware, our co-designed architecture achieves 19.42% lower perplexity on WikiText-2. To our knowledge, this is the first principled and operational framework for hardware co-design scaling laws in on-device LLM deployment. We will make the code and related checkpoints publicly available.

[NLP-60] Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

【速读】: 该论文旨在解决计算机使用代理(Computer-use Agents, CUAs)在现实数字环境中因分布偏移和未见场景频繁出现而导致的持续学习问题,尤其是如何在不依赖昂贵人工标注的情况下获取高质量、环境相关的代理数据。解决方案的关键在于提出一种自主课程强化学习框架(Autonomous Curriculum Reinforcement Learning, ACuRL),其核心机制包括:1)通过代理在目标环境中的初始探索获取基础经验;2)利用前一迭代的经验与反馈,由课程任务生成器合成适配代理当前能力的新任务;3)引入CUAJudge这一鲁棒的自动评估器提供可靠奖励信号(与人类判断一致性达93%)。该方法实现了跨环境和环境内持续学习,在无灾难性遗忘的前提下带来4–22%的性能提升,并通过稀疏参数更新(如仅更新20%参数)展现出高效且稳定的适应能力。

链接: https://arxiv.org/abs/2602.10356
作者: Tianci Xue,Zeyi Liao,Tianneng Shi,Zilu Wang,Kai Zhang,Dawn Song,Yu Su,Huan Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注: 24 pages, 6 figures

点击查看摘要

Abstract:Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent’s current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., 20% parameters), which helps explain the effective and robust adaptation. Our data and code are available at this https URL.

[NLP-61] Physically Interpretable AlphaEarth Foundation Model Embeddings Enable LLM -Based Land Surface Intelligence

【速读】: 该论文旨在解决卫星基础模型(Satellite Foundation Models)生成的密集嵌入(dense embeddings)在物理可解释性方面的不足问题,这一局限性阻碍了其在环境决策系统中的集成应用。解决方案的关键在于通过大规模实证分析(1210万样本,覆盖美国本土2017–2023年),结合线性、非线性和注意力机制方法,系统揭示了Google AlphaEarth的64维嵌入与26个环境变量(包括气候、植被、水文、温度和地形等)之间的映射关系,并验证了嵌入空间对多数环境变量具有高保真重建能力(12个变量R² > 0.90,温度与高程接近R² = 0.97)。基于此可解释性验证,研究进一步构建了一个土地表面智能系统(Land Surface Intelligence),利用FAISS索引数据库实现检索增强生成(Retrieval-Augmented Generation, RAG),将自然语言环境查询转化为基于卫星数据的精准评估,最终通过LLM-as-Judge评估框架(360次查询-响应循环)证明该系统具备强地面关联性和语义连贯性(平均得分μ = 3.74 ± 0.77,其中 grounding μ = 3.93,coherence μ = 4.25)。

链接: https://arxiv.org/abs/2602.10354
作者: Mashrekur Rahman
机构: Dartmouth College (达特茅斯学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Satellite foundation models produce dense embeddings whose physical interpretability remains poorly understood, limiting their integration into environmental decision systems. Using 12.1 million samples across the Continental United States (2017–2023), we first present a comprehensive interpretability analysis of Google AlphaEarth’s 64-dimensional embeddings against 26 environmental variables spanning climate, vegetation, hydrology, temperature, and terrain. Combining linear, nonlinear, and attention-based methods, we show that individual embedding dimensions map onto specific land surface properties, while the full embedding space reconstructs most environmental variables with high fidelity (12 of 26 variables exceed R^2 0.90 ; temperature and elevation approach R^2 = 0.97 ). The strongest dimension-variable relationships converge across all three analytical methods and remain robust under spatial block cross-validation (mean \Delta R^2 = 0.017 ) and temporally stable across all seven study years (mean inter-year correlation r = 0.963 ). Building on these validated interpretations, we then developed a Land Surface Intelligence system that implements retrieval-augmented generation over a FAISS-indexed embedding database of 12.1 million vectors, translating natural language environmental queries into satellite-grounded assessments. An LLM-as-Judge evaluation across 360 query–response cycles, using four LLMs in rotating generator, system, and judge roles, achieved weighted scores of \mu = 3.74 \pm 0.77 (scale 1–5), with grounding ( \mu = 3.93 ) and coherence ( \mu = 4.25 ) as the strongest criteria. Our results demonstrate that satellite foundation model embeddings are physically structured representations that can be operationalized for environmental and geospatial intelligence.

[NLP-62] Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

【速读】: 该论文旨在解决生成式 AI(Generative AI)中自解释方法(self-interpretation methods)因超参数敏感性而导致的不可靠问题。其核心解决方案是:在保持语言模型(Language Model, LM)完全冻结的前提下,仅对轻量级适配器(adapter)进行训练,使其学习从解释性人工制品(interpretability artifacts)中提取结构化信息。关键创新在于使用一个仅含 dmodel+1d_\text{model}+1 个参数的标量仿射适配器,即可实现跨任务和模型家族的稳定自解释能力——该适配器能生成比原始标签更优的稀疏自动编码器特征标签(71% vs 63% 生成评分),识别主题的召回率高达94%(对比未训练基线的1%),并可解码多跳推理中既不在提示也不在响应中的桥接实体,揭示隐式推理路径而无需链式思维(chain-of-thought)。研究表明,这种自解释能力随模型规模提升而增强,且无需修改被解释模型本身。

链接: https://arxiv.org/abs/2602.10352
作者: Keenan Pepper,Alex McKenzie,Florin Pop,Stijn Servaes,Martin Leitgab,Mike Vaiana,Judd Rosenblatt,Michael S. A. Graziano,Diogo de Lucena
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 17 tables, 17 figures. Code and data at this https URL

点击查看摘要

Abstract:Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just d_\textmodel+1 parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.

[NLP-63] When Less Is More? Diagnosing ASR Predictions in Sardinian via Layer-Wise Decoding

【速读】: 该论文试图解决的问题是:在多语言语音模型中,最终输出层的音素级预测是否最优,尤其是在低资源语言(如Campidanese Sardinian)中,标准评估指标(如音素错误率,PER)是否能准确反映模型的语音表征能力。解决方案的关键在于采用逐层解码策略(layer-wise decoding),对预训练的Wav2Vec2模型进行中间层探查(probing),发现音素级预测质量在中间层(而非最终层)达到最佳,并通过细粒度对齐分析揭示了中间层预测更精确保留音段身份、减少过生成和特定类别的音系错误;此外,论文提出“回退错误”(regressive errors)的概念,指出最终层可能因过度抽象而覆盖中间层的正确预测,从而暴露表面误差指标的局限性。这一方法为低资源场景下的自动语音识别(ASR)模型诊断提供了新的视角。

链接: https://arxiv.org/abs/2602.10350
作者: Domenico De Cristofaro,Alessandro Vietti,Marianne Pouplier,Aleese Block
机构: Free University of Bozen-Bolzano, Italy; ALPS, Alpine Laboratory of Phonetic Sciences; LMU Munich, Germany
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent studies have shown that intermediate layers in multilingual speech models often encode more phonetically accurate representations than the final output layer. In this work, we apply a layer-wise decoding strategy to a pretrained Wav2Vec2 model to investigate how phoneme-level predictions evolve across encoder layers, focusing on Campidanese Sardinian, a low-resource language. We show that truncating upper transformer layers leads to improved Phoneme Error Rates (PER), with the best performance achieved not at the final layer, but two layers earlier. Through fine-grained alignment analysis, we find that intermediate predictions better preserve segmental identity, avoid overgeneration, and reduce certain classes of phonological errors. We also introduce the notion of regressive errors, cases where correct predictions at intermediate layers are overwritten by errors at the final layer. These regressions highlight the limitations of surface-level error metrics and reveal how deeper layers may generalize or abstract away from acoustic detail. Our findings support the use of early-layer probing as a diagnostic tool for ASR models, particularly in low-resource settings where standard evaluation metrics may fail to capture linguistically meaningful behavior.

[NLP-64] Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在开放式生成任务中如何平衡多样性与创造力和逻辑一致性的问题。现有基于截断的采样方法虽有效,但主要依赖概率质量与熵等启发式指标,忽略了词元空间中的语义几何结构。其解决方案的关键在于提出一种几何感知的截断规则——Top-W,该方法利用基于词元嵌入几何的Wasserstein距离来衡量裁剪后分布与原分布的接近程度,并显式地在保留的概率质量与所选集合的熵之间进行权衡。理论分析揭示了固定势能子集更新的闭式结构:根据质量-熵权衡,最优裁剪要么退化为单个词元,要么形成可由线性扫描高效求解的一维前缀。通过结合高效的几何势能函数(如最近邻或k-NN)与交替解码机制,Top-W在保持标准截断-采样接口不变的前提下,显著提升了多个基准测试(GSM8K、GPQA、AlpacaEval 和 MT-Bench)上的性能,最高提升达33.7%,同时增强了判别式开放生成评估下的创造性表现。

链接: https://arxiv.org/abs/2602.10346
作者: Arash Gholami Davoodi,Navid Rezazadeh,Seyed Pouyan Mousavi Davoudi,Pouya Pezeshkpour
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance-defined over token-embedding geometry-to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.

[NLP-65] he Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage

【速读】: 该论文旨在解决交通执法过程中“尊重”(respect)这一核心维度的主观性问题,即不同群体对警察与公民互动中尊重程度的感知存在显著差异,而这种差异直接影响公众信任和执法合法性。为实现跨社区视角的一致性分析,研究者构建了首个大规模交通拦截数据集,包含来自警察关联、司法系统受影响群体及非关联居民三类人群的尊重评分与自由文本理由,并基于程序正义理论、洛杉矶警察局培训材料及实地调研开发了领域特定评估量表;其关键解决方案在于提出了一种视角感知建模框架(perspective-aware modeling framework),能够预测个性化尊重评分并生成针对不同 annotator 的解释性理由,从而提升评分预测性能与理由一致性,助力执法机构理解多元社区期望,增强程序正当性与公众信任。

链接: https://arxiv.org/abs/2602.10339
作者: Preni Golazizian,Elnaz Rahmati,Jackson Trager,Zhivar Sourati,Nona Ghazizadeh,Georgios Chochlakis,Jose Alcocer,Kerby Bennett,Aarya Vijay Devnani,Parsa Hejabi,Harry G. Muttram,Akshay Kiran Padte,Mehrshad Saadatinia,Chenhao Wu,Alireza S. Zaibari,Michael Sierra-Arévalo,Nick Weller,Shrikanth Narayanan,Benjamin A. T. Graham,Morteza Dehghani
机构: University of Southern California (南加州大学); University of California Riverside (加州大学河滨分校); University of California Los Angeles (加州大学洛杉矶分校); University of Southern California (南加州大学); Harvard University (哈佛大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Traffic stops are among the most frequent police-civilian interactions, and body-worn cameras (BWCs) provide a unique record of how these encounters unfold. Respect is a central dimension of these interactions, shaping public trust and perceived legitimacy, yet its interpretation is inherently subjective and shaped by lived experience, rendering community-specific perspectives a critical consideration. Leveraging unprecedented access to Los Angeles Police Department BWC footage, we introduce the first large-scale traffic-stop dataset annotated with respect ratings and free-text rationales from multiple perspectives. By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities. To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-consistent alignment; and (iii) propose a perspective-aware modeling framework that predicts personalized respect ratings and generates annotator-specific rationales for both officers and civilian drivers from traffic-stop transcripts. Across all three annotator groups, our approach improves both rating prediction performance and rationale alignment. Our perspective-aware framework enables law enforcement to better understand diverse community expectations, providing a vital tool for building public trust and procedural legitimacy.

[NLP-66] Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality

【速读】: 该论文旨在解决的问题是:在大型语言模型(Large Language Models, LLMs)中,资源理性(resource rationality)是否能在不显式引入计算成本奖励的情况下,通过推理时扩展计算(inference-time scaling)自然涌现。解决方案的关键在于设计了一个变量归因任务(Variable Attribution Task),该任务系统性地调控任务复杂度(如候选变量数量和输入输出样本数),从而观察模型如何从暴力搜索策略转向分析型策略。实验表明,尽管指令微调模型(instruction-tuned models)在复杂逻辑函数(如XOR/XNOR)上性能下降,而强化学习训练的大型推理模型(Large Reasoning Models, LRMs)保持鲁棒性,二者均随复杂度增加自发调整推理行为——这证明资源理性是推理时缩放本身所催生的涌现特性,无需额外的成本感知奖励机制。

链接: https://arxiv.org/abs/2602.10329
作者: Zhimin Hu,Riya Roshan,Sashank Varma
机构: Georgia Tech (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human reasoning is shaped by resource rationality – optimizing performance under constraints. Recently, inference-time scaling has emerged as a powerful paradigm to improve the reasoning performance of Large Language Models by expanding test-time computation. Specifically, instruction-tuned (IT) models explicitly generate long reasoning steps during inference, whereas Large Reasoning Models (LRMs) are trained by reinforcement learning to discover reasoning paths that maximize accuracy. However, it remains unclear whether resource-rationality can emerge from such scaling without explicit reward related to computational costs. We introduce a Variable Attribution Task in which models infer which variables determine outcomes given candidate variables, input-output trials, and predefined logical functions. By varying the number of candidate variables and trials, we systematically manipulate task complexity. Both models exhibit a transition from brute-force to analytic strategies as complexity increases. IT models degrade on XOR and XNOR functions, whereas LRMs remain robust. These findings suggest that models can adjust their reasoning behavior in response to task complexity, even without explicit cost-based reward. It provides compelling evidence that resource rationality is an emergent property of inference-time scaling itself.

[NLP-67] On Emergent Social World Models – Evidence for Functional Integration of Theory of Mind and Prag matic Reasoning in Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LMs)是否具备共享的计算机制来处理通用心智理论(Theory of Mind, ToM)与特定语言语用推理的问题,从而探讨LMs是否可能涌现出“社会世界模型”(social world models),即跨任务复用的心理状态表征(功能整合假说)。其解决方案的关键在于采用行为评估与因果机制实验相结合的方法,通过受认知神经科学启发的功能定位技术,在比以往研究更大规模的局部化数据集上分析LM在七个子类ToM能力上的表现,结果提供了支持功能整合假说的初步证据,表明LMs可能发展出相互关联的“社会世界模型”,而非孤立的能力模块。

链接: https://arxiv.org/abs/2602.10298
作者: Polina Tsvilodub,Jan-Felix Klumpp,Amir Mohammadpour,Jennifer Hu,Michael Franke
机构: University of Tübingen (图宾根大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注: 29 pages, 13 figures, under review

点击查看摘要

Abstract:This paper investigates whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning in order to contribute to the general question of whether LMs may be said to have emergent “social world models”, i.e., representations of mental states that are repurposed across tasks (the functional integration hypothesis). Using behavioral evaluations and causal-mechanistic experiments via functional localization methods inspired by cognitive neuroscience, we analyze LMs’ performance across seven subcategories of ToM abilities (Beaudoin et al., 2020) on a substantially larger localizer dataset than used in prior like-minded work. Results from stringent hypothesis-driven statistical testing offer suggestive evidence for the functional integration hypothesis, indicating that LMs may develop interconnected “social world models” rather than isolated competencies. This work contributes novel ToM localizer data, methodological refinements to functional localization techniques, and empirical insights into the emergence of social cognition in artificial systems.

[NLP-68] Learning to Evict from Key-Value Cache

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理阶段因自回归Key-Value (KV) 缓存内存需求过大而导致的效率瓶颈问题。现有缓存淘汰或压缩方法依赖启发式策略(如最近访问时间或历史注意力分数),这些策略仅作为未来有用性的间接代理,且引入额外计算开销。解决方案的关键在于将KV缓存淘汰问题重新建模为强化学习(Reinforcement Learning, RL)任务:通过训练轻量级的每头(per-head)RL代理,在预计算的生成轨迹上仅使用键(key)和值(value)向量来学习对token按其预测未来解码有用性进行排序。每个代理学习一个由未来效用引导的专用淘汰策略,从而在不修改底层LLM或增加额外推理成本的前提下,实现跨不同缓存预算的高质量排序评估。这一方法显著优于现有基线,并在长上下文和多轮对话场景中展现出强大泛化能力。

链接: https://arxiv.org/abs/2602.10238
作者: Luca Moschella,Laura Manduchi,Ozan Sener
机构: Apple(苹果)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 23 pages, 15 figures

点击查看摘要

Abstract:The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token’s future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by future utility, which evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or additional inference. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.

[NLP-69] Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards

【速读】: 该论文旨在解决Group Relative Policy Optimization (GRPO)在处理具有明确结构和目标的生成任务时,因将单一标量优势值分配给所有token而导致不同语义段落间奖励信号耦合的问题,从而引发目标干扰(objective interference)和信用分配错误(misattributed credit)。其解决方案的关键在于提出一种名为Blockwise Advantage Estimation的方法族,该方法为每个目标独立分配优势值,并仅将其应用于对应文本块中的token,从而减少对人工设计标量奖励的依赖并自然扩展至多目标场景;其中核心创新是引入Outcome-Conditioned Baseline,通过根据前缀导出的中间结果对样本进行分层统计,仅利用组内统计信息近似中间状态价值,避免了传统无偏估计所需的昂贵嵌套回放(nested rollouts),显著提升了效率与可扩展性。

链接: https://arxiv.org/abs/2602.10231
作者: Kirill Pavlenko,Alexander Golubev,Simon Karasik,Boris Yangel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion. For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit. We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block, reducing reliance on hand-designed scalar rewards and scaling naturally to additional objectives. A key challenge is estimating advantages for later blocks whose rewards are conditioned on sampled prefixes; standard unbiased approaches require expensive nested rollouts from intermediate states. Concretely, we introduce an Outcome-Conditioned Baseline that approximates intermediate state values using only within-group statistics by stratifying samples according to a prefix-derived intermediate outcome. On math tasks with uncertainty estimation, our method mitigates reward interference, is competitive with a state-of-the-art reward-designed approach, and preserves test-time gains from confidence-weighted ensembling. More broadly, it provides a modular recipe for optimizing sequential objectives in structured generations without additional rollouts.

[NLP-70] Latent Thoughts Tuning: Bridging Context and Reasoning with Fused Information in Latent Tokens

【速读】: 该论文旨在解决当前基于连续潜空间(continuous latent space)的推理方法中存在的特征坍缩(feature collapse)和不稳定性问题,这些问题通常源于在递归使用隐藏状态作为输入嵌入时的分布不匹配,或依赖辅助模型时的对齐偏差。解决方案的关键在于提出Latent Thoughts Tuning (LT-Tuning) 框架,其核心创新是引入Context-Prediction-Fusion机制,通过联合利用上下文相关的隐藏状态与来自词汇嵌入空间的预测语义引导来重构潜思考(latent thoughts),同时结合渐进式的三阶段课程学习策略,实现潜思考与显式思维模式之间的动态切换,从而有效缓解特征坍缩并提升推理准确性。

链接: https://arxiv.org/abs/2602.10229
作者: Weihao Liu,Dehai Min,Lu Cheng
机构: 未知
类目: Computation and Language (cs.CL)
备注: version 1.0

点击查看摘要

Abstract:While explicit Chain-of-Thought (CoT) equips Large Language Models (LLMs) with strong reasoning capabilities, it requires models to verbalize every intermediate step in text tokens, constraining the model thoughts to the discrete vocabulary space. Recently, reasoning in continuous latent space has emerged as a promising alternative, enabling more robust inference and flexible computation beyond discrete token constraints. However, current latent paradigms often suffer from feature collapse and instability, stemming from distribution mismatches when recurrently using hidden states as the input embeddings, or alignment issues when relying on assistant models. To address this, we propose Latent Thoughts Tuning (LT-Tuning), a framework that redefines how latent thoughts are constructed and deployed. Instead of relying solely on raw hidden states, our method introduces a Context-Prediction-Fusion mechanism that jointly leveraging contextual hidden states and predictive semantic guidance from the vocabulary embedding space. Combined with a progressive three-stage curriculum learning pipeline, LT-Tuning also enables dynamically switching between latent and explicit thinking modes. Experiments demonstrate that our method outperforms existing latent reasoning baselines, effectively mitigating feature collapse and achieving robust reasoning accuracy.

[NLP-71] owards Autonomous Mathematics Research

【速读】: 该论文旨在解决将AI从竞赛级数学问题求解能力扩展至专业数学研究场景中的挑战,尤其是如何实现长周期、高复杂度的数学证明构建与文献导航。其核心解决方案在于提出Aletheia——一个端到端的数学研究代理,关键创新包括:基于改进版Gemini Deep Think的深度推理引擎以应对高难度问题、提出一种超越奥数级别问题的推理时缩放定律(inference-time scaling law),以及通过密集工具调用增强对数学研究复杂性的处理能力。该系统已在无干预生成论文(Feng26)、人机协作攻关独立集边界问题(LeeSeo26)及自主解决700个开放问题中的4个等方面验证了其有效性,标志着AI在数学研究中迈向更高自主性和新颖性水平。

链接: https://arxiv.org/abs/2602.10177
作者: Tony Feng,Trieu H. Trinh,Garrett Bingham,Dawsen Hwang,Yuri Chervonyi,Junehyuk Jung,Joonkyung Lee,Carlo Pagano,Sang-hyun Kim,Federico Pasqualotto,Sergei Gukov,Jonathan N. Lee,Junsu Kim,Kaiying Hou,Golnaz Ghiasi,Yi Tay,YaGuang Li,Chenkai Kuang,Yuan Liu,Hanzhao(Maggie)Lin,Evan Zheran Liu,Nigamaa Nayakanti,Xiaomeng Yang,Heng-tze Cheng,Demis Hassabis,Koray Kavukcuoglu,Quoc V. Le,Thang Luong
机构: Google DeepMind(谷歌深度思维)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Project page: this https URL

点击查看摘要

Abstract:Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem-solving to professional research, however, requires navigating vast literature and constructing long-horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems, a novel inference-time scaling law that extends beyond Olympiad-level problems, and intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD-level exercises and most notably, through several distinct milestones in AI-assisted mathematics research: (a) a research paper (Feng26) generated by AI without any human intervention in calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human-AI collaboration in proving bounds on systems of interacting particles called independent sets; and © an extensive semi-autonomous evaluation (Feng et al., 2026a) of 700 open problems on Bloom’s Erdos Conjectures database, including autonomous solutions to four open questions. In order to help the public better understand the developments pertaining to AI and mathematics, we suggest codifying standard levels quantifying autonomy and novelty of AI-assisted results. We conclude with reflections on human-AI collaboration in mathematics.

[NLP-72] Omni-Safety under Cross-Modality Conflict: Vulnerabilities Dynamics Mechanisms and Efficient Alignment

【速读】: 该论文旨在解决Omni-modal Large Language Models (OLLMs)在跨模态交互中引入的安全风险问题,特别是针对有害输入时模型拒绝响应能力不足的漏洞。其解决方案的关键在于提出了一种基于模态-语义解耦原则的新方法:通过机制分析发现“中间层溶解”现象(Mid-layer Dissolution)及存在模态不变的纯拒绝方向,进而利用奇异值分解(Singular Value Decomposition)提取出黄金拒绝向量(golden refusal vector),并设计轻量级适配器模块 OmniSteer,实现对干预强度的自适应调节。实验表明,该方法将有害输入的拒绝成功率从69.9%提升至91.2%,同时保持各模态下的通用能力不受显著影响。

链接: https://arxiv.org/abs/2602.10161
作者: Kun Wang,Zherui Li,Zhenhong Zhou,Yitong Zhang,Yan Mi,Kun Yang,Yiming Zhang,Junhao Dong,Zhongxiang Sun,Qiankun Li,Yang Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Omni-modal Large Language Models (OLLMs) greatly expand LLMs’ multimodal capabilities but also introduce cross-modal safety risks. However, a systematic understanding of vulnerabilities in omni-modal interactions remains lacking. To bridge this gap, we establish a modality-semantics decoupling principle and construct the AdvBench-Omni dataset, which reveals a significant vulnerability in OLLMs. Mechanistic analysis uncovers a Mid-layer Dissolution phenomenon driven by refusal vector magnitude shrinkage, alongside the existence of a modal-invariant pure refusal direction. Inspired by these insights, we extract a golden refusal vector using Singular Value Decomposition and propose OmniSteer, which utilizes lightweight adapters to modulate intervention intensity adaptively. Extensive experiments show that our method not only increases the Refusal Success Rate against harmful inputs from 69.9% to 91.2%, but also effectively preserves the general capabilities across all modalities. Our code is available at: this https URL.

[NLP-73] VERA: Identifying and Leverag ing Visual Evidence Retrieval Heads in Long-Context Understanding

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理长上下文和复杂推理任务时性能受限的问题。通过注意力机制分析,研究者识别出一类关键的“视觉证据检索头”(Visual Evidence Retrieval, VER Heads),这类稀疏且动态的注意力头负责在推理过程中定位视觉线索,与静态的光学字符识别(OCR)头有本质区别。实验表明,VER Heads 对模型性能具有因果影响,屏蔽它们会导致显著性能下降。解决方案的关键在于提出一种无需训练的增强框架 VERA(Visual Evidence Retrieval Augmentation),其核心思想是基于模型不确定性(熵)检测来触发对 VER 头所关注视觉证据的显式语言化表达,从而提升模型对长上下文的理解能力,在多个基准测试中平均相对提升达 21.3%(Qwen3-VL-8B-Instruct)和 20.1%(GLM-4.1V-Thinking)。

链接: https://arxiv.org/abs/2602.10146
作者: Rongcan Pei,Huan Li,Fang Guo,Qi Zhu
机构: Tongji University (同济大学); Zhejiang University (浙江大学); Westlake University (西湖大学); Amazon Web Services (亚马逊网络服务)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 12 pages, 12 figures

点击查看摘要

Abstract:While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we identify specific Visual Evidence Retrieval (VER) Heads - a sparse, dynamic set of attention heads critical for locating visual cues during reasoning, distinct from static OCR heads. We demonstrate that these heads are causal to model performance; masking them leads to significant degradation. Leveraging this discovery, we propose VERA (Visual Evidence Retrieval Augmentation), a training-free framework that detects model uncertainty (i.e., entropy) to trigger the explicit verbalization of visual evidence attended by VER heads. Comprehensive experiments demonstrate that VERA significantly improves long-context understanding of open-source VLMs: it yields an average relative improvement of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking across five benchmarks.

[NLP-74] Multimodal Information Fusion for Chart Understanding: A Survey of MLLM s – Evolution Limitations and Cognitive Enhancement

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图表理解(chart understanding)任务中面临的系统性研究碎片化问题,即当前领域缺乏对核心组件、任务分类、数据集架构和方法演进的统一梳理。其解决方案的关键在于构建一个结构化的分析框架:首先识别视觉与文本信息融合的基础挑战;其次提出一种新颖的基准测试分类法(包括标准与非标准基准),以揭示该领域的扩展范围;进而系统回顾从传统深度学习到前沿MLLM范式的演化路径,强调复杂融合策略的应用;最后通过批判性分析现有模型在感知与推理能力上的局限,指出未来方向如高级对齐技术和强化学习驱动的认知增强。这一综述为研究人员提供了清晰的路线图,推动更鲁棒、可靠的图表信息融合系统的研发。

链接: https://arxiv.org/abs/2602.10138
作者: Zhihang Yi,Jian Zhao,Jiancheng Lv,Tao Wang
机构: China Telecom (中国电信); Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Chart understanding is a quintessential information fusion task, requiring the seamless integration of graphical and textual data to extract meaning. The advent of Multimodal Large Language Models (MLLMs) has revolutionized this domain, yet the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap of this nascent frontier by structuring the domain’s core components. We begin by analyzing the fundamental challenges of fusing visual and linguistic information in charts. We then categorize downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the field’s expanding scope. Subsequently, we present a comprehensive evolution of methodologies, tracing the progression from classic deep learning techniques to state-of-the-art MLLM paradigms that leverage sophisticated fusion strategies. By critically examining the limitations of current models, particularly their perceptual and reasoning deficits, we identify promising future directions, including advanced alignment techniques and reinforcement learning for cognitive enhancement. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion and to catalyze progress toward more robust and reliable systems.

[NLP-75] Reverse-Engineering Model Editing on Language Models

【速读】: 该论文旨在解决生成式 AI(Generative AI)模型中基于“定位-编辑”(locate-then-edit)范式的模型编辑方法所存在的隐私泄露问题。这类方法通过修改模型参数实现局部更新,但本文揭示了参数更新本身构成了一个侧信道,可被攻击者利用以逆向重建被编辑的数据内容。解决方案的关键在于提出了一种两阶段的逆向工程攻击方法 KSTER(KeySpace Reconstruction-then-Entropy Reduction),其核心创新包括:第一,理论证明更新矩阵的行空间编码了编辑主体的“指纹”,可通过谱分析实现高精度主体恢复;第二,设计基于熵减的提示重构攻击,从语义层面恢复编辑内容。此外,作者进一步提出子空间伪装(subspace camouflage)防御策略,通过引入语义干扰项混淆更新指纹,在不损害编辑功能的前提下有效降低数据重建风险。

链接: https://arxiv.org/abs/2602.10134
作者: Zhiyu Sun,Minrui Luo,Yu Wang,Zhili Chen,Tianxing He
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named \textitKSTER (\textbfKey\textbfSpaceRecons\textbfTruction-then-\textbfEntropy\textbfReduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a ``fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose \textitsubspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at this https URL.

[NLP-76] Large Language Models Predict Functional Outcomes after Acute Ischemic Stroke

【速读】: 该论文旨在解决急性缺血性脑卒中患者功能预后(以改良Rankin量表,mRS评分衡量)预测的准确性问题,传统方法依赖结构化变量(如年龄、NIHSS评分)和经典机器学习模型,存在数据提取繁琐、信息利用不充分等局限。其解决方案的关键在于探索大型语言模型(LLMs)从常规入院记录文本中直接推断mRS评分的能力,通过对比编码器型(BERT、NYUTron)与生成式(Llama-3.1-8B、MedGemma-4B)LLM在冻结和微调两种设置下的表现,发现微调后的Llama模型在90天mRS预测上达到33.9%的7分类准确率和76.3%的二分类准确率,性能与基于结构化变量的基准模型相当,表明无需人工提取结构化特征即可实现可靠预后预测,为临床流程中无缝集成文本驱动的预后工具提供了实证支持。

链接: https://arxiv.org/abs/2602.10119
作者: Anjali K. Kapoor(1),Anton Alyakin(1,2,3),Jin Vivian Lee(1,2,3),Eunice Yang(1,4),Annelene M. Schulze(1),Krithik Vishwanath(5),Jinseok Lee(2,6),Yindalon Aphinyanaphongs(7,8),Howard Riina(1,9),Jennifer A. Frontera(10),Eric Karl Oermann(1,2,8,11) ((1) Department of Neurosurgery, NYU Langone Health, New York, USA (2) Global AI Frontier Lab, New York University, Brooklyn, USA (3) Department of Neurosurgery, Washington University in Saint Louis, Saint Louis, USA (4) Columbia University Vagelos College of Physicians and Surgeons, New York, USA (5) Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, Austin, USA (6) Department of Biomedical Engineering, Kyung Hee University, Yongin, South Korea (7) Department of Population Health, NYU Langone Health, New York, USA (8) Division of Applied AI Technologies, NYU Langone Health, New York, USA (9) Department of Radiology, NYU Langone Health, New York, USA (10) Department of Neurology, NYU Langone Health, New York, USA (11) Center for Data Science, New York University, New York, USA)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Accurate prediction of functional outcomes after acute ischemic stroke can inform clinical decision-making and resource allocation. Prior work on modified Rankin Scale (mRS) prediction has relied primarily on structured variables (e.g., age, NIHSS) and conventional machine learning. The ability of large language models (LLMs) to infer future mRS scores directly from routine admission notes remains largely unexplored. We evaluated encoder (BERT, NYUTron) and generative (Llama-3.1-8B, MedGemma-4B) LLMs, in both frozen and fine-tuned settings, for discharge and 90-day mRS prediction using a large, real-world stroke registry. The discharge outcome dataset included 9,485 History and Physical notes and the 90-day outcome dataset included 1,898 notes from the NYU Langone Get With The Guidelines-Stroke registry (2016-2025). Data were temporally split with the most recent 12 months held out for testing. Performance was assessed using exact (7-class) mRS accuracy and binary functional outcome (mRS 0-2 vs. 3-6) accuracy and compared against established structured-data baselines incorporating NIHSS and age. Fine-tuned Llama achieved the highest performance, with 90-day exact mRS accuracy of 33.9% [95% CI, 27.9-39.9%] and binary accuracy of 76.3% [95% CI, 70.7-81.9%]. Discharge performance reached 42.0% [95% CI, 39.0-45.0%] exact accuracy and 75.0% [95% CI, 72.4-77.6%] binary accuracy. For 90-day prediction, Llama performed comparably to structured-data baselines. Fine-tuned LLMs can predict post-stroke functional outcomes from admission notes alone, achieving performance comparable to models requiring structured variable abstraction. Our findings support the development of text-based prognostic tools that integrate seamlessly into clinical workflows without manual data extraction.

[NLP-77] Reviewing the Reviewer: Elevating Peer Review Quality through LLM -Guided Feedback

【速读】: 该论文旨在解决同行评审中因依赖简单启发式(lazy thinking)而导致的质量下降问题,特别是现有方法将懒惰思维检测视为单标签任务,忽略了审稿段落可能同时存在多种问题(如清晰度不足或具体性缺失),且缺乏基于指南的可操作反馈。其解决方案的关键在于提出一个由大语言模型(LLM)驱动的框架:首先将审稿文本分解为论证片段,然后通过神经符号模块(结合LLM特征与传统分类器)识别多类问题,最后利用遗传算法优化的问题特定模板生成精准反馈。该方法显著优于零样本LLM基线,并使审稿质量提升达92.4%。

链接: https://arxiv.org/abs/2602.10118
作者: Sukannya Purkayastha,Qile Wan,Anne Lauscher,Lizhen Qu,Iryna Gurevych
机构: Ubiquitous Knowledge Processing Lab (UKP Lab); Technical University of Darmstadt (达姆施塔特工业大学); National Research Center for Applied Cybersecurity ATHENE (应用网络安全国家研究中心 ATHENE); Monash University (蒙纳士大学); University of Hamburg (汉堡大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 39 pages, 22 figures, 29 tables

点击查看摘要

Abstract:Peer review is central to scientific quality, yet reliance on simple heuristics – lazy thinking – has lowered standards. Prior work treats lazy thinking detection as a single-label task, but review segments may exhibit multiple issues, including broader clarity problems, or specificity issues. Turning detection into actionable improvements requires guideline-aware feedback, which is currently missing. We introduce an LLM-driven framework that decomposes reviews into argumentative segments, identifies issues via a neurosymbolic module combining LLM features with traditional classifiers, and generates targeted feedback using issue-specific templates refined by a genetic algorithm. Experiments show our method outperforms zero-shot LLM baselines and improves review quality by up to 92.4%. We also release LazyReviewPlus, a dataset of 1,309 sentences labeled for lazy thinking and specificity.

[NLP-78] RE-LLM : Refining Empathetic Speech-LLM Responses by Integrating Emotion Nuance

【速读】: 该论文旨在解决当前生成式 AI(Generative AI)在人机交互中情感理解不足的问题,特别是忽视了情绪探索(emotional exploration)这一关键维度,导致用户与AI之间的互动缺乏深度共情。现有大语言模型(LLM)主要依赖文本输入,难以捕捉情绪的细微差异。为此,作者提出RE-LLM——一种融合维度化情绪嵌入(dimensional emotion embeddings)和辅助学习机制的语音-大语言模型(speech-LLM),通过引入多模态语音特征增强情绪表征能力,从而提升AI对用户情绪状态的理解深度与响应的共情水平。其核心创新在于将情绪空间建模为连续维度并结合语音信号进行联合训练,显著提升了情绪反应(Emotional Reaction)与情绪探索(Exploration)等关键指标,在多个公开数据集上均取得统计显著性改进。

链接: https://arxiv.org/abs/2602.10716
作者: Jing-Han Chen,Bo-Hao Su,Ya-Tse Wu,Chi-Chun Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 5 pages, 1 figure, 2 tables. Accepted at IEEE ASRU 2025

点击查看摘要

Abstract:With generative AI advancing, empathy in human-AI interaction is essential. While prior work focuses on emotional reflection, emotional exploration, key to deeper engagement, remains overlooked. Existing LLMs rely on text which captures limited emotion nuances. To address this, we propose RE-LLM, a speech-LLM integrating dimensional emotion embeddings and auxiliary learning. Experiments show statistically significant gains in empathy metrics across three datasets. RE-LLM relatively improves the Emotional Reaction score by 14.79% and 6.76% compared to text-only and speech-LLM baselines on ESD. Notably, it raises the Exploration score by 35.42% and 3.91% on IEMOCAP, 139.28% and 9.83% on ESD, and 60.95% and 22.64% on MSP-PODCAST. It also boosts unweighted accuracy by 5.4% on IEMOCAP, 2.3% on ESD, and 6.9% on MSP-PODCAST in speech emotion recognition. These results highlight the enriched emotional understanding and improved empathetic response generation of RE-LLM.

[NLP-79] AntigenLM: Structure-Aware DNA Language Modeling for Influenza ICLR2026

【速读】: 该论文旨在解决当前DNA基础模型在特定任务上表现落后于专用方法的问题,其核心挑战在于缺乏对基因组功能单元结构的合理建模。解决方案的关键在于提出AntigenLM——一个基于流感病毒基因组进行预训练的生成式DNA语言模型,该模型保留了完整且对齐的功能单元(functional units)结构,从而能够捕捉进化约束并实现跨任务泛化。实验表明,这种结构感知的预训练策略显著提升了抗原变异预测和亚型分类性能,且消融实验证明破坏基因组结构会严重损害模型效果,凸显了保持功能单元完整性在DNA语言建模中的关键作用。

链接: https://arxiv.org/abs/2602.09067
作者: Yue Pei,Xuebin Chi,Yu Kang
机构: Computer Network Information Center, Chinese Academy of Sciences (中国科学院计算机网络信息中心); Beijing Institute of Genomics, Chinese Academy of Sciences (中国科学院北京基因组研究所); China National Center for Bioinformation (国家生物信息中心); University of Chinese Academy of Sciences (中国科学院大学)
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.

信息检索

[IR-0] Diffusion-Pretrained Dense and Contextual Embeddings

【速读】:该论文旨在解决大规模多语言文档检索中嵌入表示质量不足的问题,尤其是在长文档场景下如何有效保留全局上下文信息。解决方案的关键在于基于扩散预训练的语言模型主干(diffusion-pretrained language model backbone)采用多阶段对比学习(multi-stage contrastive learning),通过扩散机制获得双向注意力能力,从而在段落内捕获更全面的双向上下文;同时结合均值池化(mean pooling)与晚期分块策略(late chunking),显著提升长文档的全局语义一致性,使嵌入向量更好地适应真实世界的大规模搜索任务。

链接: https://arxiv.org/abs/2602.11151
作者: Sedigheh Eslami,Maksim Gaiduk,Markus Krimmel,Louis Milliken,Bo Wang,Denis Bykov
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, which focuses on real-world, large-scale search scenarios over tens of millions of documents. These results validate the models’ effectiveness in production environments where retrieval quality and efficiency are critical at scale.

[IR-1] MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation AAAI2026

【速读】:该论文旨在解决推荐系统中因数据稀疏性和物品冷启动(item cold-start)问题导致的性能下降难题,尤其是对于缺乏交互历史的新物品。其解决方案的关键在于将多模态推荐转化为离散语义标记化(discrete semantic tokenization),提出了一种基于稀疏正则化的残差量化变分自编码器(sparsely-regularized Residual Quantized Variational Autoencoder, RQ-VAE)的框架 MoToRec,通过生成可组合、可解释的离散语义标记来促进解耦表示(disentangled representations)。该方法还引入了三个协同组件:稀疏正则化RQ-VAE以增强表示解耦性、自适应稀有度放大机制以优先学习冷启动物品,以及分层多源图编码器以融合协同信号与多模态信息,从而有效缓解冷启动挑战并提升整体推荐性能。

链接: https://arxiv.org/abs/2602.11062
作者: Jialin Liu,Zhaorui Zhang,Ray C.C. Cheung
机构: 未知
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注: Accepted to AAAI 2026 (Main Track)

点击查看摘要

Abstract:Graph neural networks (GNNs) have revolutionized recommender systems by effectively modeling complex user-item interactions, yet data sparsity and the item cold-start problem significantly impair performance, particularly for new items with limited or no interaction history. While multimodal content offers a promising solution, existing methods result in suboptimal representations for new items due to noise and entanglement in sparse data. To address this, we transform multimodal recommendation into discrete semantic tokenization. We present Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation (MoToRec), a framework centered on a sparsely-regularized Residual Quantized Variational Autoencoder (RQ-VAE) that generates a compositional semantic code of discrete, interpretable tokens, promoting disentangled representations. MoToRec’s architecture is enhanced by three synergistic components: (1) a sparsely-regularized RQ-VAE that promotes disentangled representations, (2) a novel adaptive rarity amplification that promotes prioritized learning for cold-start items, and (3) a hierarchical multi-source graph encoder for robust signal fusion with collaborative signals. Extensive experiments on three large-scale datasets demonstrate MoToRec’s superiority over state-of-the-art methods in both overall and cold-start scenarios. Our work validates that discrete tokenization provides an effective and scalable alternative for mitigating the long-standing cold-start challenge.

[IR-2] GraphSeek: Next-Generation Graph Analytics with LLM s

【速读】:该论文旨在解决大型、异构且动态演化的属性图(property graph)在使用过程中对专业知识依赖性强的问题,尤其是在利用大语言模型(LLM)进行自然语言(NL)图分析时效率低、效果差的挑战。其解决方案的关键在于提出一种新颖的抽象架构,通过将图分析任务划分为两个平面:一是语义平面(Semantic Plane),用于LLM基于语义目录(Semantic Catalog)进行规划和推理,该目录描述了图模式与操作;二是执行平面(Execution Plane),负责在全量数据上以数据库级精度执行确定性查询。这种分离设计显著提升了token效率和任务成功率,即使在小上下文LLM条件下也能实现高效准确的多查询图分析,从而推动了下一代低成本、易访问的图分析系统的发展。

链接: https://arxiv.org/abs/2602.11052
作者: Maciej Besta,Łukasz Jarmocik,Orest Hrycyna,Shachar Klaiman,Konrad Mączka,Robert Gerstenberger,Jürgen Müller,Piotr Nyczyk,Hubert Niewiadomski,Torsten Hoefler
机构: ETH Zurich (苏黎世联邦理工学院); IDEAS Research Institute (IDEAS 研究所); Cledar (Cledar); NCBJ Warszawa (波兰核子研究中心); BASF SE (巴斯夫公司)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Graphs are foundational across domains but remain hard to use without deep expertise. LLMs promise accessible natural language (NL) graph analytics, yet they fail to process industry-scale property graphs effectively and efficiently: such datasets are large, highly heterogeneous, structurally complex, and evolve dynamically. To address this, we devise a novel abstraction for complex multi-query analytics over such graphs. Its key idea is to replace brittle generation of graph queries directly from NL with planning over a Semantic Catalog that describes both the graph schema and the graph operations. Concretely, this induces a clean separation between a Semantic Plane for LLM planning and broader reasoning, and an Execution Plane for deterministic, database-grade query execution over the full dataset and tool implementations. This design yields substantial gains in both token efficiency and task effectiveness even with small-context LLMs. We use this abstraction as the basis of the first LLM-enhanced graph analytics framework called GraphSeek. GraphSeek achieves substantially higher success rates (e.g., 86% over enhanced LangChain) and points toward the next generation of affordable and accessible graph analytics that unify LLM reasoning with database-grade execution over large and complex property graphs.

[IR-3] raining-Induced Bias Toward LLM -Generated Content in Dense Retrieval ECIR2026

【速读】:该论文旨在解决生成式 AI(Generative AI)在密集检索(dense retrieval)任务中引发的“源偏倚”(source bias)问题,即模型对大语言模型(LLM)生成文本的偏好是否源于其低困惑度(perplexity)这一假设。研究通过受控实验设计,系统评估了不同训练阶段与数据来源对检索偏好变化的影响,关键在于使用平行的人类与LLM生成语料(SciFact 和 Natural Questions 数据集),对比无监督预训练模型与基于领域内人类文本、LLM生成文本及 MS MARCO 的监督微调效果。结果表明:源偏倚并非密集检索器固有属性,而是由训练过程诱导产生,尤其是微调阶段采用 LLM 生成语料时会显著强化对 LLM 文本的偏好,且困惑度指标无法有效解释该现象,从而揭示了源偏倚的本质为训练驱动而非内容特征驱动。

链接: https://arxiv.org/abs/2602.10833
作者: William Xion,Wolfgang Nejdl
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted at ECIR 2026

点击查看摘要

Abstract:Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim a broad preference for text generated by large language models (LLMs). This bias is called “source bias”, and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference. The direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe involving the reattachment of a language modeling head to the fine-tuned dense retriever encoder indicates agreement with relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.

[IR-4] EST: Towards Efficient Scaling Laws in Click-Through Rate Prediction via Unified Modeling

【速读】:该论文旨在解决工业级点击率(Click-Through Rate, CTR)预测模型在规模化扩展时面临的效率与性能瓶颈问题。现有方法通常采用早期聚合用户行为数据以维持计算效率,但这种非统一或部分统一的建模方式会因丢弃细粒度的token级信号而造成信息瓶颈,限制了模型的可扩展性。解决方案的关键在于提出一种全统一建模架构——高效可扩展Transformer(Efficiently Scalable Transformer, EST),其核心创新包括两个模块:轻量级交叉注意力(Lightweight Cross-Attention, LCA)通过剪枝冗余自交互来聚焦高影响的跨特征依赖关系;内容稀疏注意力(Content Sparse Attention, CSA)则利用内容相似性动态筛选高信号行为序列。EST实现了对所有原始输入的单序列处理,避免了损失性聚合,在Taobao广告平台部署中显著优于基线模型,带来3.27%的每千次展示收入(RPM)提升和1.22%的CTR增长,验证了其在大规模工业场景下的有效性与可扩展性。

链接: https://arxiv.org/abs/2602.10811
作者: Mingyang Liu,Yong Bai,Zhangming Chan,Sishuo Chen,Xiang-Rong Sheng,Han Zhu,Jian Xu,Xinyang Chen
机构: Taobao & Tmall Group of Alibaba(淘宝与天猫集团)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Efficiently scaling industrial Click-Through Rate (CTR) prediction has recently attracted significant research attention. Existing approaches typically employ early aggregation of user behaviors to maintain efficiency. However, such non-unified or partially unified modeling creates an information bottleneck by discarding fine-grained, token-level signals essential for unlocking scaling gains. In this work, we revisit the fundamental distinctions between CTR prediction and Large Language Models (LLMs), identifying two critical properties: the asymmetry in information density between behavioral and non-behavioral features, and the modality-specific priors of content-rich signals. Accordingly, we propose the Efficiently Scalable Transformer (EST), which achieves fully unified modeling by processing all raw inputs in a single sequence without lossy aggregation. EST integrates two modules: Lightweight Cross-Attention (LCA), which prunes redundant self-interactions to focus on high-impact cross-feature dependencies, and Content Sparse Attention (CSA), which utilizes content similarity to dynamically select high-signal behaviors. Extensive experiments show that EST exhibits a stable and efficient power-law scaling relationship, enabling predictable performance gains with model scale. Deployed on Taobao’s display advertising platform, EST significantly outperforms production baselines, delivering a 3.27% RPM (Revenue Per Mile) increase and a 1.22% CTR lift, establishing a practical pathway for scalable industrial CTR prediction models.

[IR-5] DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

【速读】:该论文旨在解决现有多模态检索系统在处理真实视觉流时的局限性问题,即当前方法通常假设查询与图像的相关性可独立评估,忽略了视觉序列中固有的时空依赖关系。为应对这一挑战,论文提出了一种基于智能体(agent)的新范式——DeepImageSearch,其核心在于将图像检索重构为一个自主探索任务,要求模型对原始视觉历史进行多步推理以定位目标。关键创新包括:构建了一个由相互关联视觉数据组成的挑战性基准DISBench,设计了一种人-模型协同的流水线来挖掘潜在的时空关联以生成上下文依赖的查询,并开发了一个具备细粒度工具和双记忆系统的模块化智能体基线,从而支持长程导航。实验表明,DISBench显著提升了现有模型的难度,验证了引入智能体推理对于下一代检索系统的重要性。

链接: https://arxiv.org/abs/2602.10809
作者: Chenlong Deng,Mengjie Deng,Junjie Wu,Dun Zeng,Teng Wang,Qingsong Xie,Jiadeng Huang,Shengjie Ma,Changwang Zhang,Zhaoxiang Wang,Jun Wang,Yutao Zhu,Zhicheng Dou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.

[IR-6] VulReaD: Knowledge-Graph-guided Software Vulnerability Reasoning and Detection

【速读】:该论文旨在解决软件漏洞检测(Software Vulnerability Detection, SVD)中现有大语言模型(Large Language Models, LLMs)方法在漏洞分类粒度上局限于二分类、且生成的解释与通用弱弱点枚举(Common Weakness Enumeration, CWE)类别语义不一致的问题。解决方案的关键在于提出 VulReaD,一个基于安全知识图谱(Security Knowledge Graph, KG)引导的漏洞推理与检测框架,通过强教师模型生成与 CWE 一致的对比推理监督信号,使学生模型无需人工标注即可训练;同时采用几率比偏好优化(Odds Ratio Preference Optimization, ORPO)对齐漏洞分类体系,抑制无依据的解释,从而显著提升多类 CWE 分类性能和解释的可解释性与一致性。

链接: https://arxiv.org/abs/2602.10787
作者: Samal Mukhtar,Yinghua Yao,Zhu Sun,Mustafa Mustafa,Yew Soon Ong,Youcheng Sun
机构: The University of Manchester (曼彻斯特大学); Agency for Science, Technology and Research (新加坡科技研究局); Singapore University of Technology and Design (新加坡科技设计大学); Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注: 22 pages, 3 figures

点击查看摘要

Abstract:Software vulnerability detection (SVD) is a critical challenge in modern systems. Large language models (LLMs) offer natural-language explanations alongside predictions, but most work focuses on binary evaluation, and explanations often lack semantic consistency with Common Weakness Enumeration (CWE) categories. We propose VulReaD, a knowledge-graph-guided approach for vulnerability reasoning and detection that moves beyond binary classification toward CWE-level reasoning. VulReaD leverages a security knowledge graph (KG) as a semantic backbone and uses a strong teacher LLM to generate CWE-consistent contrastive reasoning supervision, enabling student model training without manual annotations. Students are fine-tuned with Odds Ratio Preference Optimization (ORPO) to encourage taxonomy-aligned reasoning while suppressing unsupported explanations. Across three real-world datasets, VulReaD improves binary F1 by 8-10% and multi-class classification by 30% Macro-F1 and 18% Micro-F1 compared to state-of-the-art baselines. Results show that LLMs outperform deep learning baselines in binary detection and that KG-guided reasoning enhances CWE coverage and interpretability.

[IR-7] Equity by Design: Fairness-Driven Recommendation in Heterogeneous Two-Sided Markets

【速读】:该论文旨在解决双边市场平台中公平性与业务目标之间的权衡问题,尤其是在多物品推荐场景下,如何在考虑消费者偏好差异、生产者资源容量异质性及业务约束的前提下实现公平分配。其关键解决方案在于引入条件风险价值(Conditional Value-at-Risk, CVaR)作为消费者侧的优化目标,以压缩群体间效用差距,并将业务约束直接嵌入优化模型中,从而实现从单物品到离散多物品推荐的公平性扩展。实验表明,适度的公平约束可提升商业指标,因能分散对饱和生产者的依赖,且高效求解器可在极低计算成本下逼近精确解,验证了公平性作为促进平台可持续发展的工具而非效率负担的可能性。

链接: https://arxiv.org/abs/2602.10739
作者: Dominykas Seputis,Rajeev Verma,Alexander Timans
机构: University of Amsterdam(阿姆斯特丹大学)
类目: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Two-sided marketplaces embody heterogeneity in incentives: producers seek exposure while consumers seek relevance, and balancing these competing objectives through constrained optimization is now a standard practice. Yet real platforms face finer-grained complexity: consumers differ in preferences and engagement patterns, producers vary in catalog value and capacity, and business objectives impose additional constraints beyond raw relevance. We formalize two-sided fairness under these realistic conditions, extending prior work from soft single-item allocations to discrete multi-item recommendations. We introduce Conditional Value-at-Risk (CVaR) as a consumer-side objective that compresses group-level utility disparities, and integrate business constraints directly into the optimization. Our experiments reveal that the “free fairness” regime, where producer constraints impose no consumer cost, disappears in multi item settings. Strikingly, moderate fairness constraints can improve business metrics by diversifying exposure away from saturated producers. Scalable solvers match exact solutions at a fraction of the runtime, making fairness-aware allocation practical at scale. These findings reframe fairness not as a tax on platform efficiency but as a lever for sustainable marketplace health.

[IR-8] A Cognitive Distribution and Behavior-Consistent Framework for Black-Box Attacks on Recommender Systems

【速读】:该论文旨在解决顺序推荐系统(Sequential Recommender Systems)在黑盒攻击场景下的知识提取不完整与对抗序列语义不一致问题。现有方法多依赖硬标签或成对学习,忽视了排序位置的重要性,导致知识迁移不充分;同时,纯梯度生成的对抗序列缺乏真实用户行为的语义一致性,易被检测。解决方案的关键在于提出一种双增强攻击框架:其一,引入基于首因效应(primacy effects)和位置偏置的认知分布驱动提取机制,将离散排序映射为具有位置感知衰减的连续值分布,实现从顺序对齐到认知分布对齐的跃迁;其二,设计行为感知的噪声物品生成策略,联合优化协同信号与梯度信号,在保证语义一致性的同时提升统计隐蔽性,从而有效提升目标物品的排名。

链接: https://arxiv.org/abs/2602.10633
作者: Hongyue Zhan,Mingming Li,Dongqin Liu,Hui Wang,Yaning Zhang,Xi Zhou,Honglei Lv,Jiao Dai,Jizhong Han
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:With the growing deployment of sequential recommender systems in e-commerce and other fields, their black-box interfaces raise security concerns: models are vulnerable to extraction and subsequent adversarial manipulation. Existing black-box extraction attacks primarily rely on hard labels or pairwise learning, often ignoring the importance of ranking positions, which results in incomplete knowledge transfer. Moreover, adversarial sequences generated via pure gradient methods lack semantic consistency with real user behavior, making them easily detectable. To overcome these limitations, this paper proposes a dual-enhanced attack framework. First, drawing on primacy effects and position bias, we introduce a cognitive distribution-driven extraction mechanism that maps discrete rankings into continuous value distributions with position-aware decay, thereby advancing from order alignment to cognitive distribution alignment. Second, we design a behavior-aware noisy item generation strategy that jointly optimizes collaborative signals and gradient signals. This ensures both semantic coherence and statistical stealth while effectively promoting target item rankings. Extensive experiments on multiple datasets demonstrate that our approach significantly outperforms existing methods in both attack success rate and evasion rate, validating the value of integrating cognitive modeling and behavioral consistency for secure recommender systems.

[IR-9] S-GRec: Personalized Semantic-Aware Generative Recommendation with Asymmetric Advantage

【速读】:该论文旨在解决生成式推荐模型(Generative Recommendation Models)在训练过程中因行为日志提供的监督信号较弱而难以准确捕捉用户意图的问题。现有方法中,虽然大语言模型(Large Language Models, LLMs)能够提供丰富的语义先验以增强监督信号,但其直接应用于工业推荐系统面临两大挑战:一是语义信号可能与平台业务目标冲突,二是LLM推理在大规模场景下计算成本过高。解决方案的关键在于提出S-GRec框架,通过解耦在线轻量级生成器与离线LLM驱动的语义判别器(Personalized Semantic Judge, PSJ),实现训练时的语义监督。其中,PSJ采用两阶段设计,基于成对反馈学习用户条件下的语义聚合策略并输出可解释的细粒度特征证据;同时引入非对称优势策略优化(Asymmetric Advantage Policy Optimization, A2PO),将优化锚定于业务奖励(如eCPM),仅在语义优势与业务目标一致时注入语义优势,从而兼顾语义合理性与商业目标一致性。该方案在公开基准和大规模线上系统中均验证了有效性与可扩展性。

链接: https://arxiv.org/abs/2602.10606
作者: Jie Jiang,Hongbo Tang,Wenjie Wu,Yangru Huang,Zhenmao Li,Qian Li,Changping Wang,Jun Zhang,Huan Yu
机构: Tencent Inc.(腾讯公司)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative recommendation models sequence generation to produce items end-to-end, but training from behavioral logs often provides weak supervision on underlying user intent. Although Large Language Models (LLMs) offer rich semantic priors that could supply such supervision, direct adoption in industrial recommendation is hindered by two obstacles: semantic signals can conflict with platform business objectives, and LLM inference is prohibitively expensive at scale. This paper presents S-GRec, a semantic-aware framework that decouples an online lightweight generator from an offline LLM-based semantic judge for train-time supervision. S-GRec introduces a two-stage Personalized Semantic Judge (PSJ) that produces interpretable aspect evidence and learns user-conditional aggregation from pairwise feedback, yielding stable semantic rewards. To prevent semantic supervision from deviating from business goals, Asymmetric Advantage Policy Optimization (A2PO) anchors optimization on business rewards (e.g., eCPM) and injects semantic advantages only when they are consistent. Extensive experiments on public benchmarks and a large-scale production system validate both effectiveness and scalability, including statistically significant gains in CTR and a 1.19% lift in GMV in online A/B tests, without requiring real-time LLM inference.

[IR-10] Campaign-2-PT-RAG : LLM -Guided Semantic Product Type Attribution for Scalable Campaign Ranking

【速读】:该论文旨在解决电商广告投放(campaign)排序模型中缺乏大规模高质量训练标签的问题,即如何准确标注用户购买行为是否由特定广告活动引发。由于广告文案通常使用创意性、主题化的语言,难以直接映射到具体产品品类(Product Type, PT),导致传统监督学习方法在广告优化中受限。解决方案的关键在于提出一种名为Campaign-2-PT-RAG的可扩展标签生成框架:首先利用大语言模型(Large Language Model, LLM)解析广告内容以捕捉隐含意图,随后通过语义搜索从平台商品分类体系中检索候选产品品类,再由结构化LLM分类器评估每个品类的相关性,最终基于匹配的品类生成用户购买正样本标签。该方法将模糊的归因问题转化为可控的语义对齐任务,在保持99%召回率的同时实现78–90%的标签精度,显著提升了下游广告排序模型的训练质量与部署可行性。

链接: https://arxiv.org/abs/2602.10577
作者: Yiming Che,Mansi Mane,Keerthi Gopalakrishnan,Parisa Kaghazgaran,Murali Mohana Krishna Dandu,Archana Venkatachalapathy,Sinduja Subramaniam,Yokila Arora,Evren Korpeoglu,Sushant Kumar,Kannan Achan
机构: Walmart Global Tech(沃尔玛全球科技)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:E-commerce campaign ranking models require large-scale training labels indicating which users purchased due to campaign influence. However, generating these labels is challenging because campaigns use creative, thematic language that does not directly map to product purchases. Without clear product-level attribution, supervised learning for campaign optimization remains limited. We present \textbfCampaign-2-PT-RAG, a scalable label generation framework that constructs user–campaign purchase labels by inferring which product types (PTs) each campaign promotes. The framework first interprets campaign content using large language models (LLMs) to capture implicit intent, then retrieves candidate PTs through semantic search over the platform taxonomy. A structured LLM-based classifier evaluates each PT’s relevance, producing a campaign-specific product coverage set. User purchases matching these PTs generate positive training labels for downstream ranking models. This approach reframes the ambiguous attribution problem into a tractable semantic alignment task, enabling scalable and consistent supervision for downstream tasks such as campaign ranking optimization in production e-commerce environments. Experiments on internal and synthetic datasets, validated against expert-annotated campaign–PT mappings, show that our LLM-assisted approach generates high-quality labels with 78–90% precision while maintaining over 99% recall.

[IR-11] Boundary-Aware Multi-Behavior Dynamic Graph Transformer for Sequential Recommendation

【速读】:该论文旨在解决现有推荐系统模型在建模用户偏好时,难以同时捕捉用户-物品交互图结构的动态变化与用户行为序列模式的问题,以及在模型优化过程中无法有效区分多种用户行为边界导致个性化学习不足的局限。其解决方案的关键在于提出一种边界感知的多行为动态图Transformer(MB-DGT)模型,通过引入基于Transformer的动态图聚合器来融合不断演变的图结构和用户行为序列,从而构建更全面、动态的用户偏好表示;同时,在模型优化阶段设计了一种用户特定的多行为损失函数,显式划分不同行为类型之间的兴趣边界,增强个性化学习能力。

链接: https://arxiv.org/abs/2602.10493
作者: Jingsong Su,Xuetao Ma,Mingming Li,Qiannan Zhu,Yu Guo
机构: Beijing Normal University (北京师范大学); JD.com (京东)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In the landscape of contemporary recommender systems, user-item interactions are inherently dynamic and sequential, often characterized by various behaviors. Prior research has explored the modeling of user preferences through sequential interactions and the user-item interaction graph, utilizing advanced techniques such as graph neural networks and transformer-based architectures. However, these methods typically fall short in simultaneously accounting for the dynamic nature of graph topologies and the sequential pattern of interactions in user preference models. Moreover, they often fail to adequately capture the multiple user behavior boundaries during model optimization. To tackle these challenges, we introduce a boundary-aware Multi-Behavioral Dynamic Graph Transformer (MB-DGT) model that dynamically refines the graph structure to reflect the evolving patterns of user behaviors and interactions. Our model involves a transformer-based dynamic graph aggregator for user preference modeling, which assimilates the changing graph structure and the sequence of user behaviors. This integration yields a more comprehensive and dynamic representation of user preferences. For model optimization, we implement a user-specific multi-behavior loss function that delineates the interest boundaries among different behaviors, thereby enriching the personalized learning of user preferences. Comprehensive experiments across three datasets indicate that our model consistently delivers remarkable recommendation performance.

[IR-12] ChainRec: An Agent ic Recommender Learning to Route Tool Chains for Diverse and Evolving Interests

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的推荐系统中,现有方法普遍依赖固定推理流程、缺乏对用户上下文动态适应能力的问题。在冷启动或兴趣演变等复杂场景下,静态工作流难以有效获取所需证据,导致推荐性能受限。解决方案的关键在于提出ChainRec——一个具有规划能力的代理推荐系统,其核心创新是构建标准化工具代理库(Tool Agent Library),并通过监督微调与偏好优化训练一个规划器(planner),实现动态选择推理工具、决定执行顺序以及适时终止推理过程,从而在不同用户情境下自适应地调整推荐策略。实验表明,该方法在多个真实数据集上的平均命中率(Avg HR@1,3,5)显著优于强基线,尤其在冷启动和兴趣演化场景中提升明显。

链接: https://arxiv.org/abs/2602.10490
作者: Fuchun Li,Qian Li,Xingyu Gao,Bocheng Pan,Yang Wu,Jun Zhang,Huan Yu,Jie Jiang,Jinsheng Xiao,Hailong Shi
机构: Chinese Academy of Sciences(中国科学院); Tencent(腾讯); Wuhan University(武汉大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly integrated into recommender systems, motivating recent interest in agentic and reasoning-based recommendation. However, most existing approaches still rely on fixed workflows, applying the same reasoning procedure across diverse recommendation scenarios. In practice, user contexts vary substantially-for example, in cold-start settings or during interest shifts, so an agent should adaptively decide what evidence to gather next rather than following a scripted process. To address this, we propose ChainRec, an agentic recommender that uses a planner to dynamically select reasoning tools. ChainRec builds a standardized Tool Agent Library from expert trajectories. It then trains a planner using supervised fine-tuning and preference optimization to dynamically select tools, decide their order, and determine when to stop. Experiments on AgentRecBench across Amazon, Yelp, and Goodreads show that ChainRec consistently improves Avg HR@1,3,5 over strong baselines, with especially notable gains in cold-start and evolving-interest scenarios. Ablation studies further validate the importance of tool standardization and preference-optimized planning.

[IR-13] Compute Only Once: UG-Separation for Efficient Large Recommendation Models

【速读】:该论文旨在解决大规模推荐系统中密集特征交互模型(如RankMixer)因用户与候选物品特征深度耦合而导致无法复用用户侧计算资源的问题,从而造成高昂的推理成本。其核心解决方案是提出User-Group Separation (UG-Sep) 框架,通过引入一种掩码机制,在token混合层中显式分离用户侧与物品侧的信息流,确保部分token仅保留纯用户侧表示并可在多样本间复用,显著降低冗余推理开销;同时设计信息补偿策略以恢复因掩码导致的表达能力损失,并结合W8A16量化技术缓解内存带宽瓶颈,最终在字节跳动的大规模在线实验中实现最高20%的推理延迟下降,且不影响用户体验和商业指标。

链接: https://arxiv.org/abs/2602.10455
作者: Hui Lu,Zheng Chai,Shipeng Bai,Hao Zhang,Zhifang Fan,Kunmin Bai,Yingwen Wu,Bingzheng Wei,Xiang Sun,Ziyan Gong,Tianyi Liu,Hua Chen,Deping Xie,Zhongkai Chen,Zhiliang Guo,Qiwei Chen,Yuchao Zheng
机构: ByteDance(字节跳动)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Large Recommender Model, Industrial Recommenders, Scaling Law

点击查看摘要

Abstract:Driven by scaling laws, recommender systems increasingly rely on large-scale models to capture complex feature interactions and user behaviors, but this trend also leads to prohibitive training and inference costs. While long-sequence models(e.g., LONGER) can reuse user-side computation through KV caching, such reuse is difficult in dense feature interaction architectures(e.g., RankMixer), where user and group (candidate item) features are deeply entangled across layers. In this work, we propose User-Group Separation (UG-Sep), a novel framework that enables reusable user-side computation in dense interaction models for the first time. UG-Sep introduces a masking mechanism that explicitly disentangles user-side and item-side information flows within token-mixing layers, ensuring that a subset of tokens to preserve purely user-side representations across layers. This design enables corresponding token computations to be reused across multiple samples, significantly reducing redundant inference cost. To compensate for potential expressiveness loss induced by masking, we further propose an Information Compensation strategy that adaptively reconstructs suppressed user-item interactions. Moreover, as UG-Sep substantially reduces user-side FLOPs and exposes memory-bound components, we incorporate W8A16 (8-bit weight, 16-bit activation) weight-only quantization to alleviate memory bandwidth bottlenecks and achieve additional acceleration. We conduct extensive offline evaluations and large-scale online A/B experiments at ByteDance, demonstrating that UG-Sep reduces inference latency by up to 20 percent without degrading online user experience or commercial metrics across multiple business scenarios, including feed recommendation and advertising systems.

[IR-14] End-to-End Semantic ID Generation for Generative Advertisement Recommendation

【速读】:该论文旨在解决生成式推荐(Generative Recommendation)中基于残差量化(Residual Quantization, RQ)的语义 ID(Semantic ID, SID)生成方法所存在的两个核心问题:一是由于两阶段压缩导致的目标函数错位和语义退化;二是RQ结构 inherent 的误差累积问题。解决方案的关键在于提出UniSID框架,通过端到端联合优化广告原始数据的嵌入表示与SID,使语义信息直接流入SID空间,从而打破传统两阶段压缩的局限性;同时引入多粒度对比学习策略以对齐不同SID层级上的细粒度语义,并设计基于摘要的广告重建机制,促使SID捕获广告上下文中未显式表达的高层语义信息,显著提升推荐效果。

链接: https://arxiv.org/abs/2602.10445
作者: Jie Jiang,Xinxun Zhang,Enming Zhang,Yuling Xiong,Jun Zhang,Jingwen Wang,Huan Yu,Yuxiang Wang,Hao Wang,Xiao Yan,Jiawei Jiang
机构: Tencent Inc.(腾讯公司); Wuhan University(武汉大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative Recommendation (GR) has excelled by framing recommendation as next-token prediction. This paradigm relies on Semantic IDs (SIDs) to tokenize large-scale items into discrete sequences. Existing GR approaches predominantly generate SIDs via Residual Quantization (RQ), where items are encoded into embeddings and then quantized to discrete SIDs. However, this paradigm suffers from inherent limitations: 1) Objective misalignment and semantic degradation stemming from the two-stage compression; 2) Error accumulation inherent in the structure of RQ. To address these limitations, we propose UniSID, a Unified SID generation framework for generative advertisement recommendation. Specifically, we jointly optimize embeddings and SIDs in an end-to-end manner from raw advertising data, enabling semantic information to flow directly into the SID space and thus addressing the inherent limitations of the two-stage cascading compression paradigm. To capture fine-grained semantics, a multi-granularity contrastive learning strategy is introduced to align distinct items across SID levels. Finally, a summary-based ad reconstruction mechanism is proposed to encourage SIDs to capture high-level semantic information that is not explicitly present in advertising contexts. Experiments demonstrate that UniSID consistently outperforms state-of-the-art SID generation methods, yielding up to a 4.62% improvement in Hit Rate metrics across downstream advertising scenarios compared to the strongest baseline.

[IR-15] Chamfer-Linkage for Hierarchical Agglomerative Clustering

【速读】:该论文旨在解决传统层次聚类(Hierarchical Agglomerative Clustering, HAC)中因链接函数(linkage function)选择不当而导致聚类质量不稳定的问题。现有经典链接函数如单链接(single-linkage)、平均链接(average-linkage)和Ward方法在真实数据集上表现出高变异性,且无法 consistently 生成高质量聚类结果。解决方案的关键在于提出一种新的链接函数——Chamfer链接(Chamfer-linkage),该方法利用点云距离度量中的Chamfer距离来衡量簇间距离,从而更好地满足理想的簇表示性质(concept representation properties)。理论分析表明,Chamfer-linkage HAC可在 O(n2)O(n^2) 时间内实现,与传统链接函数效率相当;实验验证其在多种数据集上均显著优于平均链接和Ward方法,展现出更强的鲁棒性和聚类质量,可作为经典链接函数的直接替代方案。

链接: https://arxiv.org/abs/2602.10444
作者: Kishen N Gowda,Willem Fletcher,MohammadHossein Bateni,Laxman Dhulipala,D Ellis Hershkowitz,Rajesh Jayaram,Jakub Łącki
机构: University of Maryland College Park (马里兰大学学院公园分校); Brown University (布朗大学); Google Research (谷歌研究院)
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Hierarchical Agglomerative Clustering (HAC) is a widely-used clustering method based on repeatedly merging the closest pair of clusters, where inter-cluster distances are determined by a linkage function. Unlike many clustering methods, HAC does not optimize a single explicit global objective; clustering quality is therefore primarily evaluated empirically, and the choice of linkage function plays a crucial role in practice. However, popular classical linkages, such as single-linkage, average-linkage and Ward’s method show high variability across real-world datasets and do not consistently produce high-quality clusterings in practice. In this paper, we propose \emphChamfer-linkage, a novel linkage function that measures the distance between clusters using the Chamfer distance, a popular notion of distance between point-clouds in machine learning and computer vision. We argue that Chamfer-linkage satisfies desirable concept representation properties that other popular measures struggle to satisfy. Theoretically, we show that Chamfer-linkage HAC can be implemented in O(n^2) time, matching the efficiency of classical linkage functions. Experimentally, we find that Chamfer-linkage consistently yields higher-quality clusterings than classical linkages such as average-linkage and Ward’s method across a diverse collection of datasets. Our results establish Chamfer-linkage as a practical drop-in replacement for classical linkage functions, broadening the toolkit for hierarchical clustering in both theory and practice. Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR) Cite as: arXiv:2602.10444 [cs.LG] (or arXiv:2602.10444v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.10444 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-16] GeoGR: A Generative Retrieval Framework for Spatio-Temporal Aware POI Recommendation

【速读】:该论文旨在解决大规模导航类位置服务(Location-Based Services, LBS)中下一兴趣点(Next Point-of-Interest, POI)预测的挑战,尤其是在复杂、稀疏的真实环境中,现有基于会话ID(Session ID, SID)的推荐方法存在两个关键瓶颈:一是难以建模高质量的SID以捕捉跨类别时空协同关系;二是大型语言模型(Large Language Models, LLMs)与POI推荐任务之间缺乏有效对齐。解决方案的核心在于提出GeoGR框架,其关键创新包括:(1) 一种地理感知的SID标记化流水线,通过地理约束下的共访POI对、对比学习和迭代优化显式学习时空协同语义表示;(2) 多阶段LLM训练策略,借助模板驱动的持续预训练(Continued Pre-Training, CPT)实现非原生SID token的对齐,并通过监督微调(Supervised Fine-Tuning, SFT)支持自回归POI生成。该方案在多个真实数据集上优于现有最优基线,并在AMAP平台成功部署,验证了其在实际场景中的有效性与可扩展性。

链接: https://arxiv.org/abs/2602.10411
作者: Fangye Wang,Haowen Lin,Yifang Yuan,Siyuan Wang,Xiaojiang Zhou,Song Yang,Pengjie Wang
机构: AMAP, Alibaba Group(阿里巴巴集团)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Next Point-of-Interest (POI) prediction is a fundamental task in location-based services, especially critical for large-scale navigation platforms like AMAP that serve billions of users across diverse lifestyle scenarios. While recent POI recommendation approaches based on SIDs have achieved promising, they struggle in complex, sparse real-world environments due to two key limitations: (1) inadequate modeling of high-quality SIDs that capture cross-category spatio-temporal collaborative relationships, and (2) poor alignment between large language models (LLMs) and the POI recommendation task. To this end, we propose GeoGR, a geographic generative recommendation framework tailored for navigation-based LBS like AMAP, which perceives users’ contextual state changes and enables intent-aware POI recommendation. GeoGR features a two-stage design: (i) a geo-aware SID tokenization pipeline that explicitly learns spatio-temporal collaborative semantic representations via geographically constrained co-visited POI pairs, contrastive learning, and iterative refinement; and (ii) a multi-stage LLM training strategy that aligns non-native SID tokens through multiple template-based continued pre-training(CPT) and enables autoregressive POI generation via supervised fine-tuning(SFT). Extensive experiments on multiple real-world datasets demonstrate GeoGR’s superiority over state-of-the-art baselines. Moreover, deployment on the AMAP platform, serving millions of users with multiple online metrics boosting, confirms its practical effectiveness and scalability in production.

[IR-17] Single-Turn LLM Reformulation Powered Multi-Stage Hybrid Re-Ranking for Tip-of-the-Tongue Known-Item Retrieval

【速读】:该论文旨在解决Tip-of-the-Tongue (ToT)检索问题,即从模糊描述中获取已知信息项的挑战,尤其在标准伪相关反馈(Pseudo-Relevance Feedback)因初始召回率低而失效的情况下。其解决方案的关键在于利用一个未经微调的80亿参数通用大语言模型(LLM)进行查询改写,通过精心设计的提示策略将不规范的ToT查询转化为更符合信息需求的具体表达,从而显著提升后续检索系统的性能。这一轻量级预检索转换方法无需领域特定训练,仅依赖于有效的 prompting 策略,便使召回率提升20.61%,并进一步通过多阶段重排序机制(包括稀疏检索、密集表示重排、交叉编码和列表级重排)大幅提升nDCG@10、MRR和MAP@10等指标,展现出成本效益高的端到端检索优化潜力。

链接: https://arxiv.org/abs/2602.10321
作者: Debayan Mukhopadhyay,Utshab Kumar Ghosh,Shubham Chatterjee
机构: Missouri University of Science and Technology (密苏里科学技术大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieving known items from vague descriptions, Tip-of-the-Tongue (ToT) retrieval, remains a significant challenge. We propose using a single call to a generic 8B-parameter LLM for query reformulation, bridging the gap between ill-formed ToT queries and specific information needs. This method is particularly effective where standard Pseudo-Relevance Feedback fails due to poor initial recall. Crucially, our LLM is not fine-tuned for ToT or specific domains, demonstrating that gains stem from our prompting strategy rather than model specialization. Rewritten queries feed a multi-stage pipeline: sparse retrieval (BM25), dense/late-interaction reranking (Contriever, E5-large-v2, ColBERTv2), monoT5 cross-encoding, and list-wise reranking (Qwen 2.5 72B). Experiments on 2025 TREC-ToT datasets show that while raw queries yield poor performance, our lightweight pre-retrieval transformation improves Recall by 20.61%. Subsequent reranking improves nDCG@10 by 33.88%, MRR by 29.92%, and MAP@10 by 29.98%, offering a cost-effective intervention that unlocks the potential of downstream rankers. Code and data: this https URL

[IR-18] ECHO: An Open Research Platform for Evaluation of Chat Human Behavior and Outcomes

【速读】:该论文旨在解决当前研究中缺乏统一、可复现的混合方法平台,用于系统性评估人类与对话式AI系统及网络搜索引擎等不同信息获取范式交互行为的问题。其解决方案的关键在于提出ECHO(Evaluation of Chat, Human behavior, and Outcomes)平台,该平台通过低代码框架集成用户同意、背景调查、基于聊天和搜索的信息检索任务、写作或判断任务以及前后测评估,实现端到端实验流程的自动化管理,并以细粒度日志记录交互轨迹与响应数据,生成结构化数据集供下游分析,从而降低跨学科研究的技术门槛,支持信息检索、人机交互(HCI)及社会科学领域开展可扩展、可复现的人本AI评估。

链接: https://arxiv.org/abs/2602.10295
作者: Jiqun Liu,Nischal Dinesh,Ran Yu
机构: The University of Oklahoma(俄克拉荷马大学); GESIS – Leibniz Institute for the Social Sciences(德国社会科学研究机构)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:ECHO (Evaluation of Chat, Human behavior, and Outcomes) is an open research platform designed to support reproducible, mixed-method studies of human interaction with both conversational AI systems and Web search engines. It enables researchers from varying disciplines to orchestrate end-to-end experimental workflows that integrate consent and background surveys, chat-based and search-based information-seeking sessions, writing or judgment tasks, and pre- and post-task evaluations within a unified, low-coding-load framework. ECHO logs fine-grained interaction traces and participant responses, and exports structured datasets for downstream analysis. By supporting both chat and search alongside flexible evaluation instruments, ECHO lowers technical barriers for studying learning, decision making, and user experience across different information access paradigms, empowering researchers from information retrieval, HCI, and the social sciences to conduct scalable and reproducible human-centered AI evaluations.

[IR-19] MLDocRAG : Multimodal Long-Context Document Retrieval Augmented Generation

【速读】:该论文旨在解决多模态长文档理解中的两大挑战:一是跨模态异质性(cross-modal heterogeneity),即如何在不同模态(如文本、图像、表格)之间准确定位相关信息;二是跨页推理(cross-page reasoning),即如何聚合分散在多个页面中的证据。其解决方案的关键在于提出一种以查询为中心的建模框架——多模态长文档检索增强生成(MLDocRAG),通过构建多模态块-查询图(Multimodal Chunk-Query Graph, MCQG)来组织文档内容,其中细粒度查询由异构文档块生成,并与跨模态和跨页的内容建立链接。该图结构实现了基于查询的选择性检索与结构化证据聚合,从而提升长上下文多模态问答任务中的答案准确性与语义连贯性。

链接: https://arxiv.org/abs/2602.10271
作者: Yongyue Zhang,Yaxiong Wu
机构: Independent Researcher(独立研究员)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 15 pages

点击查看摘要

Abstract:Understanding multimodal long-context documents that comprise multimodal chunks such as paragraphs, figures, and tables is challenging due to (1) cross-modal heterogeneity to localize relevant information across modalities, (2) cross-page reasoning to aggregate dispersed evidence across pages. To address these challenges, we are motivated to adopt a query-centric formulation that projects cross-modal and cross-page information into a unified query representation space, with queries acting as abstract semantic surrogates for heterogeneous multimodal content. In this paper, we propose a Multimodal Long-Context Document Retrieval Augmented Generation (MLDocRAG) framework that leverages a Multimodal Chunk-Query Graph (MCQG) to organize multimodal document content around semantically rich, answerable queries. MCQG is constructed via a multimodal document expansion process that generates fine-grained queries from heterogeneous document chunks and links them to their corresponding content across modalities and pages. This graph-based structure enables selective, query-centric retrieval and structured evidence aggregation, thereby enhancing grounding and coherence in long-context multimodal question answering. Experiments on datasets MMLongBench-Doc and LongDocURL demonstrate that MLDocRAG consistently improves retrieval quality and answer accuracy, demonstrating its effectiveness for long-context multimodal understanding.

[IR-20] JAG: Joint Attribute Graphs for Filtered Nearest Neighbor Search

【速读】:该论文旨在解决过滤最近邻搜索(Filtered Nearest Neighbor Search)在实际部署中性能不稳定的问题,即现有算法对查询选择性(query selectivity)和过滤类型(filter type)高度敏感,难以同时适应多种过滤类型(如标签相等、范围、子集、布尔过滤)和广泛的选择性区间。解决方案的关键在于提出JAG(Joint Attribute Graphs),其核心创新是引入属性距离(attribute distance)与过滤距离(filter distance),将二值过滤约束转化为连续的导航引导;通过构建一个联合优化向量相似性和属性邻近性的邻近图,有效避免导航死胡同,从而在全选择性范围内实现稳定且高效的检索性能。实验表明,JAG在五个数据集上对四种过滤类型均显著优于现有最优基线方法,在吞吐量和召回鲁棒性方面表现优异。

链接: https://arxiv.org/abs/2602.10258
作者: Haike Xu,Guy Blelloch,Laxman Dhulipala,Lars Gottesbüren,Rajesh Jayaram,Jakub Łącki
机构: 未知
类目: Information Retrieval (cs.IR); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Despite filtered nearest neighbor search being a fundamental task in modern vector search systems, the performance of existing algorithms is highly sensitive to query selectivity and filter type. In particular, existing solutions excel either at specific filter categories (e.g., label equality) or within narrow selectivity bands (e.g., pre-filtering for low selectivity) and are therefore a poor fit for practical deployments that demand generalization to new filter types and unknown query selectivities. In this paper, we propose JAG (Joint Attribute Graphs), a graph-based algorithm designed to deliver robust performance across the entire selectivity spectrum and support diverse filter types. Our key innovation is the introduction of attribute and filter distances, which transform binary filter constraints into continuous navigational guidance. By constructing a proximity graph that jointly optimizes for both vector similarity and attribute proximity, JAG prevents navigational dead-ends and allows JAG to consistently outperform prior graph-based filtered nearest neighbor search methods. Our experimental results across five datasets and four filter types (Label, Range, Subset, Boolean) demonstrate that JAG significantly outperforms existing state-of-the-art baselines in both throughput and recall robustness.

[IR-21] Silence Routing: When Not Speaking Improves Collective Judgment

【速读】:该论文试图解决在品味类判断(如音乐偏好)中,如何有效利用不同类型的社交信号以提升群体智能预测准确性的关键问题。其解决方案的核心在于提出了一种“信号路由框架”(routing framework),该框架明确界定个体在何种情境下应报告自身偏好(Own)、何时应报告对群体偏好的估计(Estimated),以及何时保持沉默更为有利。研究发现,仅当允许沉默时,第二层信号(second-order signals)才能有效发挥作用,从而显著优于单纯聚合所有个体自评的基线方法,表明品味领域的集体智慧依赖于有原则的信号路由机制,而非简单的平均策略。

链接: https://arxiv.org/abs/2602.10145
作者: Itsuki Fujisaki,Kunhao Yang
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 7pages, 2 figures

点击查看摘要

Abstract:The wisdom of crowds has been shown to operate not only for factual judgments but also in matters of taste, where accuracy is defined relative to an individual’s preferences. However, it remains unclear how different types of social signals should be selectively used in such domains. Focusing on a music preference dataset in which contributors provide both personal evaluations (Own) and estimates of population-level preferences (Estimated), we propose a routing framework for collective intelligence in taste. The framework specifies when contributors should speak, what they should report, and when silence is preferable. Using simulation-based aggregation, we show that prediction accuracy improves over an all-own baseline across a broad region of the parameter space, conditional on items where routing applies. Importantly, these gains arise only when silence is allowed, enabling second-order signals to function effectively. The results demonstrate that collective intelligence in matters of taste depends on principled signal routing rather than simple averaging.

人机交互

[HC-0] LCIP: Loss-Controlled Inverse Projection of High-Dimensional Image Data

【速读】:该论文旨在解决逆投影(Inverse Projection)方法在数据空间中生成结构受限的问题,即现有方法仅能生成固定表面状的结构,难以充分覆盖高维数据空间的复杂性。其解决方案的关键在于提出一种新型逆投影方法,能够通过用户控制实现对数据空间的“扫掠”(sweep),从而更灵活地生成多样化的数据重构结果。该方法具有通用性,适用于任意降维技术(Projection, P)和数据集,仅需两个直观的用户参数即可调控,并且实现简单,已在图像风格迁移任务中得到验证。

链接: https://arxiv.org/abs/2602.11141
作者: Yu Wang,Frederik L. Dennig,Michael Behrisch,Alexandru Telea
机构: Utrecht University (乌得勒支大学); University of Konstanz (康斯坦茨大学)
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Projections (or dimensionality reduction) methods P aim to map high-dimensional data to typically 2D scatterplots for visual exploration. Inverse projection methods P^-1 aim to map this 2D space to the data space to support tasks such as data augmentation, classifier analysis, and data imputation. Current P^-1 methods suffer from a fundamental limitation – they can only generate a fixed surface-like structure in data space, which poorly covers the richness of this space. We address this by a new method that can `sweep’ the data space under user control. Our method works generically for any P technique and dataset, is controlled by two intuitive user-set parameters, and is simple to implement. We demonstrate it by an extensive application involving image manipulation for style transfer.

[HC-1] AI Sensing and Intervention in Higher Education: Student Perceptions of Learning Impacts Affective Responses and Ethical Priorities

【速读】:该论文试图解决的问题是:在教育场景中,生成式 AI (Generative AI) 技术用于感知学生注意力和情绪以实施个性化教学干预时,如何平衡其对学生学习效果、心理福祉及伦理影响的复杂性,尤其关注学生视角常被忽视的现状。解决方案的关键在于设计可定制、社交敏感且非侵入性的系统,确保学生保有控制权、学习自主性(learning agency)与心理舒适度,同时优先保障隐私与自主权等核心伦理原则,避免因AI监控引发的社会尴尬或被动感,从而提升干预的接受度与有效性。

链接: https://arxiv.org/abs/2602.11074
作者: Bingyi Han,Ying Ma,Simon Coghlan,Dana McKay,George Buchanan,Wally Smith
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CHI 2026. This is the accepted author version

点击查看摘要

Abstract:AI technologies that sense student attention and emotions to enable more personalised teaching interventions are increasingly promoted, but raise pressing questions about student learning, well-being, and ethics. In particular, students’ perspectives about AI sensing-intervention in learning are often overlooked. We conducted an online mixed-method experiment with Australian university students (N=132), presenting video scenarios varying by whether sensing was used (in-use vs. not-in-use), sensing modality (gaze-based attention detection vs. facial-based emotion detection), and intervention (by digital device vs. teacher). Participants also completed pairwise ranking tasks to prioritise six core ethical concerns. Findings revealed that students valued targeted intervention but responded negatively to AI monitoring, regardless of sensing methods. Students preferred system-generated hints over teacher-initiated assistance, citing learning agency and social embarrassment concerns. Students’ ethical considerations prioritised autonomy and privacy, followed by transparency, accuracy, fairness, and learning beneficence. We advocate designing customisable, social-sensitive, non-intrusive systems that preserve student control, agency, and well-being.

[HC-2] GenFaceUI: Meta-Design of Generative Personalized Facial Expression Interfaces for Intelligent Agents

【速读】:该论文旨在解决智能代理(intelligent agents)在运行时生成面部表情过程中面临的控制性(controllability)、一致性(coherence)与对齐性(alignment)难题。其解决方案的关键在于提出了一种元设计(meta-design)视角下的生成式个性化面部表情接口(Generative Personalized Facial Expression Interface, GPFEI)框架,该框架通过组织规则约束空间、角色身份(character identity)以及上下文-表情映射机制,实现对生成过程的结构化调控与语义一致性保障。为验证该框架,作者进一步开发了GenFaceUI工具,支持设计者创建模板、标注语义标签、定义规则并迭代测试结果,从而推动生成式AI在面部表情交互中的可控性和实用性提升。

链接: https://arxiv.org/abs/2602.11055
作者: Yate Ge,Lin Tian,Yi Dai,Shuhan Pan,Yiwen Zhang,Qi Wang,Weiwei Guo,Xiaohua Sun
机构: Tongji University (同济大学); Southern University of Science and Technology (南方科技大学); University of Washington (华盛顿大学); Wuhan University of Technology (武汉理工大学); Beijing (北京)
类目: Human-Computer Interaction (cs.HC)
备注: To appear at ACM CHI '26

点击查看摘要

Abstract:This work investigates generative facial expression interfaces for intelligent agents from a meta-design perspective. We propose the Generative Personalized Facial Expression Interface (GPFEI) framework, which organizes rule-bounded spaces, character identity, and context–expression mapping to address challenges of control, coherence, and alignment in run-time facial expression generation. To operationalize this framework, we developed GenFaceUI, a proof-of-concept tool that enables designers to create templates, apply semantic tags, define rules, and iteratively test outcomes. We evaluated the tool through a qualitative study with twelve designers. The results show perceived gains in controllability and consistency, while revealing needs for structured visual mechanisms and lightweight explanations. These findings provide a conceptual framework, a proof-of-concept tool, and empirical insights that highlight both opportunities and challenges for advancing generative facial expression interfaces within a broader meta-design paradigm.

[HC-3] Normalized Surveillance in the Datafied Car: How Autonomous Vehicle Users Rationalize Privacy Trade-offs

【速读】:该论文旨在解决自动驾驶汽车(Autonomous Vehicles, AVs)在日常数据化过程中引发的隐私关切问题,特别是用户如何理解和接受车辆内部监控系统(如舱内摄像头、激光雷达和GPS等传感器)所构成的持续性数据采集行为。研究发现,AV使用者普遍缺乏特定于自动驾驶的隐私担忧,而是将这种监控视为与智能手机和智能家居平台类似的数据化常态,从而形成对车载 surveillance 的“默认接受”。其核心解决方案在于提出治理干预措施:通过确立普遍的数据访问权、强制性的透明度要求以及数据最小化标准,以防止因数据密集型机器学习需求导致的监管竞次(race-to-the-bottom)现象,并推动社会层面的数据伦理共识构建,实现对汽车数据化生态的民主化治理。

链接: https://arxiv.org/abs/2602.11026
作者: Yehuda Perry,Tawfiq Ammari
机构: Rutgers University, USA(罗格斯大学,美国)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Autonomous vehicles (AVs) are characterized by pervasive datafication and surveillance through sensors like in-cabin cameras, LIDAR, and GPS. Drawing on 16 semi-structured interviews with AV drivers analyzed using constructivist grounded theory, this study examines how users make sense of vehicular surveillance within everyday datafication. Findings reveal drivers demonstrate few AV-specific privacy concerns, instead normalizing monitoring through comparisons with established digital platforms. We theorize this indifference by situating AV surveillance within the surveillance ecology' of platform environments, arguing the datafied car functions as a mobile extension of the leaky home’ – private spaces rendered permeable through connected technologies continuously transmitting behavioral data. The study contributes to scholarship on surveillance beliefs, datafication, and platform governance by demonstrating how users who have accepted comprehensive smartphone and smart home monitoring encounter AV datafication as just another node in normalized data extraction. We highlight how geographic restrictions on data access – currently limiting driver log access to California – create asymmetries that impede informed privacy deliberation, exemplifying `tertiary digital divides.’ Finally, we examine how machine learning’s reliance on data-intensive approaches creates structural pressure for surveillance that transcends individual manufacturer choices. We propose governance interventions to democratize social learning, including universal data access rights, binding transparency requirements, and data minimization standards to prevent race-to-the-bottom dynamics in automotive datafication. Subjects: Human-Computer Interaction (cs.HC) Cite as: arXiv:2602.11026 [cs.HC] (or arXiv:2602.11026v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2602.11026 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[HC-4] Reality Copilot: Voice-First Human-AI Collaboration in Mixed Reality Using Large Multimodal Models

【速读】:该论文旨在解决当前混合现实(Mixed Reality, MR)系统中人机交互方式受限、智能化程度不足的问题,即多数MR应用仍依赖手动输入(如手势或控制器),且缺乏与大规模人工智能模型的深度集成,从而限制了其在日常任务中的智能辅助能力。解决方案的关键在于提出Reality Copilot——一个以语音为先的人工智能助手,通过引入大模型多模态(Large Multimodal Models, LMMs)实现对物理环境的上下文理解、实时信息检索以及真实感3D内容生成,从而支持自然语言交互,并通过跨平台工作流实现上下文感知的文本内容生成与资产导出,推动LMM驱动的人机协作在混合现实场景中的落地。

链接: https://arxiv.org/abs/2602.11025
作者: Liuchuan Yu,Yongqi Zhang,Lap-Fai Yu
机构: George Mason University (乔治梅森大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have shown strong potential for assisting users in tasks, such as programming, content creation, and information access, yet their interaction remains largely limited to traditional interfaces such as desktops and smartphones. Meanwhile, advances in mixed reality (MR) hardware have enabled applications that extend beyond entertainment and into everyday use. However, most existing MR systems rely primarily on manual input (e.g., hand gestures or controllers) and provide limited intelligent assistance due to the lack of integration with large-scale AI models. We present Reality Copilot, a voice-first human-AI assistant for mixed reality that leverages LMMs to enable natural speech-based interaction. The system supports contextual understanding of physical environments, realistic 3D content generation, and real-time information retrieval. In addition to in-headset interaction, Reality Copilot facilitates cross-platform workflows by generating context-aware textual content and exporting generated assets. This work explores the design space of LMM-powered human-AI collaboration in mixed reality.

[HC-5] Design Development and Use of Maya Robot as an Assistant for the Therapy/Education of Children with Cancer: a Pilot Study

【速读】:该论文旨在解决儿童癌症治疗过程中疼痛感知强烈及情绪焦虑问题,通过引入便携式大象造型社交机器人Maya作为干预手段,探索其在临床环境中的辅助治疗与教育价值。解决方案的关键在于利用深度神经网络提升机器人面部表情识别准确率(达98%),并通过两项预实验验证其效果:一是通过对比有无机器人陪伴下儿童注射时的疼痛感知差异,发现机器人显著降低儿童主观痛感;二是通过游戏互动评估儿童与母亲对机器人的信任度和焦虑水平,结果显示儿童比父母表现出更低焦虑和更高信任,且差异具有统计学意义(P < 0.05)。这表明社交机器人在改善患儿心理状态和疼痛管理方面具有显著潜力,为儿科医疗场景中人机协同干预提供了实证支持。

链接: https://arxiv.org/abs/2602.10942
作者: Alireza Taheri,Minoo Alemi,Elham Ranjkar,Raman Rafatnejad,Ali F. Meghdari
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This study centers around the design and implementation of the Maya Robot, a portable elephant-shaped social robot, intended to engage with children undergoing cancer treatment. Initial efforts were devoted to enhancing the robot’s facial expression recognition accuracy, achieving a 98% accuracy through deep neural networks. Two subsequent preliminary exploratory experiments were designed to advance the study’s objectives. The first experiment aimed to compare pain levels experienced by children during the injection process, with and without the presence of the Maya robot. Twenty-five children, aged 4 to 9, undergoing cancer treatment participated in this counterbalanced study. The paired T-test results revealed a significant reduction in perceived pain when the robot was actively present in the injection room. The second experiment sought to assess perspectives of hospitalized children and their mothers during engagement with Maya through a game. Forty participants, including 20 children aged 4 to 9 and their mothers, were involved. Post Human-Maya Interactions, UTAUT questionnaire results indicated that children experienced significantly less anxiety than their parents during the interaction and game play. Notably, children exhibited higher trust levels in both the robot and the games, presenting a statistically significant difference in trust levels compared to their parents (P-value 0.05). This preliminary exploratory study highlights the positive impact of utilizing Maya as an assistant for therapy/education in a clinical setting, particularly benefiting children undergoing cancer treatment. The findings underscore the potential of social robots in pediatric healthcare contexts, emphasizing improved pain management and emotional well-being among young patients.

[HC-6] What do people want to fact-check?

【速读】:该论文试图解决的是虚假信息研究中长期被忽视的需求侧问题,即当普通人能够自由地对任何信息进行事实核查时,他们实际上会提出哪些核查请求。解决方案的关键在于构建并分析一个大规模的开放式AI事实核查系统中用户提交的近2,500条陈述,通过五维语义分类(领域、认识论形式、可验证性、目标实体和时间参照)生成公众验证需求的行为图谱,从而揭示真实世界中的事实核查行为模式与现有评估基准之间的结构性偏差。

链接: https://arxiv.org/abs/2602.10935
作者: Bijean Ghafouri,Dorsaf Sallami,Luca Luceri,Taylor Lynn Curtis,Jean-Francois Godbout,Emilio Ferrara,Reihaneh Rabbany
机构: University of Southern California (南加州大学); Mila; Université de Montréal (蒙特利尔大学); McGill University (麦吉尔大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Research on misinformation has focused almost exclusively on supply, asking what falsehoods circulate, who produces them, and whether corrections work. A basic demand-side question remains unanswered. When ordinary people can fact-check anything they want, what do they actually ask about? We provide the first large-scale evidence on this question by analyzing close to 2,500 statements submitted by 457 participants to an open-ended AI fact-checking system. Each claim is classified along five semantic dimensions (domain, epistemic form, verifiability, target entity, and temporal reference), producing a behavioral map of public verification demand. Three findings stand out. First, users range widely across topics but default to a narrow epistemic repertoire, overwhelmingly submitting simple descriptive claims about present-day observables. Second, roughly one in four requests concerns statements that cannot be empirically resolved, including moral judgments, speculative predictions, and subjective evaluations, revealing a systematic mismatch between what users seek from fact-checking tools and what such tools can deliver. Third, comparison with the FEVER benchmark dataset exposes sharp structural divergences across all five dimensions, indicating that standard evaluation corpora encode a synthetic claim environment that does not resemble real-world verification needs. These results reframe fact-checking as a demand-driven problem and identify where current AI systems and benchmarks are misaligned with the uncertainty people actually experience.

[HC-7] Viewpoint Recommendation for Point Cloud Labeling through Interaction Cost Modeling

【速读】:该论文旨在解决3D点云语义分割任务中数据标注耗时过长的问题,尤其是基于lasso选择的点选操作在不同视角下存在显著的时间成本差异。其解决方案的关键在于将Fitts’定律(Fitts’ law)引入点云标注场景,建立用于建模lasso选择时间成本的数学模型,并据此推荐能最小化标注时间成本的最优观察视角,从而提升标注效率。该方法被集成至一个完整的点云标注系统中,支持用户导航至推荐视角进行高效标注,实验表明该策略可有效降低标注时间成本。

链接: https://arxiv.org/abs/2602.10871
作者: Yu Zhang,Xinyi Zhao,Chongke Bi,Siming Chen
机构: Fudan University (复旦大学); University of Oxford (牛津大学); Tianjin University (天津大学); Shanghai Key Laboratory of Data Science (上海市数据科学重点实验室)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE TVCG

点击查看摘要

Abstract:Semantic segmentation of 3D point clouds is important for many applications, such as autonomous driving. To train semantic segmentation models, labeled point cloud segmentation datasets are essential. Meanwhile, point cloud labeling is time-consuming for annotators, which typically involves tuning the camera viewpoint and selecting points by lasso. To reduce the time cost of point cloud labeling, we propose a viewpoint recommendation approach to reduce annotators’ labeling time costs. We adapt Fitts’ law to model the time cost of lasso selection in point clouds. Using the modeled time cost, the viewpoint that minimizes the lasso selection time cost is recommended to the annotator. We build a data labeling system for semantic segmentation of 3D point clouds that integrates our viewpoint recommendation approach. The system enables users to navigate to recommended viewpoints for efficient annotation. Through an ablation study, we observed that our approach effectively reduced the data labeling time cost. We also qualitatively compare our approach with previous viewpoint selection approaches on different datasets.

[HC-8] he Effect of Design Thinking on Creative Innovation Processes: An Empirical Study Across Different Design Experience Levels

【速读】:该论文旨在解决设计创造力(Design Creativity Innovation)的驱动机制问题,特别是厘清思维技能、设计思维(Design Thinking)、创造性自我效能感(Creative Self-Efficacy, CSE)与集体创造性效能感(Collective Creative Efficacy, CCE)之间的因果路径及其在不同经验水平群体中的结构稳定性。解决方案的关键在于构建并验证一个整合模型,通过线性回归和结构方程建模揭示四类设计思维技能(问题驱动型、信息驱动型、解决方案驱动型、知识驱动型)对设计思维的显著正向影响,并进一步证实设计思维对设计创造力的预测作用;同时,中介分析识别出CSE、CCE及二者链式中介路径的三重传导机制,且多组比较表明该模型在学生与专业群体间具有良好的测量与结构等价性,为差异化教学策略和职业实践指南提供了坚实的实证依据。

链接: https://arxiv.org/abs/2602.10827
作者: Yuxin Zhang,Fan Zhang
机构: Academy of Arts & Design (美术学院); Tsinghua University (清华大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This study employs linear regression and structural equation modeling to explore how Thinking Skills, Design Thinking, Creative Self-Efficacy (CSE), and Collective Creative Efficacy (CCE) drive Design Creativity Innovation, and analyzes the structural stability of the model across different levels of experience. Path analysis results indicate that the four Design Thinking Skills, Problem-driven Design (beta = 0.198, p 0.01), Information-driven Design (beta = 0.241, p 0.001), Solution-driven Design (beta = 0.227, p 0.001), and Knowledge-driven Design (beta = 0.263, p 0.001) all significantly and positively influence Design Thinking. Furthermore, Design Thinking has a significant positive predictive effect on Design Creativity Innovation (beta = 0.286, p 0.001). Mediation analysis confirms three significant mediation paths: the CSE mediation path (beta = 0.128, p 0.001), the CCE mediation path (beta = 0.073, p 0.01), and the “CSE to CCE” chain mediation path (beta = 0.025, p 0.01). Multi-group comparison results reveal significant differences between the student and professional groups under the full equivalence model. After relaxing specific constraints, there were no significant differences between the nested models of the baseline model, partial measurement invariance, structural weight invariance, and structural covariance invariance. These findings elucidate the multi-dimensional pathways of Design Creativity Innovation, providing a robust empirical basis for optimizing differentiated pedagogical models and professional practice guidelines.

[HC-9] Dont blame me: How Intelligent Support Affects Moral Responsibility in Human Oversight

【速读】:该论文旨在解决在安全关键任务中,当AI决策支持系统通过限制人类监督者的选择权来辅助其进行监督时,可能削弱监督者对不良后果(如事故)的道德责任感知的问题。研究发现,当监督者被限制只能从单一选项中选择时,他们对事故发生后的道德责任感受显著降低,而对AI及开发者责任的判断则保持不变。解决方案的关键在于:用户界面设计和监督架构应避免用户将道德代理权赋予AI,帮助其清晰理解道德责任的分配机制,并在以预防伦理上不可接受结果为目标的监督场景中,确保支持系统能够维护履行道德责任所需的认知与因果条件。

链接: https://arxiv.org/abs/2602.10701
作者: Cedric Faas,Richard Uth,Sarah Sterz,Markus Langer,Anna Maria Feit
机构: Saarland Informatics Campus, Saarland University (萨尔兰大学); Department of Psychology, University of Freiburg (弗莱堡大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:AI-based systems can increasingly perform work tasks autonomously. In safety-critical tasks, human oversight of these systems is required to mitigate risks and to ensure responsibility in case something goes wrong. Since people often struggle to stay focused and perform good oversight, intelligent support systems are used to assist them, giving decision recommendations, alerting users, or restricting them from dangerous actions. However, in cases where recommendations are wrong, decision support might undermine the very reason why human oversight was employed – genuine moral responsibility. The goal of our study was to investigate how a decision support system that restricted available interventions would affect overseer’s perceived moral responsibility, in particular in cases where the support errs. In a simulated oversight experiment, participants (\textitN=274) monitored an autonomous drone that faced ten critical situations, choosing from six possible actions to resolve each situation. An AI system constrained participants’ choices to either six, four, two, or only one option (between-subject study). Results showed that participants, who were restricted to choosing from a single action, felt less morally responsible if a crash occurred. At the same time, participants’ judgments about the responsibility of other stakeholders (the AI; the developer of the AI) did not change between conditions. Our findings provide important insights for user interface design and oversight architectures: they should prevent users from attributing moral agency to AI, help them understand how moral responsibility is distributed, and, when oversight aims to prevent ethically undesirable outcomes, be designed to support the epistemic and causal conditions required for moral responsibility.

[HC-10] Privacy Control in Conversational LLM Platforms: A Walkthrough Study

【速读】:该论文旨在解决当前生成式 AI(Generative AI)对话平台中用户对个人数据控制不足的问题,尤其是在自然语言交互场景下,用户难以有效访问、编辑、删除或共享其数据。解决方案的关键在于识别并分析六种主流对话式大语言模型(Large Language Models, LLMs)平台的数据控制机制,揭示出三个核心挑战:一是用户数据在交互过程中动态生成与衍生;二是自然语言输入带来的控制指令模糊性;三是多用户共享数据引发的共属权与治理问题。基于此,论文提出需从平台设计、政策制定和研究实践三方面协同优化隐私控制机制,以提升用户对数据生命周期的可控性和透明度。

链接: https://arxiv.org/abs/2602.10684
作者: Zhuoyang Li,Yanlai Wu,Yao Li,Xinning Gui,Yuhan Luo
机构: Eindhoven University of Technology (埃因霍温理工大学); University of Central Florida (中佛罗里达大学); The Pennsylvania State University (宾夕法尼亚州立大学); City University of Hong Kong (香港城市大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly integrated into daily life through conversational interfaces, processing user data via natural language inputs and exhibiting advanced reasoning capabilities, which raises new concerns about user control over privacy. While much research has focused on potential privacy risks, less attention has been paid to the data control mechanisms these platforms provide. This study examines six conversational LLM platforms, analyzing how they define and implement features for users to access, edit, delete, and share data. Our analysis reveals an emerging paradigm of data control in conversational LLM platforms, where user data is generated and derived through interaction itself, natural language enables flexible yet often ambiguous control, and multi-user interactions with shared data raise questions of co-ownership and governance. Based on these findings, we offer practical insights for platform developers, policymakers, and researchers to design more effective and usable privacy controls in LLM-powered conversational interactions.

[HC-11] From Interaction to Demonstration Quality in Virtual Reality: Effects of Interaction Modality and Visual Representation on Everyday Tasks

【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)中不同输入设备配置对用户任务执行效率与体验的影响问题,尤其关注如何根据任务类型选择最优交互方式以提升训练效果和演示质量。其解决方案的关键在于通过对比三种典型VR输入配置——动作捕捉手套、带手部可视化的控制器、以及带控制器可视化的控制器——在厨房日常活动(如放置物体、切割、清洁和倾倒)中的表现,结合系统可用性量表(System Usability Scale)和NASA任务负荷指数(NASA Task Load Index)评估用户体验,并采用轨迹分割分析运动效率、冗余动作及执行精度来量化交互行为差异。结果揭示了配置特异性的执行策略:控制器在拾取-放置类任务中更高效且稳定,而动作捕捉手套则在涉及手法操作的任务中更自然但波动更大,从而明确了效率与自然性之间的权衡关系,为VR交互设计的针对性优化提供了实证依据。

链接: https://arxiv.org/abs/2602.10618
作者: Robin Beierling,Manuel Scheibl,Jonas Dech,Abhijit Vyas,Anna-Lisa Vollmer
机构: University of Bielefeld (比勒费尔德大学); University of Bremen (不来梅大学)
类目: Human-Computer Interaction (cs.HC)
备注: 26 pages, 6 figures, 7 tables (including appendix)

点击查看摘要

Abstract:Virtual Reality (VR) is increasingly used for training and demonstration purposes including a variety of applications ranging from robot learning to rehabilitation. However, the choice of input device and its visualization might influence workload and thus user performance leading to suboptimal demonstrations or reduced training effects. This study investigates how different VR input configurations - motion capture gloves, controllers with hand visualization, and controllers with controller visualization - affect user experience and task execution, with the goal of identifying which configuration is best suited for which type of task. Participants performed various kitchen-related activities of daily living (ADLs), including object placement, cutting, cleaning, and pouring in a simulated environment. To address two research questions, we evaluated user experience using the System Usability Scale and NASA Task Load Index (RQ1), and task-specific interaction behavior (RQ2). The latter was assessed using trajectory segmentation, analyzing movement efficiency, unnecessary actions, and execution precision. While no significant differences in overall usability and workload were found, trajectory analysis revealed configuration-specific execution behaviors with different movement strategies. Controllers enabled significantly faster task completion with less movement variability in pick-and-place style tasks such as table setting. In contrast, motion capture gloves produced more natural movements with fewer unnecessary actions, but also showed greater variance in movement patterns for manner-oriented tasks such as cutting bread. These findings highlight trade-offs between efficiency and naturalism, and have implications for optimizing VR-based training, improving the quality of user-generated demonstrations, and tailoring interaction design to specific application goals.

[HC-12] Labor Capital and Machine: Toward a Labor Process Theory for HCI

【速读】:该论文试图解决的问题是:当前人机交互(Human-Computer Interaction, HCI)领域对劳动问题和计算的政治经济学关注不足,缺乏系统性的理论框架来深入理解资本主义制度下现代工作与劳动者的关系,以及技术如何中介和重塑职场权力结构。其解决方案的关键在于引入并应用劳动过程理论(Labor Process Theory, LPT),通过梳理从马克思、布拉弗曼到伯劳伊等学者的发展脉络,提出一系列面向HCI研究与实践的方向:区分劳动与工作、将工作实践与价值生产相联结、向上研究管理逻辑、分析同意与合法性机制、超越生产环节、设计替代性制度、去自然化资产阶级设计。这些方向旨在深化对技术中介职场体制的结构性分析,推动批判性与规范性设计,并加强HCI与更广泛政治经济批判之间的联系。

链接: https://arxiv.org/abs/2602.10548
作者: Yigang Qin,EunJeong Cheon
机构: Syracuse University (雪城大学)
类目: Human-Computer Interaction (cs.HC)
备注: 20 pages, 2 figures. Accepted to CHI’26

点击查看摘要

Abstract:The HCI community has called for renewed attention to labor issues and the political economy of computing. Yet much work remains in engaging with labor theory to better understand modern work and workers. This article traces the development of Labor Process Theory (LPT) – from Karl Marx and Harry Braverman to Michael Burawoy and beyond – and introduces it as an essential yet underutilized resource for structural analysis of work under capitalism and the design of computing systems. We examine HCI literature on labor, investigating focal themes and conceptual, empirical, and design approaches. Drawing from LPT, we offer directions for HCI research and practice: distinguish labor from work, link work practice to value production, study up the management, analyze consent and legitimacy, move beyond the point of production, design alternative institutions, and unnaturalize bourgeois designs. These directions can deepen analyses of tech-mediated workplace regimes, inform critical and normative designs, and strengthen the field’s connection to broader political economic critique.

[HC-13] Exploring the Interplay Between Voice Personality and Gender in Human-Agent Interactions

【速读】:该论文旨在解决如何通过用户-代理(agent)的同步性(synchrony)来提升人机交互中代理的感知与接受度问题,尤其关注人格特质(personality traits)和性别(gender)在其中的作用。其解决方案的关键在于实证检验了用户能否从四种人工语音中识别出人格特征(内向或外向),并发现女性代理在人格区分上具有显著可感知性,而男性代理则不然;同时观察到“人格同步效应”——用户倾向于将首个交互代理视为与其自身人格更相似,这一效应主要由男性参与者驱动,且对男性代理更为明显。该研究为设计更具亲和力的人机交互系统提供了基于人格与性别同步性的理论依据与实践指导。

链接: https://arxiv.org/abs/2602.10535
作者: Kai Alexander Hackney,Lucas Guarenti Zangari,Jhonathan Sora-Cardenas,Emmanuel Munoz,Sterling R. Kalogeras,Betsy DiSalvo,Pedro Guillermo Feijoo-Garcia
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:To foster effective human-agent interactions, designers need to identify characteristics that could affect how agents are perceived and accepted, and to what extent they could impact rapport-building. Aiming to explore the role of user-agent synchrony, we assessed 388 participants to determine whether they could perceive personality traits from four artificial voices we selected and adapted from human samples, considering gender (male or female) and personality (introvert or extrovert) as grouping factors. Our findings suggest that participants were able to significantly differentiate female agents by personality, while male agents were not consistently distinguished. We also observed evidence of personality synchrony, where participants tended to perceive the first agent as more similar to their own personality, with this effect driven mainly by male participants, especially toward male agents. This paper contributes findings and insights to consider the interplay of user-agent personality and gender synchrony in the design of human-agent interactions.

[HC-14] Division of Labor and Collaboration Between Parents in Family Education

【速读】:该论文旨在解决家庭中作业辅导(homework tutoring)这一照护劳动(care work)所面临的认知与情感负担问题,尤其关注父母在分工与协调过程中缺乏针对性支持的现状。研究表明,作业辅导包含物理、认知和情感三个维度,其中后两者常被忽视;同时,父亲-母亲-孩子三方互动关系中的儿童反馈是影响父母调整分工的核心因素。解决方案的关键在于:基于人机交互(HCI)研究,提出一种以维护亲子及夫妻关系为核心目标的AI设计框架,而非单纯追求任务自动化或广泛减轻劳动负担,从而推动更具情境敏感性的共同育儿(coparenting)实践发展。

链接: https://arxiv.org/abs/2602.10501
作者: Ziyi Wang,Congrong Zhang,Jingying Deng,Xiaofan Hu,Jie Cai,Nan Gao,Chun Yu,Haining Zhang
机构: Beijing University of Civil Engineering and Architecture (北京建筑大学); University of Queensland (昆士兰大学); New York University (纽约大学); Hangzhou Dianzi University (杭州电子科技大学); Tsinghua University (清华大学); Nankai University (南开大学)
类目: Human-Computer Interaction (cs.HC)
备注: 16 pages. To appear in Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI 2026), Barcelona, Spain, April 13-17, 2026

点击查看摘要

Abstract:Homework tutoring work is a demanding and often conflict-prone practice in family life, and parents often lack targeted support for managing its cognitive and emotional burdens. Through interviews with 18 parents of children in grades 1-3, we examine how homework-related labor is divided and coordinated between parents, and where AI might meaningfully intervene. We found three key insights: (1) Homework labor encompasses distinct dimensions: physical, cognitive, and emotional, with the latter two often remaining invisible. (2) We identified father-mother-child triadic dynamics in labor division, with children’s feedback as the primary factor shaping parental labor adjustments. (3) Building on prior HCI research, we propose an AI design that prioritizes relationship maintenance over task automation or broad labor mitigation. By employing labor as a lens that integrates care work, we explore the complexities of labor within family contexts, contributing to feminist and care-oriented HCI and to the development of context-sensitive coparenting practices.

[HC-15] Why Human Guidance Matters in Collaborative Vibe Coding

【速读】:该论文旨在解决“vibe coding”(即通过自然语言指令生成代码的新型编程范式)在实际应用中对生产力和协作效率的影响尚不明确的问题,尤其是人类与AI在协同编码过程中的角色分工及其效果差异。其解决方案的关键在于构建了一个受控实验框架,系统比较了由人类主导、AI主导以及人机混合驱动的编码团队表现,发现:人类在提供高层次、迭代优化的指令方面具有不可替代的优势,而单纯依赖AI生成指令会导致性能显著下降;最优的混合系统设计是让人类保持方向性控制(制定指令),同时将执行与评估任务交由AI完成,从而实现效率与质量的协同提升。

链接: https://arxiv.org/abs/2602.10473
作者: Haoyu Hu,Raja Marjieh,Katherine M Collins,Chenyi Li,Thomas L. Griffiths,Ilia Sucholutsky,Nori Jacoby
机构: Cornell University(康奈尔大学); Princeton University(普林斯顿大学); Massachusetts Institute of Technology(麻省理工学院); New York University(纽约大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Writing code has been one of the most transformative ways for human societies to translate abstract ideas into tangible technologies. Modern AI is transforming this process by enabling experts and non-experts alike to generate code without actually writing code, but instead, through natural language instructions, or “vibe coding”. While increasingly popular, the cumulative impact of vibe coding on productivity and collaboration, as well as the role of humans in this process, remains unclear. Here, we introduce a controlled experimental framework for studying collaborative vibe coding and use it to compare human-led, AI-led, and hybrid groups. Across 16 experiments involving 604 human participants, we show that people provide uniquely effective high-level instructions for vibe coding across iterations, whereas AI-provided instructions often result in performance collapse. We further demonstrate that hybrid systems perform best when humans retain directional control (providing the instructions), while evaluation is delegated to AI.

[HC-16] Exploring the Feasibility of Full-Body Muscle Activation Sensing with Insole Pressure Sensors

【速读】:该论文旨在解决移动健康领域中肌肉激活(muscle activation)感知尚未被充分探索的问题,尤其是传统方法如表面肌电图(surface electromyography)依赖专用电极、不适用于长期监测的局限性。解决方案的关键在于提出Press2Muscle系统,首次利用鞋垫压力传感器无感推断全身肌肉激活状态,其核心思想是分析由肌肉激活驱动运动所引起的足底压力变化;为应对不同用户步态、体重和运动风格带来的压力信号差异,系统采用数据驱动方法动态调整对足部不同区域的依赖度,并融合易获取的生物特征数据以提升对未见用户的泛化能力。

链接: https://arxiv.org/abs/2602.10442
作者: Hao Zhou,Mahanth Gowda
机构: The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Muscle activation initiates contractions that drive human movement, and understanding it provides valuable insights for injury prevention and rehabilitation. Yet, sensing muscle activation is barely explored in the rapidly growing mobile health market. Traditional methods for muscle activation sensing rely on specialized electrodes, such as surface electromyography, making them impractical, especially for long-term usage. In this paper, we introduce Press2Muscle, the first system to unobtrusively infer muscle activation using insole pressure sensors. The key idea is to analyze foot pressure changes resulting from full-body muscle activation that drives movements. To handle variations in pressure signals due to differences in users’ gait, weight, and movement styles, we propose a data-driven approach to dynamically adjust reliance on different foot regions and incorporate easily accessible biographical data to enhance Press2Muscle’s generalization to unseen users. We conducted an extensive study with 30 users. Under a leave-one-user-out setting, Press2Muscle achieves a root mean square error of 0.025, marking a 19% improvement over a video-based counterpart. A robustness study validates Press2Muscle’s ability to generalize across user demographics, footwear types, and walking surfaces. Additionally, we showcase muscle imbalance detection and muscle activation estimation under free-living settings with Press2Muscle, confirming the feasibility of muscle activation sensing using insole pressure sensors in real-world settings.

[HC-17] owards Affordable Non-Invasive Real-Time Hypoglycemia Detection Using Wearable Sensor Signals

【速读】:该论文旨在解决糖尿病管理中无创低血糖(hypoglycemia)检测的难题,尤其是在连续葡萄糖监测(CGM)设备昂贵或难以获取的地区。其核心解决方案是构建一个基于可穿戴传感器信号的多模态生理学框架,通过融合皮肤电活动(galvanic skin response, GSR)与心率(heart rate, HR)信号,结合端到端的深度学习模型(如CNN、LSTM、GRU和TCN),实现高敏感性和稳定性的低血糖预警。关键在于利用两种生理信号的互补性,在不依赖侵入式传感器的前提下显著提升检测性能,尤其在召回率(recall)这一临床最紧迫指标上表现最优,从而为资源匮乏环境中的低成本血糖监测提供了可行路径。

链接: https://arxiv.org/abs/2602.10407
作者: Lawrence Obiuwevwi,Krzysztof J. Rechowicz,Vikas Ashok,Sampath Jayarathna
机构: 未知
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurately detecting hypoglycemia without invasive glucose sensors remains a critical challenge in diabetes management, particularly in regions where continuous glucose monitoring (CGM) is prohibitively expensive or clinically inaccessible. This extended study introduces a comprehensive, multimodal physiological framework for non-invasive hypoglycemia detection using wearable sensor signals. Unlike prior work limited to single-signal analysis, this chapter evaluates three physiological modalities, galvanic skin response (GSR), heart rate (HR), and their combined fusion, using the OhioT1DM 2018 dataset. We develop an end-to-end pipeline that integrates advanced preprocessing, temporal windowing, handcrafted and sequence-based feature extraction, early and late fusion strategies, and a broad spectrum of machine learning and deep temporal models, including CNNs, LSTMs, GRUs, and TCNs. Our results demonstrate that physiological signals exhibit distinct autonomic patterns preceding hypoglycemia and that combining GSR with HR consistently enhances detection sensitivity and stability compared to single-signal models. Multimodal deep learning architectures achieve the most reliable performance, particularly in recall, the most clinically urgent metric. Ablation studies further highlight the complementary contributions of each modality, strengthening the case for affordable, sensor-based glycemic monitoring. The findings show that real-time hypoglycemia detection is achievable using only inexpensive, non-invasive wearable sensors, offering a pathway toward accessible glucose monitoring in underserved communities and low-resource healthcare environments.

[HC-18] Disability-First AI Dataset Annotation: Co-designing Stuttered Speech Annotation Guidelines with People Who Stutter

【速读】:该论文旨在解决当前AI数据集在无障碍(Accessibility)标注中因缺乏残障群体特定专业知识而导致的标签不一致或不准确问题,尤其聚焦于言语障碍者(如口吃人群,People Who Stutter, PWS)的语音数据标注挑战。其解决方案的关键在于将PWS的具身知识(embodied knowledge)融入标注流程,通过与PWS及领域专家的访谈和共同设计工作坊,开发出能反映真实多样性体验的标注实践,从而提升数据质量并揭示静态标签在应对复杂残疾经验时的局限性。

链接: https://arxiv.org/abs/2602.10403
作者: Xinru Tang,Jingjin Li,Shaomei Wu
机构: University of California, Irvine (加州大学欧文分校); AImpower.org (AImpower.org)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Despite efforts to increase the representation of disabled people in AI datasets, accessibility datasets are often annotated by crowdworkers without disability-specific expertise, leading to inconsistent or inaccurate labels. This paper examines these annotation challenges through a case study of annotating speech data from people who stutter (PWS). Given the variability of stuttering and differing views on how it manifests, annotating and transcribing stuttered speech remains difficult, even for trained professionals. Through interviews and co-design workshops with PWS and domain experts, we identify challenges in stuttered speech annotation and develop practices that integrate the lived experiences of PWS into the annotation process. Our findings highlight the value of embodied knowledge in improving dataset quality, while revealing tensions between the complexity of disability experiences and the rigidity of static labels. We conclude with implications for disability-first and multiplicity-aware approaches to data interpretation across the AI pipeline.

[HC-19] Discovering Differences in Strategic Behavior Between Humans and LLM s

【速读】:该论文旨在解决当前行为博弈论(Behavioral Game Theory, BGT)模型难以充分刻画人类与大型语言模型(Large Language Models, LLMs)等非人类代理在战略互动中异质行为的问题。其解决方案的关键在于使用AlphaEvolve这一前沿程序发现工具,从数据中直接自动推导出可解释的行为模型,从而实现对驱动人类与LLM行为差异的结构性因素的开放性探索。通过在重复剪刀石头布博弈中的实证分析,研究发现前沿LLMs可能具备比人类更深的战略推理能力,为理解两者在战略交互中的结构差异提供了新范式。

链接: https://arxiv.org/abs/2602.10324
作者: Caroline Wang,Daniel Kasenberg,Kim Stachenfeld,Pablo Samuel Castro
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Google DeepMind
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.

[HC-20] Actions Speak Louder Than Chats: Investigating AI Chatbot Age Gating

【速读】:该论文旨在解决当前消费级聊天机器人(chatbot)在儿童和青少年隐私与安全保护方面的漏洞问题,特别是其年龄识别能力与实际防护措施之间的脱节。研究发现,尽管聊天机器人能够基于对话内容准确估算用户年龄,但即使识别出未成年人,也未采取任何实质性防护行动,违背了其自身隐私政策中关于禁止儿童访问的规定。解决方案的关键在于构建一套系统性的审计框架,通过自动化交互和包含显性和隐性年龄线索的提示库对主流聊天机器人进行大规模测试,并据此提出一个可验证的年龄过滤(age gating)实现原型,为平台设计优化和监管政策制定提供实证依据。

链接: https://arxiv.org/abs/2602.10251
作者: Olivia Figueira,Pranathi Chamarthi,Tu Le,Athina Markopoulou
机构: University of California, Irvine (加州大学欧文分校); The University of Alabama (阿拉巴马大学)
类目: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:AI chatbots are widely used by children and teens today, but they pose significant risks to youth’s privacy and safety due to both increasingly personal conversations and potential exposure to unsafe content. While children under 13 are protected by the Children’s Online Privacy Protection Act (COPPA), chatbot providers’ own privacy policies may also provide protections, since they typically prohibit children from accessing their platforms. Age gating is often employed to restrict children online, but chatbot age gating in particular has not been studied. In this paper, we investigate whether popular consumer chatbots are (i) able to estimate users’ ages based solely on their conversations, and (ii) whether they take action upon identifying children. To that end, we develop an auditing framework in which we programmatically interact with chatbots and conduct 1050 experiments using our comprehensive library of age-indicative prompts, including implicit and explicit age disclosures, to analyze the chatbots’ responses and actions. We find that while chatbots are capable of estimating age, they do not take any action when children are identified, contradicting their own policies. Our methodology and findings provide insights for platform design, demonstrated by our proof-of-concept chatbot age gating implementation, and regulation to protect children online.

[HC-21] Investigating the Effects of Eco-Friendly Service Options on Rebound Behavior in Ride-Hailing

【速读】:该论文试图解决的问题是:环保服务选项(Eco-friendly Service Options, EFSOs)在减少个人碳排放方面的有效性可能因“反弹效应”(rebound effect)而被削弱,即用户在感知到环保行为后反而增加消费,从而抵消原本的减排效益。尤其是在人机交互(HCI)领域,现有研究尚未充分探讨常见环保反馈方式如何塑造此类反弹效应。论文通过一项在线被试内实验(N=75)在网约车场景中验证了五种EFSO变体(无EFSO、最小化EFSO、CO₂等效量化、游戏化和社交型)对用户选择步行或网约车的影响。关键发现是:缺乏明确环保反馈指标的EFSO反而增加了网约车使用率,且定性结果表明,EFSO可能使以便利为导向的选择变得更具正当性。因此,解决方案的关键在于设计能够主动识别并缓解反弹效应的EFSO机制,从而提升其实际环境效益。

链接: https://arxiv.org/abs/2602.10237
作者: Albin Zeqiri,Michael Rietzler,Enrico Rukzio
机构: Ulm University (乌尔姆大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Eco-friendly service options (EFSOs) aim to reduce personal carbon emissions, yet their eco-friendly framing may permit increased consumption, weakening their intended impact. Such rebound effects remain underexamined in HCI, including how common eco-feedback approaches shape them. We investigate this in an online within-subjects experiment (N=75) in a ride-hailing context. Participants completed 10 trials for five conditions (No EFSO, EFSO - Minimal, EFSO - CO2 Equivalency, EFSO - Gamified, EFSO - Social), yielding 50 choices between walking and ride-hailing for trips ranging from 0.5mi - 2.0mi (0.80km - 3.22km). We measured how different EFSO variants affected ride-hailing uptake relative to a No EFSO baseline. EFSOs lacking explicit eco-feedback metrics increased ride-hailing uptake, and qualitative responses indicate that EFSOs can make convenience-driven choices more permissible. We conclude with implications for designing EFSOs that begin to take rebound effects into account.

[HC-22] Reimagining Sign Language Technologies: Analyzing Translation Work of Chinese Deaf Online Content Creators

【速读】:该论文试图解决当前手语翻译系统在设计与应用中对聋人群体沟通复杂性理解不足的问题,尤其关注技术方案可能因简化或误译而削弱手语文化的丰富性和多样性。其解决方案的关键在于以聋人主导的翻译实践为基础,引入社会语言学中的(trans)languaging概念,重新构想手语翻译系统的架构,强调多语言、跨文化语境下的意义建构能力,以及创作者在翻译过程中对政治性与文化敏感性的主动协商,从而推动更具包容性与文化适切性的技术设计。

链接: https://arxiv.org/abs/2602.10235
作者: Xinru Tang,Anne Marie Piper
机构: University of California, Irvine (加州大学欧文分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:While sign language translation systems promise to enhance deaf people’s access to information and communication, they have been met with strong skepticism from deaf communities due to risks of misrepresenting and oversimplifying the richness of signed communication in technologies. This article provides empirical evidence of the complexity of translation work involved in deaf communication through interviews with 13 deaf Chinese content creators who actively produce and share sign language content on video sharing platforms with both deaf and hearing audiences. By studying this unique group of content creators, our findings highlight the nuances of sign language translation, showing how deaf creators create content with multilingualism and multiculturalism in mind, support meaning making across languages and cultures, and navigate politics involved in their translation work. Grounded in these deaf-led translation practices, we draw on the sociolinguistic concept of (trans)languaging to re-conceptualize and reimagine the design of sign language translation systems.

[HC-23] Understanding the Effects of AI-Assisted Critical Thinking on Human-AI Decision Making

【速读】:该论文旨在解决人机协同决策中人类决策表现不佳的问题,其核心原因在于人类对自身决策逻辑的反思不足。解决方案的关键在于提出AI辅助批判性思维(AI-Assisted Critical Thinking, AACT)框架,该框架通过领域特定的AI模型对人类决策进行反事实分析,识别决策论证中的潜在缺陷,并支持修正,从而提升决策质量。

链接: https://arxiv.org/abs/2602.10222
作者: Harry Yizhou Tian,Hasan Amin,Ming Yin
机构: Purdue University (普渡大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Despite the growing prevalence of human-AI decision making, the human-AI team’s decision performance often remains suboptimal, partially due to insufficient examination of humans’ own reasoning. In this paper, we explore designing AI systems that directly analyze humans’ decision rationales and encourage critical reflection of their own decisions. We introduce the AI-Assisted Critical Thinking (AACT) framework, which leverages a domain-specific AI model’s counterfactual analysis of human decision to help decision-makers identify potential flaws in their decision argument and support the correction of them. Through a case study on house price prediction, we find that AACT outperforms traditional AI-based decision-support in reducing over-reliance on AI, though also triggering higher cognitive load. Subgroup analysis reveals AACT can be particularly beneficial for some decision-makers such as those very familiar with AI technologies. We conclude by discussing the practical implications of our findings, use cases and design choices of AACT, and considerations for using AI to facilitate critical thinking.

[HC-24] ENIGMA: EEG-to-Image in 15 Minutes Using Less Than 1% of the Parameters

【速读】:该论文旨在解决脑机接口(Brain-Computer Interface, BCI)在实际应用中的三大关键瓶颈:模型部署的便捷性与快速适应新受试者的能力、在低成本扫描设备上的有效性,以及在本地计算资源上的轻量化运行能力。为应对这些挑战,作者提出ENIGMA——一种多受试者电生理信号(EEG)到图像解码模型,其核心创新在于采用统一的时空骨干网络(subject-unified spatio-temporal backbone)结合多受试者潜在对齐层(multi-subject latent alignment layers)和MLP投影器,将原始EEG信号映射至丰富的视觉潜在空间。该架构显著减少了可训练参数(<1%于先前方法),并在仅需15分钟数据的情况下即可有效微调新受试者,同时在研究级(THINGS-EEG2)与消费级(AllJoined-1.6M)EEG基准上均达到当前最优性能,为实用化BCI系统提供了坚实基础。

链接: https://arxiv.org/abs/2602.10361
作者: Reese Kneeland,Wangshu Jiang,Ugo Bruzadin Nunes,Paul Steven Scotti,Arnaud Delorme,Jonathan Xu
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:To be practical for real-life applications, models for brain-computer interfaces must be easily and quickly deployable on new subjects, effective on affordable scanning hardware, and small enough to run locally on accessible computing resources. To directly address these current limitations, we introduce ENIGMA, a multi-subject electroencephalography (EEG)-to-Image decoding model that reconstructs seen images from EEG recordings and achieves state-of-the-art (SOTA) performance on the research-grade THINGS-EEG2 and consumer-grade AllJoined-1.6M benchmarks, while fine-tuning effectively on new subjects with as little as 15 minutes of data. ENIGMA boasts a simpler architecture and requires less than 1% of the trainable parameters necessary for previous approaches. Our approach integrates a subject-unified spatio-temporal backbone along with a set of multi-subject latent alignment layers and an MLP projector to map raw EEG signals to a rich visual latent space. We evaluate our approach using a broad suite of image reconstruction metrics that have been standardized in the adjacent field of fMRI-to-Image research, and we describe the first EEG-to-Image study to conduct extensive behavioral evaluations of our reconstructions using human raters. Our simple and robust architecture provides a significant performance boost across both research-grade and consumer-grade EEG hardware, and a substantial improvement in fine-tuning efficiency and inference cost. Finally, we provide extensive ablations to determine the architectural choices most responsible for our performance gains in both single and multi-subject cases across multiple benchmark datasets. Collectively, our work provides a substantial step towards the development of practical brain-computer interface applications.

计算机视觉

[CV-0] SurfPhase: 3D Interfacial Dynamics in Two-Phase Flows from Sparse Videos

【速读】:该论文旨在解决两相流中界面动力学(interfacial dynamics)的三维重建难题,尤其是针对移动界面附近难以实验测量的问题。传统方法在界面附近存在固有局限性,而现有神经渲染技术主要适用于边界模糊的单相流,无法处理具有尖锐、可变形特性的液-气界面。解决方案的关键在于提出SurfPhase模型:通过将动态高斯表面元(dynamic Gaussian surfels)与符号距离函数(signed distance function, SDF)相结合,确保几何一致性;同时引入视频扩散模型(video diffusion model)以合成新视角视频,从而利用稀疏相机视角实现高质量的界面动态重建和速度估计。

链接: https://arxiv.org/abs/2602.11154
作者: Yue Gao,Hong-Xing Yu,Sanghyeon Chang,Qianxi Fu,Bo Zhu,Yoonjin Won,Juan Carlos Niebles,Jiajun Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The first two authors contributed equally. Project website: this https URL

点击查看摘要

Abstract:Interfacial dynamics in two-phase flows govern momentum, heat, and mass transfer, yet remain difficult to measure experimentally. Classical techniques face intrinsic limitations near moving interfaces, while existing neural rendering methods target single-phase flows with diffuse boundaries and cannot handle sharp, deformable liquid-vapor interfaces. We propose SurfPhase, a novel model for reconstructing 3D interfacial dynamics from sparse camera views. Our approach integrates dynamic Gaussian surfels with a signed distance function formulation for geometric consistency, and leverages a video diffusion model to synthesize novel-view videos to refine reconstruction from sparse observations. We evaluate on a new dataset of high-speed pool boiling videos, demonstrating high-quality view synthesis and velocity estimation from only two camera views. Project website: this https URL.

[CV-1] Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

【速读】:该论文旨在解决扩散模型(Diffusion Models)与流匹配模型(Flow-Matching Models)在偏好优化(Preference Optimization)过程中,因依赖视觉语言模型(Vision-Language Models, VLMs)作为奖励函数而带来的计算成本高、内存消耗大以及像素空间奖励与潜在空间生成器之间存在域不匹配(Domain Mismatch)的问题。解决方案的关键在于提出一种扩散原生的潜在奖励模型(Diffusion-Native Latent Reward Model, DiNa-LRM),其核心创新是将偏好学习直接建模于带噪声的扩散状态上,并引入一种噪声校准的Thurstone似然函数,以显式考虑扩散噪声依赖的不确定性;同时利用预训练的潜在扩散主干网络和时间步条件化的奖励头,支持推理时的噪声集成(Inference-Time Noise Ensembling),从而实现测试时扩展(Test-Time Scaling)和鲁棒奖励机制,显著提升对齐效率与性能,且计算开销远低于VLM基线。

链接: https://arxiv.org/abs/2602.11146
作者: Gongye Liu,Bo Yang,Yida Zhi,Zhizhou Zhong,Lei Ke,Didan Deng,Han Gao,Yongxiang Huang,Kaihao Zhang,Hongbo Fu,Wenhan Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.

[CV-2] GENIUS: Generative Fluid Intelligence Evaluation Suite

【速读】:该论文旨在解决当前统一多模态模型(Unified Multimodal Models, UMMs)在评估中过度侧重于晶体智力(Crystallized Intelligence)——即对已有知识和模式的回忆与调用——而忽视了生成流体智能(Generative Fluid Intelligence, GFI)的问题。GFI 是指模型在无先验知识的情况下,通过即时推理、适应新情境并生成新颖内容的能力。为系统评估这一能力,作者提出 GENIUS(GENERATIVE Fluid Intelligence Evaluation Suite),将其形式化为三项核心能力:隐式模式诱导(Inducing Implicit Patterns)、临时约束执行(Executing Ad-hoc Constraints)和情境知识适配(Adapting to Contextual Knowledge)。关键解决方案在于引入一种无需训练的注意力干预策略,诊断出性能缺陷源于上下文理解不足而非生成能力本身,并通过该策略显著提升模型在 GFI 任务上的表现,从而推动模型从静态知识利用向动态通用推理演进。

链接: https://arxiv.org/abs/2602.11144
作者: Ruichuan An,Sihan Yang,Ziyu Guo,Wei Dai,Zijun Shen,Haodong Li,Renrui Zhang,Xinyu Wei,Guopeng Li,Wenshan Wu,Wentao Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess \textitCrystallized Intelligence , which relies on recalling accumulated knowledge and learned schemas. This focus overlooks \textitGenerative Fluid Intelligence (GFI) : the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce \textbfGENIUS ( \textbfGEN Fluid \textbfI ntelligence Eval \textbfU ation \textbfS uite). We formalize \textitGFI as a synthesis of three primitives. These include \textitInducing Implicit Patterns (e.g., inferring personalized visual preferences), \textitExecuting Ad-hoc Constraints (e.g., visualizing abstract metaphors), and \textitAdapting to Contextual Knowledge (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, \textbfGENIUS establishes a rigorous standard for \textitGFI , guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: \hrefthis https URLthis https URL .

[CV-3] From Circuits to Dynamics: Understanding and Stabilizing Failure in 3D Diffusion Transformers

【速读】:该论文旨在解决生成式 3D 重建中稀疏点云(sparse point cloud)条件下的表面补全(surface completion)问题,其核心挑战在于当前最先进的 3D 扩散变换器(3D diffusion transformers)存在一种灾难性失效模式——称为 Meltdown:输入点云的微小表面扰动即可导致输出结果分裂为多个不连通的片段。解决方案的关键在于通过机制解释(mechanistic interpretability)中的激活修补(activation-patching)技术,定位到该失效现象源于一个早期去噪交叉注意力(cross-attention)模块的异常行为;进一步发现该模块的奇异值谱熵(spectral entropy)可作为稳定性的标量代理指标,其升高对应于扩散逆过程中的对称性破缺分岔(symmetry-breaking bifurcation)。基于此洞察,作者提出 PowerRemap 方法,在测试时动态调整条件信号强度以稳定扩散轨迹,从而将此类断裂失效的发生率降低至最高 98.3%,实现了从电路级机制到扩散动力学行为的跨层级理解与干预。

链接: https://arxiv.org/abs/2602.11130
作者: Maximilian Plattner,Fabian Paischer,Johannes Brandstetter,Arturs Berzins
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable surface completion from sparse point clouds underpins many applications spanning content creation and robotics. While 3D diffusion transformers attain state-of-the-art results on this task, we uncover that they exhibit a catastrophic mode of failure: arbitrarily small on-surface perturbations to the input point cloud can fracture the output into multiple disconnected pieces – a phenomenon we call Meltdown. Using activation-patching from mechanistic interpretability, we localize Meltdown to a single early denoising cross-attention activation. We find that the singular-value spectrum of this activation provides a scalar proxy: its spectral entropy rises when fragmentation occurs and returns to baseline when patched. Interpreted through diffusion dynamics, we show that this proxy tracks a symmetry-breaking bifurcation of the reverse process. Guided by this insight, we introduce PowerRemap, a test-time control that stabilizes sparse point-cloud conditioning. We demonstrate that Meltdown persists across state-of-the-art architectures (WaLa, Make-a-Shape), datasets (GSO, SimJEB) and denoising strategies (DDPM, DDIM), and that PowerRemap effectively counters this failure with stabilization rates of up to 98.3%. Overall, this work is a case study on how diffusion model behavior can be understood and guided based on mechanistic analysis, linking a circuit-level cross-attention mechanism to diffusion-dynamics accounts of trajectory bifurcations.

[CV-4] PhyCritic: Multimodal Critic Models for Physical AI

【速读】:该论文旨在解决当前多模态评判模型(judge and critic models)在物理AI任务中表现不足的问题,尤其是这些任务涉及感知、因果推理和规划等能力时,现有模型主要基于通用视觉领域(如图像描述或视觉问答)训练,缺乏对物理场景的针对性优化。解决方案的关键在于提出PhyCritic,一个专为物理AI设计的多模态评判模型,其核心创新是采用两阶段强化学习与视觉反馈(Reinforcement Learning with Visual Reward, RLVR)流程:第一阶段通过物理技能预热(physical skill warmup)增强模型的物理感知与推理能力;第二阶段引入自指式评判微调(self-referential critic finetuning),使评判模型在判断候选响应前先生成自身预测作为内部参考,从而提升判断的稳定性和物理合理性。该方法在物理和通用多模态评判基准上均显著优于开源基线,并在作为策略模型使用时进一步提升了物理任务中的感知与推理性能。

链接: https://arxiv.org/abs/2602.11124
作者: Tianyi Xiong,Shihao Wang,Guilin Liu,Yi Dong,Ming Li,Heng Huang,Jan Kautz,Zhiding Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.

[CV-5] HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion

【速读】:该论文旨在解决现有方法在生成人体图像动画时难以实现真实且富有表现力的头发动态问题,尤其是缺乏对头发运动的精细控制,导致动画僵硬、不自然。解决方案的关键在于提出HairWeaver,一个基于扩散模型的流水线,其核心创新是引入两个轻量级模块:Motion-Context-LoRA用于整合运动条件以精确控制头发动作,Sim2Real-Domain-LoRA用于在不同数据域间保持主体的物理真实外观;这两个模块协同引导视频扩散主干网络,在保留其生成能力的同时实现高质量头发动画的生成。

链接: https://arxiv.org/abs/2602.11117
作者: Di Chang,Ji Hou,Aljaz Bozic,Assaf Neuberger,Felix Juefei-Xu,Olivier Maury,Gene Wei-Chin Lin,Tuur Stuyck,Doug Roble,Mohammad Soleymani,Stephane Grabli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Website: this https URL

点击查看摘要

Abstract:We present HairWeaver, a diffusion-based pipeline that animates a single human image with realistic and expressive hair dynamics. While existing methods successfully control body pose, they lack specific control over hair, and as a result, fail to capture the intricate hair motions, resulting in stiff and unrealistic animations. HairWeaver overcomes this limitation using two specialized modules: a Motion-Context-LoRA to integrate motion conditions and a Sim2Real-Domain-LoRA to preserve the subject’s photoreal appearance across different data domains. These lightweight components are designed to guide a video diffusion backbone while maintaining its core generative capabilities. By training on a specialized dataset of dynamic human motion generated from a CG simulator, HairWeaver affords fine control over hair motion and ultimately learns to produce highly realistic hair that responds naturally to movement. Comprehensive evaluations demonstrate that our approach sets a new state of the art, producing lifelike human hair animations with dynamic details.

[CV-6] FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference ICLR

【速读】:该论文旨在解决流匹配(Flow-matching)模型在图像和视频生成中因固有的逐步去噪过程导致的推理速度慢的问题。现有加速方法如知识蒸馏、轨迹截断和一致性策略等均为静态策略,需重新训练且泛化能力差。解决方案的关键在于提出一种即插即用的自适应推理框架 FastFlow:它通过识别对去噪路径贡献微小的步骤,并利用前序预测的有限差分速度估计来外推未来状态,从而在不增加计算成本的前提下跳过中间步骤;同时将跳步决策建模为多臂赌博机(multi-armed bandit)问题,使模型能动态学习最优跳步策略,在保证生成质量的同时实现显著提速。

链接: https://arxiv.org/abs/2602.11105
作者: Divya Jyoti Bajpai,Dhruv Bhardwaj,Soumya Roy,Tejas Duseja,Harsh Agarwal,Aashay Sandansing,Manjesh Kumar Hanawal
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校); Amazon (亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at International Conference on Learning Representations (ICLR) 2026

点击查看摘要

Abstract:Flow-matching models deliver state-of-the-art fidelity in image and video generation, but the inherent sequential denoising process renders them slower. Existing acceleration methods like distillation, trajectory truncation, and consistency approaches are static, require retraining, and often fail to generalize across tasks. We propose FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow matching models. FastFlow identifies denoising steps that produce only minor adjustments to the denoising path and approximates them without using the full neural network models used for velocity predictions. The approximation utilizes finite-difference velocity estimates from prior predictions to efficiently extrapolate future states, enabling faster advancements along the denoising path at zero compute cost. This enables skipping computation at intermediary steps. We model the decision of how many steps to safely skip before requiring a full model computation as a multi-armed bandit problem. The bandit learns the optimal skips to balance speed with performance. FastFlow integrates seamlessly with existing pipelines and generalizes across image generation, video generation, and editing tasks. Experiments demonstrate a speedup of over 2.6x while maintaining high-quality outputs. The source code for this work can be found at this https URL.

[CV-7] First International StepUP Competition for Biometric Footstep Recognition: Methods Results and Remaining Challenges

【速读】:该论文旨在解决生物特征步态识别(biometric footstep recognition)中因缺乏大规模、多样化数据集而导致的模型泛化能力不足和对鞋类变化、行走速度等因子敏感的问题。解决方案的关键在于利用最新发布的UNB StepUP-P150数据集——目前最大且最全面的高分辨率步态压力记录集合,并通过首届国际StepUP竞赛推动深度学习模型在受限参考数据条件下实现鲁棒的验证性能评估。比赛中表现最优的团队Saeid_UCC采用生成式奖励机器(generative reward machine, GRM)优化策略,取得了10.77%的最低等错误率(equal error rate, EER),凸显了先进优化方法在提升模型适应性和稳定性方面的潜力,但同时也揭示了当前模型在陌生鞋类场景下仍存在显著挑战,为后续研究指明方向。

链接: https://arxiv.org/abs/2602.11086
作者: Robyn Larracy,Eve MacDonald,Angkoon Phinyomark,Saeid Rezaei,Mahdi Laghaei,Ali Hajighasem,Aaron Tabor,Erik Scheme
机构: University of New Brunswick, Canada (新不伦瑞克大学); University College Cork, Ireland (科克大学); Islamic Azad University, Iran (伊朗伊斯兰阿扎德大学); University of New South Whales, Australia (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: to be published in 2025 IEEE International Joint Conference on Biometrics (IJCB)

点击查看摘要

Abstract:Biometric footstep recognition, based on a person’s unique pressure patterns under their feet during walking, is an emerging field with growing applications in security and safety. However, progress in this area has been limited by the lack of large, diverse datasets necessary to address critical challenges such as generalization to new users and robustness to shifts in factors like footwear or walking speed. The recent release of the UNB StepUP-P150 dataset, the largest and most comprehensive collection of high-resolution footstep pressure recordings to date, opens new opportunities for addressing these challenges through deep learning. To mark this milestone, the First International StepUP Competition for Biometric Footstep Recognition was launched. Competitors were tasked with developing robust recognition models using the StepUP-P150 dataset that were then evaluated on a separate, dedicated test set designed to assess verification performance under challenging variations, given limited and relatively homogeneous reference data. The competition attracted global participation, with 23 registered teams from academia and industry. The top-performing team, Saeid_UCC, achieved the best equal error rate (EER) of 10.77% using a generative reward machine (GRM) optimization strategy. Overall, the competition showcased strong solutions, but persistent challenges in generalizing to unfamiliar footwear highlight a critical area for future work.

[CV-8] PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation ECAI2025

【速读】:该论文旨在解决自监督单目深度估计中计算效率与结构细节保留之间的矛盾问题(computational efficiency vs. detail preservation)。现有方法要么因模型庞大而难以部署,要么因轻量化设计牺牲了结构精度。其解决方案的关键在于提出一个三阶段轻量级框架 PuriLight,包含三个创新模块:用于局部特征提取的shuffle-dilation 卷积(Shuffle-Dilation Convolution, SDC)模块、用于层次化特征增强的旋转自适应核注意力(Rotation-Adaptive Kernel Attention, RAKA)模块,以及用于全局特征净化的深度频率信号净化(Deep Frequency Signal Purification, DFSP)模块。这些模块协同工作,在显著降低参数量的同时保持高精度的深度估计性能,从而实现高效且结构保真的单目深度估计。

链接: https://arxiv.org/abs/2602.11066
作者: Yujie Chen,Li Zhang,Xiaomeng Chu,Tian Zhang
机构: Hefei University of Technology (合肥工业大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6figures, accepted by European Conference on Artificial Intelligence (ECAI2025)

点击查看摘要

Abstract:We propose PuriLight, a lightweight and efficient framework for self-supervised monocular depth estimation, to address the dual challenges of computational efficiency and detail preservation. While recent advances in self-supervised depth estimation have reduced reliance on ground truth supervision, existing approaches remain constrained by either bulky architectures compromising practicality or lightweight models sacrificing structural precision. These dual limitations underscore the critical need to develop lightweight yet structurally precise architectures. Our framework addresses these limitations through a three-stage architecture incorporating three novel modules: the Shuffle-Dilation Convolution (SDC) module for local feature extraction, the Rotation-Adaptive Kernel Attention (RAKA) module for hierarchical feature enhancement, and the Deep Frequency Signal Purification (DFSP) module for global feature purification. Through effective collaboration, these modules enable PuriLight to achieve both lightweight and accurate feature extraction and processing. Extensive experiments demonstrate that PuriLight achieves state-of-the-art performance with minimal training parameters while maintaining exceptional computational efficiency. Codes will be available at this https URL.

[CV-9] Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting WACV2026

【速读】:该论文旨在解决手术室中高密度场景下手术器械准确计数的问题,这是保障患者安全的关键前提。现有方法在密集排列的器械场景中表现不佳,主要受限于传统基于目标检测的无序推理方式。其解决方案的关键在于提出一种名为Chain-of-Look的新型视觉推理框架,该框架通过构建结构化的视觉链(visual chain)模拟人类逐个计数的顺序过程,而非依赖无序的目标检测;同时引入邻近损失函数(neighboring loss function),显式建模密集排列器械间的空间约束,从而提升视觉链的物理合理性与计数精度。

链接: https://arxiv.org/abs/2602.11024
作者: Rishikesh Bhyri,Brian R Quaranto,Philip J Seger,Kaity Tung,Brendan Fox,Gene Yang,Steven D. Schwaitzberg,Junsong Yuan,Nan Xi,Peter C W Kim
机构: State University of New York at Buffalo (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to WACV 2026. This version includes additional authors who contributed during the rebuttal phase

点击查看摘要

Abstract:Accurate counting of surgical instruments in Operating Rooms (OR) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress of large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for counting (e.g., CountGD, REC) as well as Multimodality Large Language Models (e.g., Qwen, ChatGPT) in the challenging task of dense surgical instrument counting.

[CV-10] ContactGaussian-WM: Learning Physics-Grounded World Model from Videos

【速读】:该论文旨在解决在数据稀缺和复杂接触密集动态环境下,现有世界模型难以准确建模物理交互的问题。其解决方案的关键在于提出ContactGaussian-WM,一个可微分的、基于物理规则的刚体世界模型,通过统一的高斯表示同时建模视觉外观与碰撞几何,并结合端到端可微的学习框架,直接从稀疏且接触丰富的视频中推断物理属性,从而实现对复杂物理规律的有效学习与鲁棒泛化。

链接: https://arxiv.org/abs/2602.11021
作者: Meizhong Wang,Wanxin Jin,Kun Cao,Lihua Xie,Yiguang Hong
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Developing world models that understand complex physical interactions is essential for advancing robotic planning and this http URL, existing methods often struggle to accurately model the environment under conditions of data scarcity and complex contact-rich dynamic this http URL address these challenges, we propose ContactGaussian-WM, a differentiable physics-grounded rigid-body world model capable of learning intricate physical laws directly from sparse and contact-rich video this http URL framework consists of two core components: (1) a unified Gaussian representation for both visual appearance and collision geometry, and (2) an end-to-end differentiable learning framework that differentiates through a closed-form physics engine to infer physical properties from sparse visual this http URL simulations and real-world evaluations demonstrate that ContactGaussian-WM outperforms state-of-the-art methods in learning complex scenarios, exhibiting robust generalization this http URL, we showcase the practical utility of our framework in downstream applications, including data synthesis and real-time MPC.

[CV-11] LaSSM: Efficient Semantic-Spatial Query Decoding via Local Aggregation and State Space Models for 3D Instance Segmentation

【速读】:该论文旨在解决点云场景中基于查询的3D实例分割方法面临的两个核心问题:一是查询初始化困难,由于点云数据稀疏导致初始查询难以覆盖完整场景;二是查询解码器计算复杂度高,现有方法依赖密集的注意力机制造成冗余计算。解决方案的关键在于提出LaSSM(Lightweight and Simple Semantic-Spatial Model),其创新性体现在两方面:首先设计了一种分层语义-空间查询初始化器(hierarchical semantic-spatial query initializer),通过超点(superpoint)结合语义线索与空间分布信息生成高质量初始查询集,实现更全面的场景覆盖并加速收敛;其次引入坐标引导的状态空间模型(coordinate-guided state space model, SSM)解码器,采用局部聚合机制聚焦几何一致区域,并利用空间双路径SSM模块融合坐标信息以捕捉查询间的潜在依赖关系,从而在显著降低计算量(仅需1/3 FLOPs)的前提下提升实例预测效率与精度,在ScanNet++ V2等基准上取得最优性能。

链接: https://arxiv.org/abs/2602.11007
作者: Lei Yao,Yi Wang,Yawen Cui,Moyun Liu,Lap-Pui Chau
机构: The Hong Kong Polytechnic University (香港理工大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE-TCSVT

点击查看摘要

Abstract:Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from the query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer to derive the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to focus on geometrically coherent regions and a spatial dual-path SSM block to capture underlying dependencies within the query set by integrating associated coordinates information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first place on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 FLOPs, demonstrating its superiority in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on ScanNet, ScanNet200, S3DIS and ScanNet++ V1 benchmarks with less computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at this https URL.

[CV-12] Interpretable Vision Transformers in Monocular Depth Estimation via SVDA CVPR

【速读】:该论文旨在解决单目深度估计(Monocular Depth Estimation)中基于Transformer的自注意力机制缺乏可解释性的问题。现有方法虽在性能上表现优异,但其注意力机制如同“黑箱”,难以揭示模型内部如何组织信息以完成密集预测任务。解决方案的关键在于提出一种受奇异值分解(SVD)启发的注意力机制——SVD-Inspired Attention (SVDA),它通过将一个可学习的对角矩阵嵌入归一化的查询-键交互中,实现方向对齐与频谱调制的解耦,从而生成内在可解释的注意力图,而非依赖事后近似。此设计不仅保持了预测精度,还引入六个谱指标(熵、秩、稀疏性、对齐度、选择性和鲁棒性),揭示了训练过程中跨数据集和深度层级的稳定注意力模式,为透明化密集预测模型提供了理论基础与实践路径。

链接: https://arxiv.org/abs/2602.11005
作者: Vasileios Arampatzakis,George Pavlidis,Nikolaos Mitianoudis,Nikos Papamarkos
机构: Democritus University of Thrace (德谟克利特大学); Athena Research Center (阿瑞斯研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures, submitted to CVPR Conference 2026

点击查看摘要

Abstract:Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.

[CV-13] Enhancing Predictability of Multi-Tenant DNN Inference for Autonomous Vehicles Perception

【速读】:该论文旨在解决自动驾驶车辆(AV)感知流水线中深度神经网络(DNN)实时推理的挑战,即在计算资源受限的情况下实现可预测的感知性能。现有方法主要通过模型压缩(如剪枝和量化)来优化推理速度,但难以保证推理延迟的稳定性。本文提出Predictable Perception system with DNNs (PP-DNN),其核心创新在于:基于环境动态选择关键帧与感兴趣区域(ROI),而非对所有输入图像进行全量处理,从而在不降低多租户DNN准确率的前提下显著减少计算负载。关键机制包括:利用ROI生成器根据帧间相似性和交通场景识别关键帧与ROI,通过FLOPs预测器估算动态处理任务的乘加操作(MACs)数量,结合ROI调度器协调多模型处理,并设计检测预测器处理非关键帧。实验表明,PP-DNN在BDD100K和nuScenes数据集上显著提升了感知可预测性,融合帧数提升7.3倍、融合延迟降低2.6倍、延迟波动减少2.3倍,同时检测完整性提高75.4%,成本效益最高达98%。

链接: https://arxiv.org/abs/2602.11004
作者: Liangkai Liu,Kang G. Shin,Jinkyu Lee,Chengmo Yang,Weisong Shi
机构: 1. University of California, Riverside (加州大学河滨分校); 2. Samsung Electronics (三星电子); 3. Wayne State University (韦恩州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注: 13 pages, 12 figures

点击查看摘要

Abstract:Autonomous vehicles (AVs) rely on sensors and deep neural networks (DNNs) to perceive their surrounding environment and make maneuver decisions in real time. However, achieving real-time DNN inference in the AV’s perception pipeline is challenging due to the large gap between the computation requirement and the AV’s limited resources. Most, if not all, of existing studies focus on optimizing the DNN inference time to achieve faster perception by compressing the DNN model with pruning and quantization. In contrast, we present a Predictable Perception system with DNNs (PP-DNN) that reduce the amount of image data to be processed while maintaining the same level of accuracy for multi-tenant DNNs by dynamically selecting critical frames and regions of interest (ROIs). PP-DNN is based on our key insight that critical frames and ROIs for AVs vary with the AV’s surrounding environment. However, it is challenging to identify and use critical frames and ROIs in multi-tenant DNNs for predictable inference. Given image-frame streams, PP-DNN leverages an ROI generator to identify critical frames and ROIs based on the similarities of consecutive frames and traffic scenarios. PP-DNN then leverages a FLOPs predictor to predict multiply-accumulate operations (MACs) from the dynamic critical frames and ROIs. The ROI scheduler coordinates the processing of critical frames and ROIs with multiple DNN models. Finally, we design a detection predictor for the perception of non-critical frames. We have implemented PP-DNN in an ROS-based AV pipeline and evaluated it with the BDD100K and the nuScenes dataset. PP-DNN is observed to significantly enhance perception predictability, increasing the number of fusion frames by up to 7.3x, reducing the fusion delay by 2.6x and fusion-delay variations by 2.3x, improving detection completeness by 75.4% and the cost-effectiveness by up to 98% over the baseline.

[CV-14] Interpretable Vision Transformers in Image Classification via SVDA

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)中注意力机制缺乏可解释性、结构稀疏性不足以及频谱特性不明确的问题。其核心解决方案是引入受奇异值分解(Singular Value Decomposition, SVD)启发的注意力机制(SVD-Inspired Attention, SVDA),通过几何建模的方式重构注意力计算过程,从而在保持分类准确率的前提下提升注意力模式的可解释性、稀疏性和频谱结构特性。实验表明,SVDA能够在多个基准数据集上生成更具结构性和可解释性的注意力图,为后续可解释人工智能、频谱诊断及基于注意力的模型压缩提供了新的分析工具与研究基础。

链接: https://arxiv.org/abs/2602.10994
作者: Vasileios Arampatzakis,George Pavlidis,Nikolaos Mitianoudis,Nikos Papamarkos
机构: Democritus University of Thrace (德谟克利特大学); Athena Research Center (阿thena研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, submitted to IEEE Access

点击查看摘要

Abstract:Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We apply the use of interpretability indicators – originally proposed with SVDA – to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks – CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 – demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.

[CV-15] DFIC: Towards a balanced facial image dataset for automatic ICAO compliance verification

【速读】:该论文旨在解决机器可读旅行证件(MRTDs)中人脸图像合规性验证的效率问题,即如何在高需求场景下实现对ISO/IEC和ICAO标准的自动化、可靠检测,以替代低效的人工检查。其关键解决方案是提出了DFIC数据集,该数据集包含约58,000张标注人脸图像和2706段视频,覆盖超过1000名受试者的多种非合规情形及合规肖像,并具有更均衡的人口统计分布,尤其一个分区接近均匀分布,从而支持训练更具鲁棒性和适应性的自动化验证模型;同时,作者基于该数据集微调了一种依赖空间注意力机制的新方法,在IATA合规验证任务中优于现有最先进方法,显著提升了自动合规检测性能。

链接: https://arxiv.org/abs/2602.10985
作者: Nuno Gonçalves,Diogo Nunes,Carla Guerra,João Marcos
机构: Institute of Systems and Robotics - University of Coimbra (系统与机器人研究所-科英布拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ensuring compliance with ISO/IEC and ICAO standards for facial images in machine-readable travel documents (MRTDs) is essential for reliable identity verification, but current manual inspection methods are inefficient in high-demand environments. This paper introduces the DFIC dataset, a novel comprehensive facial image dataset comprising around 58,000 annotated images and 2706 videos of more than 1000 subjects, that cover a broad range of non-compliant conditions, in addition to compliant portraits. Our dataset provides a more balanced demographic distribution than the existing public datasets, with one partition that is nearly uniformly distributed, facilitating the development of automated ICAO compliance verification methods. Using DFIC, we fine-tuned a novel method that heavily relies on spatial attention mechanisms for the automatic validation of ICAO compliance requirements, and we have compared it with the state-of-the-art aimed at ICAO compliance verification, demonstrating improved results. DFIC dataset is now made public (this https URL) for the training and validation of new models, offering an unprecedented diversity of faces, that will improve both robustness and adaptability to the intrinsically diverse combinations of faces and props that can be presented to the validation system. These results emphasize the potential of DFIC to enhance automated ICAO compliance methods but it can also be used in many other applications that aim to improve the security, privacy, and fairness of facial recognition systems. Subjects: Computer Vision and Pattern Recognition (cs.CV) ACMclasses: I.4 Cite as: arXiv:2602.10985 [cs.CV] (or arXiv:2602.10985v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.10985 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-16] VFGS-Net: Frequency-Guided State-Space Learning for Topology-Preserving Retinal Vessel Segmentation

【速读】:该论文旨在解决视网膜血管分割中因血管细长形态、尺度变化范围广及对比度低所带来的挑战,尤其是现有方法难以同时保留细小毛细血管并维持全局拓扑连续性的问题。其解决方案的关键在于提出一种端到端的分割框架VFGS-Net,该框架通过三个核心模块实现:一是双路径卷积特征提取模块,联合捕获局部纹理与多尺度上下文语义;二是新颖的血管感知频域通道注意力机制,自适应重加权频谱成分以增强高层特征中的血管响应;三是网络瓶颈处引入基于Mamba2的双向非对称空间建模块,高效捕捉长程空间依赖关系,强化血管结构的全局连通性。这些设计共同提升了模型在复杂分支、细小血管和低对比区域的分割精度,展现出良好的临床应用潜力。

链接: https://arxiv.org/abs/2602.10978
作者: Ruiqi Song,Lei Liu,Ya-Nan Zhang,Chao Wang,Xiaoning Li,Nan Mu
机构: Sichuan Normal University (四川师范大学); Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate retinal vessel segmentation is a critical prerequisite for quantitative analysis of retinal images and computer-aided diagnosis of vascular diseases such as diabetic retinopathy. However, the elongated morphology, wide scale variation, and low contrast of retinal vessels pose significant challenges for existing methods, making it difficult to simultaneously preserve fine capillaries and maintain global topological continuity. To address these challenges, we propose the Vessel-aware Frequency-domain and Global Spatial modeling Network (VFGS-Net), an end-to-end segmentation framework that seamlessly integrates frequency-aware feature enhancement, dual-path convolutional representation learning, and bidirectional asymmetric spatial state-space modeling within a unified architecture. Specifically, VFGS-Net employs a dual-path feature convolution module to jointly capture fine-grained local textures and multi-scale contextual semantics. A novel vessel-aware frequency-domain channel attention mechanism is introduced to adaptively reweight spectral components, thereby enhancing vessel-relevant responses in high-level features. Furthermore, at the network bottleneck, we propose a bidirectional asymmetric Mamba2-based spatial modeling block to efficiently capture long-range spatial dependencies and strengthen the global continuity of vascular structures. Extensive experiments on four publicly available retinal vessel datasets demonstrate that VFGS-Net achieves competitive or superior performance compared to state-of-the-art methods. Notably, our model consistently improves segmentation accuracy for fine vessels, complex branching patterns, and low-contrast regions, highlighting its robustness and clinical potential.

[CV-17] Healthy Harvests: A Comparative Look at Guava Disease Classification Using InceptionV3

【速读】:该论文旨在解决木瓜果实病害早期识别难题,以减少病害对果实品质和产量的影响。其核心问题是通过图像分类技术准确区分三种类别:炭疽病(Anthracnose)、果蝇侵害(Fruit flies)及健康果实。解决方案的关键在于构建高质量的数据集并采用先进的深度学习模型进行分类:首先对原始473张图像进行标准化处理与数据增强(生成3784张图像),随后使用InceptionV3和ResNet50两种预训练模型进行训练,其中InceptionV3达到98.15%的准确率,显著优于ResNet50的94.46%;同时引入CutMix和MixUp等数据混合方法提升模型鲁棒性,并结合SHAP分析增强模型预测的可解释性,从而实现高效、可靠且可解释的病害识别。

链接: https://arxiv.org/abs/2602.10967
作者: Samanta Ghosh,Shaila Afroz Anika,Umma Habiba Ahmed,B. M. Shahria Alam,Mohammad Tahmid Noor,Nishat Tasnim Niloy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 13 figures, his is the author’s accepted manuscript of a paper accepted for publication in the Proceedings of the 16th International IEEE Conference on Computing, Communication and Networking Technologies (ICCCNT 2025). The final published version will be available via IEEE Xplore

点击查看摘要

Abstract:Guava fruits often suffer from many diseases. This can harm fruit quality and fruit crop yield. Early identification is important for minimizing damage and ensuring fruit health. This study focuses on 3 different categories for classifying diseases. These are Anthracnose, Fruit flies, and Healthy fruit. The data set used in this study is collected from Mendeley Data. This dataset contains 473 original images of Guava. These images vary in size and format. The original dataset was resized to 256x256 pixels with RGB color mode for better consistency. After this, the Data augmentation process is applied to improve the dataset by generating variations of the original images. The augmented dataset consists of 3784 images using advanced preprocessing techniques. Two deep learning models were implemented to classify the images. The InceptionV3 model is well known for its advanced framework. These apply multiple convolutional filters for obtaining different features effectively. On the other hand, the ResNet50 model helps to train deeper networks by using residual learning. The InceptionV3 model achieved the impressive accuracy of 98.15%, and ResNet50got 94.46% accuracy. Data mixing methods such as CutMix and MixUp were applied to enhance the model’s robustness. The confusion matrix was used to evaluate the overall model performance of both InceptionV3 and Resnet50. Additionally, SHAP analysis is used to improve interpretability, which helps to find the significant parts of the image for the model prediction. This study purposes to highlight how advanced models enhan

[CV-18] owards Learning a Generalizable 3D Scene Representation from 2D Observations

【速读】:该论文旨在解决从机器人第一人称视角(egocentric robot observations)中准确预测三维工作空间占据情况(3D workspace occupancy)的问题,尤其针对传统基于相机坐标系的方法难以直接应用于机器人操作任务的局限性。解决方案的关键在于提出一种可泛化的神经辐射场(Neural Radiance Field, NeRF)方法,其在全局工作空间坐标系中构建占据表示,而非依赖于特定相机视角;同时通过灵活的多源视图融合机制,使模型能够在不进行场景特定微调的情况下适应未见过的物体布局,从而实现对完整三维占据信息的推断,包括被遮挡区域。

链接: https://arxiv.org/abs/2602.10943
作者: Martin Gromniak,Jan-Gerrit Habekost,Sebastian Kamp,Sven Magg,Stefan Wermter
机构: University of Hamburg - Department of Informatics (汉堡大学信息学院); ZAL Center of Applied Aeronautical Research (应用航空研究中心); Hamburger Informatik Technologie-Center e.V. (HITeC) (汉堡信息技术中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Paper accepted at ESANN 2026

点击查看摘要

Abstract:We introduce a Generalizable Neural Radiance Field approach for predicting 3D workspace occupancy from egocentric robot observations. Unlike prior methods operating in camera-centric coordinates, our model constructs occupancy representations in a global workspace frame, making it directly applicable to robotic manipulation. The model integrates flexible source views and generalizes to unseen object arrangements without scene-specific finetuning. We demonstrate the approach on a humanoid robot and evaluate predicted geometry against 3D sensor ground truth. Trained on 40 real scenes, our model achieves 26mm reconstruction error, including occluded regions, validating its ability to infer complete 3D occupancy beyond traditional stereo vision methods.

[CV-19] FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference

【速读】:该论文旨在解决大规模扩散模型(如FLUX和Stable Diffusion 3)在多GPU并行推理时因统一序列并行(Unified Sequence Parallelism, USP)实现中存在的效率瓶颈问题,包括过度的内核启动开销和次优的计算-通信调度策略。其解决方案的关键在于提出FastUSP——一个多层次优化框架,集成编译级优化(CUDA Graph图编译与计算-通信重排序)、通信级优化(FP8量化集合通信)以及算子级优化(双缓冲流水线Ring注意力机制),从而显著提升分布式注意力计算的性能。实验表明,在FLUX(12B参数)上,FastUSP相较基线USP实现1.12×–1.16×端到端加速,其中编译级优化贡献最大;而在Qwen-Image模型上,尽管存在PyTorch Inductor对Ring注意力的兼容性限制,FastUSP仍能在2-GPU场景下实现1.09×加速,并揭示了现代高带宽GPU互连环境下,内核启动开销而非通信延迟才是主要瓶颈。

链接: https://arxiv.org/abs/2602.10940
作者: Guandong Li
机构: iFLYTEK(科大讯飞)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose \textbfFastUSP, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves consistent \textbf1.12 \times --1.16 \times end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves \textbf1.09 \times speedup on 2 GPUs; on 4–8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to 1.30 \times --1.46 \times of 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead – rather than communication latency – is the primary bottleneck on modern high-bandwidth GPU interconnects.

[CV-20] ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving ICLR2026

【速读】:该论文旨在解决世界模型(World Model)在端到端自动驾驶框架中因静态区域冗余建模和与轨迹缺乏深度交互而导致的规划精度不足问题。其解决方案的关键在于提出Temporal Residual World Model (TR-World),通过计算场景表示的时间残差(temporal residuals)来无依赖地提取动态物体信息,仅以时间残差作为输入,从而更精确地预测动态物体的未来空间分布;同时结合当前BEV特征中的静态物体信息生成准确的未来BEV特征,并引入Future-Guided Trajectory Refinement (FGTR)模块,实现先验轨迹与未来BEV特征之间的交互,既利用未来道路条件优化轨迹,又通过稀疏时空监督防止世界模型崩溃。

链接: https://arxiv.org/abs/2602.10884
作者: Jinqing Zhang,Zehua Fu,Zelin Xu,Wenying Dai,Qingjie Liu,Yunhong Wang
机构: Beihang University (北京航空航天大学); Zhongguancun Laboratory (中关村实验室); Beijing Jingwei Hirain Technologies Co., Inc. (北京经纬恒润科技有限公司); Hangzhou Innovation Institute, Beihang University (杭州创新研究院,北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026

点击查看摘要

Abstract:The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module can not only utilize future road conditions to refine trajectories, but also provides sparse spatial-temporal supervision on future BEV features to prevent world model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art planning performance. The code is available at this https URL.

[CV-21] Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在从图表图像生成绘图代码时难以保持结构一致性的挑战,现有方法多依赖监督微调,导致模型倾向于表面级的token模仿而非对底层图表结构的准确建模,从而产生幻觉或语义不一致的输出。解决方案的关键在于提出一种名为“Chart Specification”的结构化中间表示,它通过过滤语法噪声构建结构平衡的训练集,并引入“Spec-Align Reward”提供细粒度且可验证的结构正确性反馈,使强化学习能够强制执行一致的绘图逻辑,从而显著提升生成代码的结构 fidelity 和数据效率。

链接: https://arxiv.org/abs/2602.10880
作者: Minggui He,Mingchen Dai,Jian Zhang,Yilun Liu,Shimin Tao,Pufan Zeng,Osamu Yoshie,Yuya Ieiri
机构: Waseda University (早稻田大学); Tokyo Institute of Technology (东京工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review

点击查看摘要

Abstract:Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation rather than faithful modeling of underlying chart structure, which often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. With only 3K training samples, we achieve strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: this https URL

[CV-22] Stride-Net: Fairness-Aware Disentangled Representation Learning for Chest X-Ray Diagnosis

【速读】:该论文旨在解决深度神经网络在胸部X光图像分类中对特定人口统计学子群体(如不同种族或性别组合)表现不佳的问题,这可能引发临床安全与公平性风险。现有去偏方法常导致跨数据集性能不稳定,或通过牺牲整体诊断效用换取公平性,未能将公平性内化为模型表征的固有属性。解决方案的关键在于提出Stride-Net框架,其核心创新包括:1)在图像块(patch)级别使用可学习的步长掩码(learnable stride-based mask),选择与标签一致的区域并抑制敏感属性信息;2)引入对抗混淆损失(adversarial confusion loss)以进一步削弱敏感特征的影响;3)通过Group Optimal Transport强制图像特征与BioBERT生成的疾病标签嵌入之间进行语义对齐,从而锚定表征于临床语义并防止捷径学习(shortcut learning)。该方法在MIMIC-CXR和CheXpert数据集上验证了其在多种架构下的公平性提升与准确率保持甚至超越,实现了更优的准确性-公平性权衡。

链接: https://arxiv.org/abs/2602.10875
作者: Darakshan Rashid,Raza Imam,Dwarikanath Mahapatra,Brejesh Lall
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 Tables, 3 Figures. Our code is available this https URL

点击查看摘要

Abstract:Deep neural networks for chest X-ray classification achieve strong average performance, yet often underperform for specific demographic subgroups, raising critical concerns about clinical safety and equity. Existing debiasing methods frequently yield inconsistent improvements across datasets or attain fairness by degrading overall diagnostic utility, treating fairness as a post hoc constraint rather than a property of the learned representation. In this work, we propose Stride-Net (Sensitive Attribute Resilient Learning via Disentanglement and Learnable Masking with Embedding Alignment), a fairness-aware framework that learns disease-discriminative yet demographically invariant representations for chest X-ray analysis. Stride-Net operates at the patch level, using a learnable stride-based mask to select label-aligned image regions while suppressing sensitive attribute information through adversarial confusion loss. To anchor representations in clinical semantics and discourage shortcut learning, we further enforce semantic alignment between image features and BioBERT-based disease label embeddings via Group Optimal Transport. We evaluate Stride-Net on the MIMIC-CXR and CheXpert benchmarks across race and intersectional race-gender subgroups. Across architectures including ResNet and Vision Transformers, Stride-Net consistently improves fairness metrics while matching or exceeding baseline accuracy, achieving a more favorable accuracy-fairness trade-off than prior debiasing approaches. Our code is available at this https URL.

[CV-23] Hyperspectral Smoke Segmentation via Mixture of Prototypes

【速读】:该论文旨在解决传统可见光烟雾分割方法在野火管理和工业安全应用中因光谱信息不足而导致的局限性,尤其是面对云层干扰和半透明烟雾区域时性能下降的问题。其核心解决方案是引入高光谱成像技术,并提出一种基于原型混合(Mixture of Prototypes, MoP)的网络架构,关键创新包括:(1)通过波段分割实现光谱隔离,减少不同波段间的交互污染;(2)基于原型的光谱表示机制以捕捉多样化的光谱模式;(3)双层级路由机制实现空间感知的自适应波段加权,从而提升烟雾分割精度与鲁棒性。

链接: https://arxiv.org/abs/2602.10858
作者: Lujian Yao,Haitao Zhao,Xianghai Kong,Yuhan Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 14 figures

点击查看摘要

Abstract:Smoke segmentation is critical for wildfire management and industrial safety applications. Traditional visible-light-based methods face limitations due to insufficient spectral information, particularly struggling with cloud interference and semi-transparent smoke regions. To address these challenges, we introduce hyperspectral imaging for smoke segmentation and present the first hyperspectral smoke segmentation dataset (HSSDataset) with carefully annotated samples collected from over 18,000 frames across 20 real-world scenarios using a Many-to-One annotations protocol. However, different spectral bands exhibit varying discriminative capabilities across spatial regions, necessitating adaptive band weighting strategies. We decompose this into three technical challenges: spectral interaction contamination, limited spectral pattern modeling, and complex weighting router problems. We propose a mixture of prototypes (MoP) network with: (1) Band split for spectral isolation, (2) Prototype-based spectral representation for diverse patterns, and (3) Dual-level router for adaptive spatial-aware band weighting. We further construct a multispectral dataset (MSSDataset) with RGB-infrared images. Extensive experiments validate superior performance across both hyperspectral and multispectral modalities, establishing a new paradigm for spectral-based smoke segmentation.

[CV-24] Flow caching for autoregressive video generation

【速读】:该论文旨在解决自回归视频生成模型(autoregressive video generation)在序列化生成过程中效率低下的问题,尤其是传统缓存策略因假设所有帧在相同时间步具有均匀去噪特性而无法适配此类模型的非均匀相似性模式。其解决方案的关键在于提出FlowCache,一个专为自回归视频生成设计的缓存框架:通过引入分块缓存策略(chunkwise caching strategy),使每个视频片段独立维护缓存策略,从而实现对不同片段在各时间步是否需要重新计算的细粒度控制;同时结合一种联合重要性-冗余优化的KV缓存压缩机制(joint importance-redundancy optimized KV cache compression mechanism),在固定内存约束下保持生成质量,显著提升推理速度。

链接: https://arxiv.org/abs/2602.10825
作者: Yuexiao Ma,Xuzhe Zheng,Jing Xu,Xiwei Xu,Feng Ling,Xiawu Zheng,Huafeng Kuang,Huixia Li,Xing Wang,Xuefeng Xiao,Fei Chao,Rongrong Ji
机构: Xiamen University (厦门大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames-an assumption that breaks down in autoregressive models where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain independent caching policies, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by a joint importance-redundancy optimized KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality. Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation (VBench: 0.87 increase and 0.79 decrease respectively). These results demonstrate that FlowCache successfully unlocks the potential of autoregressive models for real-time, ultra-long video generation-establishing a new benchmark for efficient video synthesis at scale. The code is available at this https URL.

[CV-25] Resource-Efficient RGB-Only Action Recognition for Edge Deployment

【速读】:该论文旨在解决在边缘设备上进行动作识别时面临的延迟、内存、存储和功耗等资源约束问题。现有方法常依赖骨骼或深度等辅助模态以提升识别精度,但这些模态通常需要额外传感器或高计算成本的姿态估计流程,难以在边缘端实际部署。论文提出了一种轻量级纯RGB网络架构,其核心创新在于基于X3D风格主干网络引入时间移位(Temporal Shift)机制,并结合选择性时间适应(selective temporal adaptation)与无参数注意力(parameter-free attention),从而在不增加额外硬件或计算负担的前提下显著提升模型效率与准确率。实验表明,该方案在NTU RGB+D 60和120数据集上实现了优异的准确性-效率平衡,且在Jetson Orin Nano上的部署级分析验证了其更小的设备占用空间和更优的实际资源利用率。

链接: https://arxiv.org/abs/2602.10818
作者: Dongsik Yoon,Jongeun Kim,Dayeon Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
备注: Under review

点击查看摘要

Abstract:Action recognition on edge devices poses stringent constraints on latency, memory, storage, and power consumption. While auxiliary modalities such as skeleton and depth information can enhance recognition performance, they often require additional sensors or computationally expensive pose-estimation pipelines, limiting practicality for edge use. In this work, we propose a compact RGB-only network tailored for efficient on-device inference. Our approach builds upon an X3D-style backbone augmented with Temporal Shift, and further introduces selective temporal adaptation and parameter-free attention. Extensive experiments on the NTU RGB+D 60 and 120 benchmarks demonstrate a strong accuracy-efficiency balance. Moreover, deployment-level profiling on the Jetson Orin Nano verifies a smaller on-device footprint and practical resource utilization compared to existing RGB-based action recognition techniques.

[CV-26] Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

【速读】:该论文试图解决大规模视觉语言模型(Vision-Language Models, VLMs)在后训练阶段存在的分布外(out-of-distribution, OOD)泛化性能差距问题,即通过强化学习(Reinforcement Learning, RL)微调的模型在OOD场景下表现优于监督微调(Supervised Fine-Tuning, SFT)模型的现象。其核心解决方案是提出一种数据驱动的方法——难度筛选式监督微调(Difficulty-Curated SFT, DC-SFT),关键在于显式地基于样本难度对训练数据进行过滤,剔除过于困难的样本,从而提升模型的泛化能力。实验表明,DC-SFT不仅显著优于标准SFT,还超越了RL方法,同时具备更高的训练稳定性和计算效率。

链接: https://arxiv.org/abs/2602.10815
作者: Aojun Lu,Tao Feng,Hangjie Yuan,Wei Li,Yanan Sun
机构: Sichuan University (四川大学); Tsinghua University (清华大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL’s generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at this https URL.

[CV-27] DMP-3DAD: Cross-Category 3D Anomaly Detection via Realistic Depth Map Projection with Few Normal Samples

【速读】:该论文旨在解决3D点云跨类别异常检测(cross-category anomaly detection for 3D point clouds)问题,即在仅有少量正常样本的情况下,判断一个未见过的物体是否属于目标类别。传统方法依赖类别特定训练,难以适应少样本场景。其解决方案的关键在于提出一种无需训练(training-free)的框架DMP-3DAD,通过将点云投影为固定数量的真实感深度图(realistic depth map),利用冻结的CLIP视觉编码器提取多视角特征表示,并基于加权特征相似性进行异常判定,从而避免了任何微调或类别依赖的适配过程。

链接: https://arxiv.org/abs/2602.10806
作者: Zi Wang,Katsuya Hotta,Koichiro Kamide,Yawen Zou,Jianjian Qin,Chao Zhang,Jun Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-category anomaly detection for 3D point clouds aims to determine whether an unseen object belongs to a target category using only a few normal examples. Most existing methods rely on category-specific training, which limits their flexibility in few-shot scenarios. In this paper, we propose DMP-3DAD, a training-free framework for cross-category 3D anomaly detection based on multi-view realistic depth map projection. Specifically, by converting point clouds into a fixed set of realistic depth images, our method leverages a frozen CLIP visual encoder to extract multi-view representations and performs anomaly detection via weighted feature similarity, which does not require any fine-tuning or category-dependent adaptation. Extensive experiments on the ShapeNetPart dataset demonstrate that DMP-3DAD achieves state-of-the-art performance under few-shot setting. The results show that the proposed approach provides a simple yet effective solution for practical cross-category 3D anomaly detection.

[CV-28] RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation

【速读】:该论文旨在解决遥感领域多模态大语言模型(Multimodal Large Language Models, MLLMs)中存在的幻觉问题(hallucinations),即模型输出与输入遥感(Remote Sensing, RS)图像不一致的现象,这在应急管理和农业监测等高风险场景中尤为严重。解决方案的关键在于:首先,提出面向遥感任务的幻觉分类体系,并引入图像级幻觉(image-level hallucination)以捕捉超越目标中心错误(如模态、分辨率和场景语义层面)的RS特异性不一致性;其次,构建了RSHalluEval基准测试集(2,023个问答对)并实现双模式检测机制,支持高精度云端审计与低成本本地复现检查;最后,设计了一个面向训练友好的领域定制数据集RSHalluShield(30k个问答对),并提出无需训练的插件式缓解策略,包括解码时的logit校正和遥感感知提示(RS-aware prompting),在统一评估协议下使代表性RS-MLLMs的无幻觉率提升最高达21.63个百分点,同时保持下游任务(RSVQA/RSVG)性能竞争力。

链接: https://arxiv.org/abs/2602.10799
作者: Zihui Zhou,Yong Feng,Yanying Chen,Guofan Duan,Zhenxi Song,Mingliang Zhou,Weijia Jia
机构: Chongqing University (重庆大学); Chongqing Vocational College of Science and Technology (重庆科技职业学院); Harbin Institute of Technology (哈尔滨工业大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, which are responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark RSHalluEval (2,023 QA pairs) and enable dual-mode checking, supporting high-precision cloud auditing and low-cost reproducible local checking via a compact checker fine-tuned on RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset RSHalluShield (30k QA pairs) for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.

[CV-29] Kill it with FIRE: On Leverag ing Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks

【速读】:该论文旨在解决部署后模型中存在的后门攻击(backdoor attack)问题,尤其针对已投入使用的神经网络模型,现有缓解策略如训练数据过滤、模型修改或输入样本的昂贵调整在模型部署后往往失效或效率低下。其解决方案的关键在于提出一种推理阶段的后门缓解方法FIRE(Feature-space Inference-time REpair),核心思想是假设触发器(trigger)会在模型内部表示空间中引发结构化且可重复的变化,从而将触发器视为层间潜在空间中的特定方向;通过反向施加这些方向来修正模型的推理机制,使中毒样本的特征沿后门方向移动以消除触发效应,实现对后门行为的自我修复。

链接: https://arxiv.org/abs/2602.10780
作者: Enrico Ahlers,Daniel Passon,Yannic Noller,Lars Grunske
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Machine learning models are increasingly present in our everyday lives; as a result, they become targets of adversarial attackers seeking to manipulate the systems we interact with. A well-known vulnerability is a backdoor introduced into a neural network by poisoned training data or a malicious training process. Backdoors can be used to induce unwanted behavior by including a certain trigger in the input. Existing mitigations filter training data, modify the model, or perform expensive input modifications on samples. If a vulnerable model has already been deployed, however, those strategies are either ineffective or inefficient. To address this gap, we propose our inference-time backdoor mitigation approach called FIRE (Feature-space Inference-time REpair). We hypothesize that a trigger induces structured and repeatable changes in the model’s internal representation. We view the trigger as directions in the latent spaces between layers that can be applied in reverse to correct the inference mechanism. Therefore, we turn the backdoored model against itself by manipulating its latent representations and moving a poisoned sample’s features along the backdoor directions to neutralize the trigger. Our evaluation shows that FIRE has low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.

[CV-30] From Steering to Pedalling: Do Autonomous Driving VLMs Generalize to Cyclist-Assistive Spatial Perception and Planning ?

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在城市交通场景中缺乏以骑行者为中心的感知与推理能力评估的问题。现有评测体系多基于车辆视角,无法充分反映骑行者在复杂交通环境中对安全决策的需求。解决方案的关键在于提出CyclingVQA这一诊断性基准,专门用于测试模型从骑行者视角出发的感知能力、时空理解能力以及交通规则到车道映射的推理能力。通过在31种以上主流VLM上进行系统评估,该研究揭示了当前模型在识别骑行者特有交通提示和关联标志与正确车道方面的不足,并指出驾驶专用模型在骑行辅助任务中表现不如通用型模型,从而为开发更有效的骑行者辅助智能系统提供了方向。

链接: https://arxiv.org/abs/2602.10771
作者: Krishna Kanth Nakka,Vedasri Nakka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Preprint

点击查看摘要

Abstract:Cyclists often encounter safety-critical situations in urban traffic, highlighting the need for assistive systems that support safe and informed decision-making. Recently, vision-language models (VLMs) have demonstrated strong performance on autonomous driving benchmarks, suggesting their potential for general traffic understanding and navigation-related reasoning. However, existing evaluations are predominantly vehicle-centric and fail to assess perception and reasoning from a cyclist-centric viewpoint. To address this gap, we introduce CyclingVQA, a diagnostic benchmark designed to probe perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist’s perspective. Evaluating 31+ recent VLMs spanning general-purpose, spatially enhanced, and autonomous-driving-specialized models, we find that current models demonstrate encouraging capabilities, while also revealing clear areas for improvement in cyclist-centric perception and reasoning, particularly in interpreting cyclist-specific traffic cues and associating signs with the correct navigational lanes. Notably, several driving-specialized models underperform strong generalist VLMs, indicating limited transfer from vehicle-centric training to cyclist-assistive scenarios. Finally, through systematic error analysis, we identify recurring failure modes to guide the development of more effective cyclist-assistive intelligent systems.

[CV-31] Dual-End Consistency Model

【速读】:该论文旨在解决扩散模型和基于流的生成模型在实际部署中因迭代采样速度慢而导致的效率瓶颈问题,尤其是当前一致性模型(Consistency Models, CMs)在大规模应用中面临的两个关键挑战:训练不稳定性和采样灵活性差。其解决方案的关键在于提出双端一致性模型(Dual-End Consistency Model, DE-CM),通过选择关键子轨迹簇来稳定训练并提升采样效率;具体而言,DE-CM将PF-ODE轨迹分解并选取三个关键子轨迹作为优化目标,利用连续时间一致性模型目标实现少步蒸馏,并引入流匹配作为边界正则项以稳定训练过程;此外,创新性地设计了噪声到噪声(Noise-to-Noisy, N2N)映射机制,可将任意噪声点映射至目标分布,从而缓解初始步骤的误差累积问题。实验表明,DE-CM在ImageNet 256x256数据集上实现单步生成FID分数1.70,优于现有基于CM的方法。

链接: https://arxiv.org/abs/2602.10764
作者: Linwei Dong,Ruoyu Guo,Ge Bai,Zehuan Yuan,Yawei Luo,Changqing Zou
机构: Zhejiang University (浙江大学); Bytedance Inc. (字节跳动); Zhejiang Lab (浙江省实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The slow iterative sampling nature remains a major bottleneck for the practical deployment of diffusion and flow-based generative models. While consistency models (CMs) represent a state-of-the-art distillation-based approach for efficient generation, their large-scale application is still limited by two key issues: training instability and inflexible sampling. Existing methods seek to mitigate these problems through architectural adjustments or regularized objectives, yet overlook the critical reliance on trajectory selection. In this work, we first conduct an analysis on these two limitations: training instability originates from loss divergence induced by unstable self-supervised term, whereas sampling inflexibility arises from error accumulation. Based on these insights and analysis, we propose the Dual-End Consistency Model (DE-CM) that selects vital sub-trajectory clusters to achieve stable and effective training. DE-CM decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. Specifically, our approach leverages continuous-time CMs objectives to achieve few-step distillation and utilizes flow matching as a boundary regularizer to stabilize the training process. Furthermore, we propose a novel noise-to-noisy (N2N) mapping that can map noise to any point, thereby alleviating the error accumulation in the first step. Extensive experimental results show the effectiveness of our method: it achieves a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.

[CV-32] xt-to-Vector Conversion for Residential Plan Design

【速读】:该论文旨在解决从文本描述生成矢量住宅平面图以及将光栅平面图转换为结构化矢量图像的问题。其核心挑战在于如何在保持几何精度(如直角)和视觉质量的同时,提升生成效率与可扩展性。解决方案的关键在于提出了一种新颖的生成方法,利用CLIPScore指标衡量视觉一致性,相比现有方案提升约5%;同时开发了一种新的矢量化算法,将光栅平面图转化为结构化矢量图像,CLIPScore相较其他方法提高约4%,体现出对几何约束和灵活参数设置的优越处理能力。

链接: https://arxiv.org/abs/2602.10757
作者: Egor Bazhenov,Stepan Kasai,Viacheslav Shalamov,Valeria Efimova
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 1 figure

点击查看摘要

Abstract:Computer graphics, comprising both raster and vector components, is a fundamental part of modern science, industry, and digital communication. While raster graphics offer ease of use, its pixel-based structure limits scalability. Vector graphics, defined by mathematical primitives, provides scalability without quality loss, however, it is more complex to produce. For design and architecture, the versatility of vector graphics is paramount, despite its computational demands. This paper introduces a novel method for generating vector residential plans from textual descriptions. Our approach surpasses existing solutions by approximately 5% in CLIPScore-based visual quality, benefiting from its inherent handling of right angles and flexible settings. Additionally, we present a new algorithm for vectorizing raster plans into structured vector images. Such images have a better CLIPscore compared to others by about 4%.

[CV-33] SecureScan: An AI-Driven Multi-Layer Framework for Malware and Phishing Detection Using Logistic Regression and Threat Intelligence Integration

【速读】:该论文旨在解决传统基于签名的入侵检测系统在应对日益复杂的现代恶意软件(malware)和网络钓鱼(phishing)攻击时有效性下降的问题。其解决方案的关键在于提出了一种名为SecureScan的AI驱动三重检测框架,通过集成逻辑回归分类、启发式分析与外部威胁情报(如VirusTotal API),实现对URL、文件哈希及二进制文件的综合筛查;该架构以效率优先,利用启发式规则过滤已知威胁,用机器学习模型分类不确定样本,并借助第三方情报验证边界案例,同时引入校准阈值和灰区逻辑(0.45–0.55)以降低误报率并提升实际部署稳定性,最终在基准数据集上实现了93.1%的准确率,且性能可媲美复杂深度学习系统。

链接: https://arxiv.org/abs/2602.10750
作者: Rumman Firdos,Aman Dangi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The growing sophistication of modern malware and phishing campaigns has diminished the effectiveness of traditional signature-based intrusion detection systems. This work presents SecureScan, an AI-driven, triple-layer detection framework that integrates logistic regression-based classification, heuristic analysis, and external threat intelligence via the VirusTotal API for comprehensive triage of URLs, file hashes, and binaries. The proposed architecture prioritizes efficiency by filtering known threats through heuristics, classifying uncertain samples using machine learning, and validating borderline cases with third-party intelligence. On benchmark datasets, SecureScan achieves 93.1 percent accuracy with balanced precision (0.87) and recall (0.92), demonstrating strong generalization and reduced overfitting through threshold-based decision calibration. A calibrated threshold and gray-zone logic (0.45-0.55) were introduced to minimize false positives and enhance real-world stability. Experimental results indicate that a lightweight statistical model, when augmented with calibrated verification and external intelligence, can achieve reliability and performance comparable to more complex deep learning systems.

[CV-34] Spectral-Spatial Contrastive Learning Framework for Regression on Hyperspectral Data

【速读】:该论文旨在解决对比学习(Contrastive Learning)在回归任务中应用不足的问题,尤其是针对高光谱数据(Hyperspectral Data)的回归建模缺乏有效方法的现状。其解决方案的关键在于提出了一种面向回归任务的光谱-空间对比学习框架(Spectral-Spatial Contrastive Learning Framework),该框架具有模型无关性(model-agnostic),可无缝增强如3D卷积网络和基于Transformer的骨干模型;同时,论文还系统构建了适用于高光谱数据增强的一系列变换策略(transformations),实验证明这些设计能显著提升多种骨干模型在合成与真实数据集上的回归性能。

链接: https://arxiv.org/abs/2602.10745
作者: Mohamad Dhaini,Paul Honeine,Maxime Berar,Antonin Van Exem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Contrastive learning has demonstrated great success in representation learning, especially for image classification tasks. However, there is still a shortage in studies targeting regression tasks, and more specifically applications on hyperspectral data. In this paper, we propose a spectral-spatial contrastive learning framework for regression tasks for hyperspectral data, in a model-agnostic design allowing to enhance backbones such as 3D convolutional and transformer-based networks. Moreover, we provide a collection of transformations relevant for augmenting hyperspectral data. Experiments on synthetic and real datasets show that the proposed framework and transformations significantly improve the performance of all studied backbone models.

[CV-35] Self-Supervised Image Super-Resolution Quality Assessment based on Content-Free Multi-Model Oriented Representation Learning

【速读】:该论文旨在解决真实场景下超分辨率图像质量评估(Super-Resolution Image Quality Assessment, SR-IQA)的难题,尤其是针对由复杂、不规则退化引起的高病态问题。现有方法多基于合成低分辨率(Low-Resolution, LR)图像,难以适配现实世界中不可预测且多样化的退化模式。解决方案的关键在于提出一种无参考(no-reference)SR-IQA方法S3 RIQA,其核心创新包括:1)采用自监督学习(Self-Supervised Learning, SSL)策略,在预训练阶段构建基于相同SR模型生成图像的正样本对与不同SR模型生成图像的负样本对,从而提取与SR算法相关而非图像内容相关的特征表示;2)引入针对性预处理以提取互补的质量信息,并设计辅助任务以适应不同缩放因子下的退化特性;3)构建新数据集SRMORSS用于无监督预训练,涵盖多种真实LR图像和SR算法,填补了现有数据集的空白。实验表明,S3 RIQA在真实SR-IQA基准上显著优于主流指标。

链接: https://arxiv.org/abs/2602.10744
作者: Kian Majlessi,Amir Masoud Soltani,Mohammad Ebrahim Mahdavi,Aurelien Gourrier,Peyman Adibi
机构: University of Isfahan (伊斯法罕大学); Univ. Grenoble Alpes (格勒诺布尔阿尔卑斯大学); CNRS (法国国家科学研究中心); LIPhy (物理化学实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Super-resolution (SR) applied to real-world low-resolution (LR) images often results in complex, irregular degradations that stem from the inherent complexity of natural scene acquisition. In contrast to SR artifacts arising from synthetic LR images created under well-defined scenarios, those distortions are highly unpredictable and vary significantly across different real-life contexts. Consequently, assessing the quality of SR images (SR-IQA) obtained from realistic LR, remains a challenging and underexplored problem. In this work, we introduce a no-reference SR-IQA approach tailored for such highly ill-posed realistic settings. The proposed method enables domain-adaptive IQA for real-world SR applications, particularly in data-scarce domains. We hypothesize that degradations in super-resolved images are strongly dependent on the underlying SR algorithms, rather than being solely determined by image content. To this end, we introduce a self-supervised learning (SSL) strategy that first pretrains multiple SR model oriented representations in a pretext stage. Our contrastive learning framework forms positive pairs from images produced by the same SR model and negative pairs from those generated by different methods, independent of image content. The proposed approach S3 RIQA, further incorporates targeted preprocessing to extract complementary quality information and an auxiliary task to better handle the various degradation profiles associated with different SR scaling factors. To this end, we constructed a new dataset, SRMORSS, to support unsupervised pretext training; it includes a wide range of SR algorithms applied to numerous real LR images, which addresses a gap in existing datasets. Experiments on real SR-IQA benchmarks demonstrate that S3 RIQA consistently outperforms most state-of-the-art relevant metrics.

[CV-36] OccFace: Unified Occlusion-Aware Facial Landmark Detection with Per-Point Visibility

【速读】:该论文旨在解决在遮挡条件下准确检测人脸关键点(facial landmark)的问题,尤其针对具有大外观变化和由旋转引起的自遮挡的人类及类人面部。现有方法通常隐式处理遮挡,缺乏对每个关键点可见性的显式预测,限制了下游应用的性能。其解决方案的关键在于提出一个称为OccFace的遮挡感知框架,采用统一的100点密集布局和基于热图的骨干网络,并引入一个遮挡模块,通过结合局部证据与跨关键点上下文,联合预测关键点坐标与每点可见性;同时设计了一种混合监督策略,融合人工标注与基于掩码-热图重叠推导的伪可见性标签,从而提升模型在遮挡区域的鲁棒性,同时保持对可见关键点的高精度。

链接: https://arxiv.org/abs/2602.10728
作者: Xinhao Xiang,Zhengxin Li,Saurav Dhakad,Theo Bancroft,Jiawei Zhang,Weiyang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate facial landmark detection under occlusion remains challenging, especially for human-like faces with large appearance variation and rotation-driven self-occlusion. Existing detectors typically localize landmarks while handling occlusion implicitly, without predicting per-point visibility that downstream applications can benefits. We present OccFace, an occlusion-aware framework for universal human-like faces, including humans, stylized characters, and other non-human designs. OccFace adopts a unified dense 100-point layout and a heatmap-based backbone, and adds an occlusion module that jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context. Visibility supervision mixes manual labels with landmark-aware masking that derives pseudo visibility from mask-heatmap overlap. We also create an occlusion-aware evaluation suite reporting NME on visible vs. occluded landmarks and benchmarking visibility with Occ AP, F1@0.5, and ROC-AUC, together with a dataset annotated with 100-point landmarks and per-point visibility. Experiments show improved robustness under external occlusion and large head rotations, especially on occluded regions, while preserving accuracy on visible landmarks.

[CV-37] A Diffusion-Based Generative Prior Approach to Sparse-view Computed Tomography

【速读】:该论文旨在解决从稀疏或有限角度几何条件下重建X射线计算机断层扫描(CT)图像时存在的难题,此类问题常因数据不足导致图像伪影甚至目标失真。其解决方案的关键在于引入深度生成先验(Deep Generative Prior, DGP)框架,将基于扩散的生成模型与迭代优化算法相结合,在保持模型驱动方法可解释性的同时,利用神经网络的生成能力提升重建质量。该方法通过在最小化问题求解过程中嵌入生成式先验,显著改善了高稀疏条件下的重建效果。

链接: https://arxiv.org/abs/2602.10722
作者: Davide Evangelista,Pasquale Cascarano,Elena Loli Piccolomini
机构: University of Bologna (博洛尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures, 1 table

点击查看摘要

Abstract:The reconstruction of X-rays CT images from sparse or limited-angle geometries is a highly challenging task. The lack of data typically results in artifacts in the reconstructed image and may even lead to object distortions. For this reason, the use of deep generative models in this context has great interest and potential success. In the Deep Generative Prior (DGP) framework, the use of diffusion-based generative models is combined with an iterative optimization algorithm for the reconstruction of CT images from sinograms acquired under sparse geometries, to maintain the explainability of a model-based approach while introducing the generative power of a neural network. There are therefore several aspects that can be further investigated within these frameworks to improve reconstruction quality, such as image generation, the model, and the iterative algorithm used to solve the minimization problem, for which we propose modifications with respect to existing approaches. The results obtained even under highly sparse geometries are very promising, although further research is clearly needed in this direction.

[CV-38] Ecological mapping with geospatial foundation models

【速读】:该论文旨在解决生成式 AI (Generative AI) 在生态学高价值应用场景中的潜力尚未被充分挖掘的问题,特别是针对土地利用/土地覆盖(LULC)制图、森林功能性状映射和泥炭地识别等任务。解决方案的关键在于通过微调两种预训练的地学基础模型(Geospatial Foundation Models, GFMs)——Prithvi-E0-2.0 和 TerraMind —— 并与基准模型 ResNet-101 进行对比实验,验证 GFMs 在多模态输入下的性能优势;结果表明,TerraMind 在多数任务中优于 Prithvi,尤其在融合额外模态数据时显著超越 ResNet 和 Prithvi,但其效果仍受限于输入数据与预训练模态的差异性,且需更高分辨率和更精确标签以提升像素级动态映射能力。

链接: https://arxiv.org/abs/2602.10720
作者: Craig Mahlasi,Gciniwe S. Baloyi,Zaheed Gaffoor,Levente Klein,Anne Jones,Etienne Vos,Michal Muszynski,Geoffrey Dawson,Campbell Watson
机构: University of Cape Town (开普敦大学); University of the Witwatersrand (威特沃特斯兰德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Geospatial foundation models (GFMs) are a fast-emerging paradigm for various geospatial tasks, such as ecological mapping. However, the utility of GFMs has not been fully explored for high-value use cases. This study aims to explore the utility, challenges and opportunities associated with the application of GFMs for ecological uses. In this regard, we fine-tune several pretrained AI models, namely, Prithvi-E0-2.0 and TerraMind, across three use cases, and compare this with a baseline ResNet-101 model. Firstly, we demonstrate TerraMind’s LULC generation capabilities. Lastly, we explore the utility of the GFMs in forest functional trait mapping and peatlands detection. In all experiments, the GFMs outperform the baseline ResNet models. In general TerraMind marginally outperforms Prithvi. However, with additional modalities TerraMind significantly outperforms the baseline ResNet and Prithvi models. Nonetheless, consideration should be given to the divergence of input data from pretrained modalities. We note that these models would benefit from higher resolution and more accurate labels, especially for use cases where pixel-level dynamics need to be mapped.

[CV-39] From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)驱动系统在端到端(end-to-end, E2E)规划中,除传统准确率-成本权衡外的潜在行为差异与优化空间问题。研究发现,引入视觉语言模型(Vision-Language Model, VLM)作为骨干网络时,会在视觉仅模型(Vision-only backbone, 如ViT)的基础上引入新的子空间,导致在长尾场景下行为模式不同:VLM更激进、ViT更保守,且两者在约2–3%的测试场景中各自表现更优。关键解决方案是提出HybridDriveVLA和DualDriveVLA两种架构——前者通过并行运行ViT与VLM分支并使用学习评分器选择最优终点轨迹,将PDMS提升至92.10;后者采用“快慢策略”,默认运行ViT,仅当评分器置信度低于阈值时调用VLM(覆盖15%场景),在保持91.00 PDMS的同时实现3.2倍吞吐量提升。

链接: https://arxiv.org/abs/2602.10719
作者: Sining Ang,Yuguang Yang,Chenxu Dang,Canyu Chen,Cheng Chi,Haiyan Liu,Xuanyao Mao,Jason Bao,Xuliang,Bingchuan Sun,Yan Wang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages (10 pages main text + 12 pages appendix), 18 figures

点击查看摘要

Abstract:Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy–cost trade-off. We revisit this question with 3–RQ analysis in RecogDrive by instantiating the system with a full VLM and vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM can introduce additional subspaces upon the vision-only backbones. RQ2: This unique subspace leads to a different behavioral in some long-tail scenario: the VLM tends to be more aggressive whereas ViT is more conservative, and each decisively wins on about 2–3% of test scenarios; With an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To fully harness this observation, we propose HybridDriveVLA, which runs both ViT and VLM branches and selects between their endpoint trajectories using a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast–slow policy: it runs ViT by default and invokes the VLM only when the scorer’s confidence falls below a threshold; calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.

[CV-40] FGAA-FPN: Foreground-Guided Angle-Aware Feature Pyramid Network for Oriented Object Detection

【速读】:该论文旨在解决定向目标检测(oriented object detection)中因背景杂乱、尺度变化剧烈及方向差异大所带来的挑战,尤其针对现有方法在缺乏显式前景建模和几何方向先验利用不足时导致的特征判别能力受限问题。解决方案的关键在于提出FGAA-FPN框架,其核心创新为两个模块:一是前景引导的特征调制模块(Foreground-Guided Feature Modulation),通过弱监督学习提取前景显著性,增强低层特征中的目标区域并抑制背景干扰;二是角度感知的多头注意力模块(Angle-Aware Multi-Head Attention),编码相对方向关系以指导高层语义特征间的全局交互,从而显式融合几何方向信息。该设计在DOTA数据集上实现了75.5%和68.3%的mAP,显著优于现有方法。

链接: https://arxiv.org/abs/2602.10710
作者: Jialin Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to The Visual Computer

点击查看摘要

Abstract:With the increasing availability of high-resolution remote sensing and aerial imagery, oriented object detection has become a key capability for geographic information updating, maritime surveillance, and disaster response. However, it remains challenging due to cluttered backgrounds, severe scale variation, and large orientation changes. Existing approaches largely improve performance through multi-scale feature fusion with feature pyramid networks or contextual modeling with attention, but they often lack explicit foreground modeling and do not leverage geometric orientation priors, which limits feature discriminability. To overcome these limitations, we propose FGAA-FPN, a Foreground-Guided Angle-Aware Feature Pyramid Network for oriented object detection. FGAA-FPN is built on a hierarchical functional decomposition that accounts for the distinct spatial resolution and semantic abstraction across pyramid levels, thereby strengthening multi-scale representations. Concretely, a Foreground-Guided Feature Modulation module learns foreground saliency under weak supervision to enhance object regions and suppress background interference in low-level features. In parallel, an Angle-Aware Multi-Head Attention module encodes relative orientation relationships to guide global interactions among high-level semantic features. Extensive experiments on DOTA v1.0 and DOTA v1.5 demonstrate that FGAA-FPN achieves state-of-the-art results, reaching 75.5% and 68.3% mAP, respectively.

[CV-41] (MGS)2-Net: Unifying Micro-Geometric Scale and Macro-Geometric Structure for Cross-View Geo-Localization

【速读】:该论文针对的是跨视角地理定位(Cross-view geo-localization, CVGL)在无卫星导航系统(GNSS-denied)环境下,因倾斜航空视图与正射卫星图像之间存在剧烈几何失配而导致的定位性能脆弱问题。现有方法多局限于二维流形空间,忽视了三维几何结构中视图依赖的垂直立面(macro-structure)和尺度变化(micro-scale)对特征对齐的严重干扰。解决方案的关键在于提出一个基于几何约束的框架(MGS)²,其核心创新是Macro-Geometric Structure Filtering (MGSF) 模块——通过稀疏化几何梯度物理滤除高频立面伪影并增强视图不变的水平面特征,从而缓解域偏移;同时引入Micro-Geometric Scale Adaptation (MGSA) 模块,利用深度先验动态校正尺度差异,并结合Geometric-Appearance Contrastive Distillation (GACD) 损失强化对遮挡区域的判别能力,显著提升跨视角匹配鲁棒性与泛化性能。

链接: https://arxiv.org/abs/2602.10704
作者: Minglei Li,Mengfan He,Chao Chen,Ziyang Meng
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Cross-view geo-localization (CVGL) is pivotal for GNSS-denied UAV navigation but remains brittle under the drastic geometric misalignment between oblique aerial views and orthographic satellite references. Existing methods predominantly operate within a 2D manifold, neglecting the underlying 3D geometry where view-dependent vertical facades (macro-structure) and scale variations (micro-scale) severely corrupt feature alignment. To bridge this gap, we propose (MGS) ^2 , a geometry-grounded framework. The core of our innovation is the Macro-Geometric Structure Filtering (MGSF) module. Unlike pixel-wise matching sensitive to noise, MGSF leverages dilated geometric gradients to physically filter out high-frequency facade artifacts while enhancing the view-invariant horizontal plane, directly addressing the domain shift. To guarantee robust input for this structural filtering, we explicitly incorporate a Micro-Geometric Scale Adaptation (MGSA) module. MGSA utilizes depth priors to dynamically rectify scale discrepancies via multi-branch feature fusion. Furthermore, a Geometric-Appearance Contrastive Distillation (GACD) loss is designed to strictly discriminate against oblique occlusions. Extensive experiments demonstrate that (MGS) ^2 achieves state-of-the-art performance, recording a Recall@1 of 97.5% on University-1652 and 97.02% on SUES-200. Furthermore, the framework exhibits superior cross-dataset generalization against geometric ambiguity. The code is available at: \hrefthis https URLthis https URL.

[CV-42] AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在复杂三维(3D)环境中感知与动作接地能力受限的问题,其根源在于现有方法主要依赖于基于二维(2D)图像训练的视觉语言模型(VLM),难以有效建模空间结构信息。解决方案的关键在于引入深度估计模块(如VGGT)以从标准RGB输入中提取几何感知的3D线索,并通过新增的“动作辅助模块”(action assistant)利用动作先验约束学习到的3D特征,确保其与下游控制任务的一致性;最终将增强后的3D特征与传统2D视觉token融合,显著提升VLA模型在复杂场景下的泛化能力和动作预测准确性。

链接: https://arxiv.org/abs/2602.10698
作者: Zhifeng Rao,Wenlong Chen,Lei Xie,Xia Hua,Dongfu Yin,Zhen Tian,F. Richard Yu
机构: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳)); Southern University of Science and Technology (南方科技大学); Shanghai University (上海大学); Carleton University (卡尔顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches primarily rely on VLM trained using 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient utilization of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module called action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also leads to superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.

[CV-43] OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

【速读】:该论文旨在解决现实世界中多模态伪造内容(如文本、图像和视频交织出现)的统一检测与定位问题,现有方法通常局限于单模态或双模态场景,难以应对跨模态交互复杂性和检测-定位双重任务之间的梯度主导偏差(即“难度偏置”问题)。解决方案的关键在于提出 OmniVL-Guard 框架,其核心创新为两个设计:一是自演化思维链生成(Self-Evolving CoT Generation),用于克服初始阶段推理路径质量低的问题;二是自适应奖励缩放策略优化(Adaptive Reward Scaling Policy Optimization, ARSPO),通过动态调节奖励尺度和任务权重实现多任务联合优化的平衡,从而显著提升细粒度伪造定位性能并具备零样本泛化能力。

链接: https://arxiv.org/abs/2602.10687
作者: Jinjie Shen,Jing Wu,Yaxiong Wang,Lechao Cheng,Shengeng Tang,Tianrui Hui,Nan Pu,Zhun Zhong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 38 pages, DeepFake Detection

点击查看摘要

Abstract:Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper targets to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical difficulty bias problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose \textbfOmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. Particularly, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generatio and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, Adaptive Reward Scaling Policy Optimization (ARSPO) dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits zero-shot robust generalization across out-of-domain scenarios.

[CV-44] wiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning

【速读】:该论文旨在解决现有视觉思维链(Visual Chain-of-Thought, VCoT)方法在动态场景下推理能力不足的问题,尤其是其难以捕捉视频中时间动态性以支持指令执行、预测和相机运动等任务。解决方案的关键在于构建首个大规模、时序对齐的VCoT数据集TwiFF-2.7M(基于270万段视频片段),并配套提出TwiFF-Bench评估基准,用于衡量推理轨迹合理性与最终答案准确性;在此基础上,设计了TwiFF模型,该模型通过融合预训练视频生成与图像理解能力,迭代生成未来动作帧和文本推理线索,从而实现时序一致的动态视觉推理,显著优于现有静态VCoT及纯文本思维链(Textual Chain-of-Thought)方法。

链接: https://arxiv.org/abs/2602.10675
作者: Junhua Liu,Zhangcheng Wang,Zhike Han,Ningli Wang,Guotao Liang,Kun Kuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from 2.7 million video clips, explicitly designed for dynamic visual question and answer. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of 1,078 samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified modal that synergistically leverages pre-trained video generation and image comprehension capabilities to produce temporally coherent visual reasoning cues-iteratively generating future action frames and textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, which fully validates the effectiveness for visual question answering in dynamic scenarios. Our code and data is available at this https URL.

[CV-45] AMAP-APP: Efficient Segmentation and Morphometry Quantification of Fluorescent Microscopy Images of Podocytes

【速读】:该论文旨在解决传统自动足细胞足突量化方法“Automatic Morphological Analysis of Podocytes”(AMAP)存在的三大问题:计算资源需求高、缺乏用户界面以及依赖Linux操作系统,从而限制了其在肾病研究中的广泛应用。解决方案的关键在于开发出一款跨平台桌面应用程序AMAP-APP,通过将高计算成本的实例分割(instance segmentation)替换为经典图像处理技术,同时保留原始语义分割模型以维持精度;此外引入改进的感兴趣区域(Region of Interest, ROI)算法,显著提升定量结果的准确性,并实现对小鼠和人类肾脏图像(STED与共聚焦显微镜)的高效自动化分析。该方案使深度学习驱动的足细胞形态计量学工具无需高性能计算集群即可在Windows、macOS和Linux系统上运行,极大增强了可及性与实用性。

链接: https://arxiv.org/abs/2602.10663
作者: Arash Fatehi,David Unnersjö-Jess,Linus Butt,Noémie Moreau,Thomas Benzing,Katarzyna Bozek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background: Automated podocyte foot process quantification is vital for kidney research, but the established “Automatic Morphological Analysis of Podocytes” (AMAP) method is hindered by high computational demands, a lack of a user interface, and Linux dependency. We developed AMAP-APP, a cross-platform desktop application designed to overcome these barriers. Methods: AMAP-APP optimizes efficiency by replacing intensive instance segmentation with classic image processing while retaining the original semantic segmentation model. It introduces a refined Region of Interest (ROI) algorithm to improve precision. Validation involved 365 mouse and human images (STED and confocal), benchmarking performance against the original AMAP via Pearson correlation and Two One-Sided T-tests (TOST). Results: AMAP-APP achieved a 147-fold increase in processing speed on consumer hardware. Morphometric outputs (area, perimeter, circularity, and slit diaphragm density) showed high correlation (r0.90) and statistical equivalence (TOST P0.05) to the original method. Additionally, the new ROI algorithm demonstrated superior accuracy compared to the original, showing reduced deviation from manual delineations. Conclusion: AMAP-APP democratizes deep learning-based podocyte morphometry. By eliminating the need for high-performance computing clusters and providing a user-friendly interface for Windows, macOS, and Linux, it enables widespread adoption in nephrology research and potential clinical diagnostics. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.10663 [cs.CV] (or arXiv:2602.10663v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.10663 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Arash Fatehi [view email] [v1] Wed, 11 Feb 2026 09:07:03 UTC (8,190 KB)

[CV-46] Dynamic Frequency Modulation for Controllable Text-driven Image Generation

【速读】:该论文试图解决文本引导的扩散模型在进行语义调整时,因修改原始文本提示而导致全局结构发生 unintended changes(意外变化)的问题。现有方法依赖经验性地选择特征图进行干预,其性能高度依赖于选取得当与否,从而导致稳定性不足。解决方案的关键在于从频率视角出发,提出一种无需训练的频率调制方法,利用具有动态衰减特性的频率依赖加权函数,直接操控噪声潜在变量。该方法通过区分低频分量主导早期结构框架生成、高频分量主导后期细节纹理合成的机制,在保持结构一致性的同时实现精准的语义修改,避免了对内部特征图的经验性选择,显著优于当前最优方法。

链接: https://arxiv.org/abs/2602.10662
作者: Tiandong Shi,Ling Zhao,Ji Qi,Jiayi Ma,Chengli Peng
机构: Central South University (中南大学); Guangzhou University (广州大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The success of text-guided diffusion models has established a new image generation paradigm driven by the iterative refinement of text prompts. However, modifying the original text prompt to achieve the expected semantic adjustments often results in unintended global structure changes that disrupt user intent. Existing methods rely on empirical feature map selection for intervention, whose performance heavily depends on appropriate selection, leading to suboptimal stability. This paper tries to solve the aforementioned problem from a frequency perspective and analyzes the impact of the frequency spectrum of noisy latent variables on the hierarchical emergence of the structure framework and fine-grained textures during the generation process. We find that lower-frequency components are primarily responsible for establishing the structure framework in the early generation stage. Their influence diminishes over time, giving way to higher-frequency components that synthesize fine-grained textures. In light of this, we propose a training-free frequency modulation method utilizing a frequency-dependent weighting function with dynamic decay. This method maintains the structure framework consistency while permitting targeted semantic modifications. By directly manipulating the noisy latent variable, the proposed method avoids the empirical selection of internal feature maps. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art methods, achieving an effective balance between preserving structure and enabling semantic updates.

[CV-47] AurigaNet: A Real-Time Multi-Task Network for Enhanced Urban Driving Perception

【速读】:该论文旨在解决自动驾驶感知系统中多任务协同优化与实时性能之间的矛盾问题,即如何在保证高精度的同时实现高效计算和部署。其解决方案的关键在于提出一种名为AurigaNet的先进多任务网络架构,该架构通过端到端的实例分割能力统一处理三个核心感知任务:目标检测、车道线检测与可行驶区域分割,从而提升模型整体效率与泛化性能。AurigaNet在BDD100K数据集上验证了其优越性,在可行驶区域分割(IoU达85.2%)、车道线检测(IoU达60.8%)及目标检测(mAP@0.5:0.95为47.6%)三项指标上均显著优于现有方法,并且在Jetson Orin NX等嵌入式平台上实现了具有竞争力的实时推理性能,证明了其工程落地可行性。

链接: https://arxiv.org/abs/2602.10660
作者: Kiarash Ghasemzadeh,Sedigheh Dehghani
机构: University of Alberta (阿尔伯塔大学); Shahid Beheshti University (谢里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Self-driving cars hold significant potential to reduce traffic accidents, alleviate congestion, and enhance urban mobility. However, developing reliable AI systems for autonomous vehicles remains a substantial challenge. Over the past decade, multi-task learning has emerged as a powerful approach to address complex problems in driving perception. Multi-task networks offer several advantages, including increased computational efficiency, real-time processing capabilities, optimized resource utilization, and improved generalization. In this study, we present AurigaNet, an advanced multi-task network architecture designed to push the boundaries of autonomous driving perception. AurigaNet integrates three critical tasks: object detection, lane detection, and drivable area instance segmentation. The system is trained and evaluated using the BDD100K dataset, renowned for its diversity in driving conditions. Key innovations of AurigaNet include its end-to-end instance segmentation capability, which significantly enhances both accuracy and efficiency in path estimation for autonomous vehicles. Experimental results demonstrate that AurigaNet achieves an 85.2% IoU in drivable area segmentation, outperforming its closest competitor by 0.7%. In lane detection, AurigaNet achieves a remarkable 60.8% IoU, surpassing other models by more than 30%. Furthermore, the network achieves an mAP@0.5:0.95 of 47.6% in traffic object detection, exceeding the next leading model by 2.9%. Additionally, we validate the practical feasibility of AurigaNet by deploying it on embedded devices such as the Jetson Orin NX, where it demonstrates competitive real-time performance. These results underscore AurigaNet’s potential as a robust and efficient solution for autonomous driving perception systems. The code can be found here this https URL.

[CV-48] Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation

【速读】:该论文旨在解决文本驱动的3D人-物交互(Human-Object Interaction, HOI)运动生成任务中存在三大核心问题:(Q1)人类动作质量不佳,(Q2)物体运动不自然,以及(Q3)人与物体之间交互弱。为应对这些挑战,作者提出MP-HOI框架,其关键在于四个核心设计:(1)利用大规模多模态模型提供的文本、图像、姿态/物体数据作为先验信息以优化数据建模;(2)通过引入几何关键点、接触特征和动态属性改进物体表征,增强物体表达能力;(3)构建模态感知的混合专家(Multimodal-Aware Mixture-of-Experts, MoE)模型实现高效多模态特征融合;(4)设计级联扩散机制并辅以交互监督策略,逐步精细化人-物交互特征。这一系列创新有效提升了HOI运动生成的质量与真实性。

链接: https://arxiv.org/abs/2602.10659
作者: Yin Wang,Ziyao Zhang,Zhiying Leng,Haitian Liu,Frederick W. B. Li,Mu Li,Xiaohui Liang
机构: Beihang University (北京航空航天大学); University of Durham (杜伦大学); Zhongguancun Laboratory (中关村实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackles Q2 in data representation. (3) Multimodal-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion paradigm, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.

[CV-49] VideoSTF: Stress-Testing Output Repetition in Video Large Language Models

【速读】:该论文旨在解决视频大语言模型(Video Large Language Models, VideoLLMs)中普遍存在但此前未被充分关注的输出重复问题,即模型在生成文本时陷入自强化循环,反复输出相同短语或句子,导致内容退化。这一问题在现有基准测试中未被捕捉,因其主要关注任务准确性和事实正确性。解决方案的关键在于提出首个系统性评估框架VideoSTF,该框架通过三种互补的n-gram基指标量化重复程度,并提供包含10,000个多样化视频的标准测试集及一系列受控的时间变换工具库,从而实现对模型输出稳定性的压力测试与对抗性挖掘,揭示了时间扰动对重复现象的高度敏感性,表明输出重复是可被利用的安全漏洞,进而推动面向稳定性的视频-语言系统评估范式发展。

链接: https://arxiv.org/abs/2602.10639
作者: Yuxin Cao,Wei Song,Shangzhi Xu,Jingling Xue,Jin Song Dong
机构: National University of Singapore (新加坡国立大学); University of New South Wales (新南威尔士大学); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Video Large Language Models (VideoLLMs) have recently achieved strong performance in video understanding tasks. However, we identify a previously underexplored generation failure: severe output repetition, where models degenerate into self-reinforcing loops of repeated phrases or sentences. This failure mode is not captured by existing VideoLLM benchmarks, which focus primarily on task accuracy and factual correctness. We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. VideoSTF formalizes repetition using three complementary n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos together with a library of controlled temporal transformations. Using VideoSTF, we conduct pervasive testing, temporal stress testing, and adversarial exploitation across 10 advanced VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Moreover, we show that simple temporal transformations can efficiently induce repetitive degeneration in a black-box setting, exposing output repetition as an exploitable security vulnerability. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems. Our evaluation code and scripts are available at: this https URL.

[CV-50] Eliminating VAE for Fast and High-Resolution Generative Detail Restoration ICLR2026

【速读】:该论文旨在解决扩散模型(Diffusion Models)在真实世界超分辨率(Super-Resolution, SR)任务中推理速度慢、内存占用高,以及因VAE(Variational Auto-Encoder)瓶颈导致无法直接处理大尺寸图像(如4K)的问题。其关键解决方案是提出GenDR-Pix,通过引入像素级的反向操作(pixel-(un)shuffle)完全移除VAE模块,将基于潜在空间的GenDR重构为像素空间的GenDR-Pix;同时设计多阶段对抗蒸馏机制,逐步消除编码器与解码器,并结合随机填充(random padding)、掩码傅里叶空间损失(masked Fourier space loss)及基于padding的自集成策略与无分类器引导(classifier-free guidance),有效抑制伪影并提升推理效率。实验表明,该方法相较原GenDR实现2.8倍加速和60%内存节省,且能在1秒内完成4K图像恢复,仅需6GB显存。

链接: https://arxiv.org/abs/2602.10630
作者: Yan Wang,Shijie Zhao,Junlin Li,Li Zhang
机构: MMLab, ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Diffusion models have attained remarkable breakthroughs in the real-world super-resolution (SR) task, albeit at slow inference and high demand on devices. To accelerate inference, recent works like GenDR adopt step distillation to minimize the step number to one. However, the memory boundary still restricts the maximum processing size, necessitating tile-by-tile restoration of high-resolution images. Through profiling the pipeline, we pinpoint that the variational auto-encoder (VAE) is the bottleneck of latency and memory. To completely solve the problem, we leverage pixel-(un)shuffle operations to eliminate the VAE, reversing the latent-based GenDR to pixel-space GenDR-Pix. However, upscale with x8 pixelshuffle may induce artifacts of repeated patterns. To alleviate the distortion, we propose a multi-stage adversarial distillation to progressively remove the encoder and decoder. Specifically, we utilize generative features from the previous stage models to guide adversarial discrimination. Moreover, we propose random padding to augment generative features and avoid discriminator collapse. We also introduce a masked Fourier space loss to penalize the outliers of amplitude. To improve inference performance, we empirically integrate a padding-based self-ensemble with classifier-free guidance to improve inference scaling. Experimental results show that GenDR-Pix performs 2.8x acceleration and 60% memory-saving compared to GenDR with negligible visual degradation, surpassing other one-step diffusion SR. Against all odds, GenDR-Pix can restore 4K image in only 1 second and 6GB.

[CV-51] A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology

【速读】:该论文旨在解决医学基础模型在临床部署中依赖任务特定微调(task-specific fine-tuning)导致的泛化性差与应用门槛高的问题。其核心解决方案是提出DermFM-Zero,一个基于掩码潜在建模(masked latent modelling)和对比学习(contrastive learning)训练的皮肤科视觉-语言基础模型,利用超过400万条多模态数据进行预训练,从而实现无需任何任务适配即可在20个基准测试中达到最先进性能。关键创新在于其零样本(zero-shot)能力验证:在三项跨国临床读者研究中,该模型在初级医疗场景中显著提升全科医生对98种皮肤疾病的鉴别诊断准确率,在专科场景中超越板级认证皮肤科医生的皮肤癌多模态评估表现,并在协作流程中使非专家表现优于未受助专家,同时通过稀疏自编码器无监督解耦潜在表示,识别出具有临床意义的概念并针对性抑制伪影诱导偏差,提升了模型鲁棒性且无需重新训练。

链接: https://arxiv.org/abs/2602.10624
作者: Siyuan Yan,Xieji Li,Dan Mo,Philipp Tschandl,Yiwen Jiang,Zhonghua Wang,Ming Hu,Lie Ju,Cristina Vico-Alonso,Yizhen Zheng,Jiahe Liu,Juexiao Zhou,Camilla Chello,Jen G. Cheung,Julien Anriot,Luc Thomas,Clare Primiero,Gin Tan,Aik Beng Ng,Simon See,Xiaoying Tang,Albert Ip,Xiaoyang Liao,Adrian Bowling,Martin Haskett,Shuang Zhao,Monika Janda,H. Peter Soyer,Victoria Mar,Harald Kittler,Zongyuan Ge
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: reports

点击查看摘要

Abstract:Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero’s latent representations are interpretable: sparse autoencoders unsupervisedly disentangle clinically meaningful concepts that outperform predefined-vocabulary approaches and enable targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.

[CV-52] Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation

【速读】:该论文旨在解决当前基于规则的强化微调(Reinforcement Fine-Tuning, RFT)方法在跨模态、以视觉为中心的医学影像领域中应用不足的问题,尤其是在需要强视觉感知与结构化推理能力协同作用的医疗场景下。其解决方案的关键在于提出了一种专为医学领域设计的视觉强化微调框架VRFT-Aug,该框架通过四项核心训练策略实现感知与推理能力的增强:先验知识注入(prior knowledge injection)、感知驱动的策略精炼(perception-driven policy refinement)、医学启发的奖励塑造(medically informed reward shaping)以及行为模仿(behavioral imitation),从而稳定并提升RFT过程的性能表现。

链接: https://arxiv.org/abs/2602.10619
作者: Guangjing Yang,ZhangYuan Yu,Ziyuan Qin,Xinyuan Song,Huahui Yi,Qingbo Kang,Jun Gao,Yiyue Li,Chenlin Du,Qicheng Lao
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Emory University (埃默里大学); Sichuan University (四川大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CPAL 2026

点击查看摘要

Abstract:While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications. Comments: CPAL 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.10619 [cs.CV] (or arXiv:2602.10619v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.10619 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-53] Fast Person Detection Using YOLOX With AI Accelerator For Train Station Safety

【速读】:该论文旨在解决火车站乘客穿越黄线等不安全行为导致的事故问题,以提升交通安全性。其解决方案的关键在于利用YOLOX目标检测算法与边缘AI加速硬件(Hailo-8)相结合,在实时性与准确性之间取得优化平衡。实验表明,相较于Jetson Orin Nano平台,Hailo-8在准确率上提升超过12%,且延迟降低20毫秒,显著增强了列车站台场景下对乘客行为的识别能力与响应速度。

链接: https://arxiv.org/abs/2602.10593
作者: Mas Nurul Achmadiah,Novendra Setyawan,Achmad Arif Bryantono,Chi-Chia Sun,Wen-Kai Kuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 8 figures, 2 tables. Presented at 2024 International Electronics Symposium (IES). IEEE DOI: https://doi.org/10.1109/IES63037.2024.10665874

点击查看摘要

Abstract:Recently, Image processing has advanced Faster and applied in many fields, including health, industry, and transportation. In the transportation sector, object detection is widely used to improve security, for example, in traffic security and passenger crossings at train stations. Some accidents occur in the train crossing area at the station, like passengers uncarefully when passing through the yellow line. So further security needs to be developed. Additional technology is required to reduce the number of accidents. This paper focuses on passenger detection applications at train stations using YOLOX and Edge AI Accelerator hardware. the performance of the AI accelerator will be compared with Jetson Orin Nano. The experimental results show that the Hailo-8 AI hardware accelerator has higher accuracy than Jetson Orin Nano (improvement of over 12%) and has lower latency than Jetson Orin Nano (reduced 20 ms).

[CV-54] Enhancing YOLOv11n for Reliable Child Detection in Noisy Surveillance Footage

【速读】:该论文旨在解决在低质量监控视频中检测儿童的难题,这是实际应用中失踪儿童警报和托儿所监控系统的关键环节。其核心挑战包括遮挡、小目标尺寸、低分辨率、运动模糊及光照不良等常见于现有闭路电视(CCTV)基础设施的问题。解决方案的关键在于提出一个轻量级且可部署的检测流水线:首先基于高效的YOLOv11n架构,引入一种领域特定的数据增强策略,通过空间扰动(如部分可见性、截断和重叠)与光度退化(如光照变化和噪声)合成真实场景下的儿童图像;其次,在推理阶段集成切片辅助超推理(Slicing Aided Hyper Inference, SAHI),以提升对小尺度和部分遮挡实例的召回率。该方案在不改变模型结构的前提下,显著提升了检测性能(mAP@0.5 达到 0.967,mAP@0.5:0.95 达到 0.783),同时保持边缘设备兼容性和实时性能,适用于低成本或资源受限的工业监控部署场景。

链接: https://arxiv.org/abs/2602.10592
作者: Khanh Linh Tran,Minh Nguyen Dang,Thien Nguyen Trong,Hung Nguyen Quoc,Linh Nguyen Kieu
机构: PTIT(越南信息技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a practical and lightweight solution for enhancing child detection in low-quality surveillance footage, a critical component in real-world missing child alert and daycare monitoring systems. Building upon the efficient YOLOv11n architecture, we propose a deployment-ready pipeline that improves detection under challenging conditions including occlusion, small object size, low resolution, motion blur, and poor lighting commonly found in existing CCTV infrastructures. Our approach introduces a domain-specific augmentation strategy that synthesizes realistic child placements using spatial perturbations such as partial visibility, truncation, and overlaps, combined with photometric degradations including lighting variation and noise. To improve recall of small and partially occluded instances, we integrate Slicing Aided Hyper Inference (SAHI) at inference time. All components are trained and evaluated on a filtered, child-only subset of the Roboflow Daycare dataset. Compared to the baseline YOLOv11n, our enhanced system achieves a mean Average Precision at 0.5 IoU (mAP@0.5) of 0.967 and a mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95) of 0.783, yielding absolute improvements of 0.7 percent and 2.3 percent, respectively, without architectural changes. Importantly, the entire pipeline maintains compatibility with low-power edge devices and supports real-time performance, making it particularly well suited for low-cost or resource-constrained industrial surveillance deployments. The example augmented dataset and the source code used to generate it are available at: this https URL

[CV-55] Enhancing Underwater Images via Adaptive Semantic-aware Codebook Learning

【速读】:该论文旨在解决水下图像增强(Underwater Image Enhancement, UIE)中因场景内不同语义区域退化程度不一致而导致的颜色失真和细节丢失问题。现有方法通常采用全局单一模型处理整张图像,忽略了水下场景中退化特征的空间异质性。其解决方案的关键在于提出SUCode(Semantic-aware Underwater Codebook Network),通过语义感知的像素级离散码本表示实现自适应增强;该方法引入三阶段训练范式以避免伪真值污染,并结合门控通道注意力模块(Gated Channel Attention Module, GCAM)与频域感知特征融合(Frequency-Aware Feature Fusion, FAFF)协同整合通道与频率信息,从而更准确地恢复颜色和纹理细节。

链接: https://arxiv.org/abs/2602.10586
作者: Bosen Lin,Feng Gao,Yanwei Yu,Junyu Dong,Qian Du
机构: Ocean University of China (中国海洋大学); Mississippi State University (密西西比州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted for publication in IEEE TGRS 2026

点击查看摘要

Abstract:Underwater Image Enhancement (UIE) is an ill-posed problem where natural clean references are not available, and the degradation levels vary significantly across semantic regions. Existing UIE methods treat images with a single global model and ignore the inconsistent degradation of different scene components. This oversight leads to significant color distortions and loss of fine details in heterogeneous underwater scenes, especially where degradation varies significantly across different image regions. Therefore, we propose SUCode (Semantic-aware Underwater Codebook Network), which achieves adaptive UIE from semantic-aware discrete codebook representation. Compared with one-shot codebook-based methods, SUCode exploits semantic-aware, pixel-level codebook representation tailored to heterogeneous underwater degradation. A three-stage training paradigm is employed to represent raw underwater image features to avoid pseudo ground-truth contamination. Gated Channel Attention Module (GCAM) and Frequency-Aware Feature Fusion (FAFF) jointly integrate channel and frequency cues for faithful color restoration and texture recovery. Extensive experiments on multiple benchmarks demonstrate that SUCode achieves state-of-the-art performance, outperforming recent UIE methods on both reference and no-reference metrics. The code will be made public available at this https URL.

[CV-56] MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像隐喻理解(image implication comprehension)任务中的显著局限性,即难以准确捕捉视觉内容中蕴含的文化、情感与语境等复杂信息。这一挑战源于任务所需的多跳推理(multi-hop reasoning)、文化背景知识以及心智理论(Theory of Mind, ToM)能力,而现有模型普遍缺乏这些关键能力。解决方案的核心是提出首个端到端的视觉强化学习(Visual Reinforcement Learning, RL)框架——MetaphorStar,其包含三个核心组件:细粒度数据集TFQ-Data、基于策略梯度的视觉强化学习方法TFQ-GRPO,以及结构化的评估基准TFQ-Bench。通过在TFQ-Data上训练得到的MetaphorStar系列模型,在多个图像隐喻理解任务上平均性能提升达82.6%,并在多项指标上达到当前最优水平(SOTA),尤其在True-False类问题上超越了顶级闭源模型Gemini-3.0-pro,同时实验证明该训练过程还能增强模型的一般视觉理解与复杂推理能力。

链接: https://arxiv.org/abs/2602.10575
作者: Chenhao Zhang,Yazhe Niu,Hongsheng Li
机构: Shanghai AI Laboratory (上海人工智能实验室); Huazhong University of Science and Technology (华中科技大学); The Chinese University of Hong Kong MMLab (香港中文大学多媒体实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 14 pages, 4 figures, 11 tables; Code: this https URL , Model Dataset: this https URL

点击查看摘要

Abstract:Metaphorical comprehension in images remains a critical challenge for Nowadays AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task’s demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, significantly improves performance by an average of 82.6% on the image implication benchmarks. Compared with 20+ mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) on Multiple-Choice Question and Open-Style Question, significantly outperforms the top closed-source model Gemini-3.0-pro on True-False Question. Crucially, our experiments reveal that learning image implication tasks improves the general understanding ability, especially the complex visual reasoning ability. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. We open-sourced all model weights, datasets, and method code at this https URL. Comments: 14 pages, 4 figures, 11 tables; Code: this https URL, Model Dataset: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2602.10575 [cs.CV] (or arXiv:2602.10575v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.10575 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-57] C2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning ICRA2026

【速读】:该论文旨在解决当前基于大语言模型(LLM)的3D多模态模型(LMMs)中,因继承旋转位置编码(Rotary Position Embedding, RoPE)而引发的两个核心问题:一是将一维时间位置索引应用于视觉特征时破坏了列方向上的空间连续性,导致空间局部性丢失;二是RoPE遵循“时间上更接近的图像标记具有更强因果关联”的先验假设,造成注意力分配随序列长度增加而长期衰减,使模型逐渐忽略早期视觉标记。解决方案的关键在于提出C²RoPE,一种显式建模局部空间连续性和空间因果关系的改进型RoPE机制:首先构建包含1D时间位置与笛卡尔坐标系下空间坐标的三元组混合位置索引,再通过频率分配策略对三个索引维度进行联合编码;同时引入切比雪夫因果掩码(Chebyshev Causal Masking),基于2D空间中图像标记间的切比雪夫距离动态确定因果依赖关系,从而实现更精准的视觉特征定位与长序列注意力保持。

链接: https://arxiv.org/abs/2602.10551
作者: Guanting Ye,Qiyan Zhao,Wenhao Yu,Xiaofeng Zhang,Jianmin Ji,Yanyong Zhang,Ka-Veng Yuen
机构: University of Macau (澳门大学); Shanghai Jiaotong University (上海交通大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in ICRA 2026

点击查看摘要

Abstract:Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long-term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio-temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian-based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio-temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C^2RoPE’s effectiveness. The code is be available at this https URL.

[CV-58] Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

【速读】:该论文旨在解决弱监督多模态视频异常检测中文本模态潜力未被充分挖掘的问题,具体包括:通用语言模型难以捕捉异常特异性语义特征、相关异常描述数据稀缺,以及多模态融合中存在的冗余与不平衡问题。解决方案的关键在于提出一种新颖的文本引导框架:首先设计基于上下文学习的多阶段文本增强机制,生成高质量异常文本样本以微调文本特征提取器;其次构建多尺度瓶颈Transformer融合模块,通过压缩后的瓶颈token逐级融合多模态信息,有效缓解冗余与不平衡问题。

链接: https://arxiv.org/abs/2602.10549
作者: Shengyang Sun,Jiashen Hua,Junyi Feng,Xiaojin Gong
机构: Hangzhou Dianzi University (杭州电子科技大学); Alibaba Cloud (阿里云); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Transactions on Multimedia

点击查看摘要

Abstract:Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.

[CV-59] RealHD: A High-Quality Dataset for Robust Detection of State-of-the-Art AI-Generated Images ACM-MM2025

【速读】:该论文旨在解决当前AI生成图像检测模型泛化能力不足的问题,其根源在于现有数据集存在图像质量低、多样性差、提示词过于简单以及缺乏丰富标注信息等局限性。解决方案的关键在于构建一个高质量、大规模(超过73万张图像)且多类别均衡的数据集,涵盖真实图像与多种生成方式(如文本到图像生成、图像修复、图像增强和人脸替换)的AI生成图像,并为每张图像提供详细的生成方法标签及图像修复区域的二值掩码等丰富元数据。在此基础上,论文进一步提出一种基于图像噪声熵的轻量级检测方法,通过将原始图像转换为非局部均值(Non-Local Means, NLM)噪声的熵张量进行分类,实验表明该方法在训练数据上具有优异的泛化性能,为未来AI生成图像检测研究奠定了坚实基线。

链接: https://arxiv.org/abs/2602.10546
作者: Hanzhe Yu,Yun Ye,Jintao Rong,Qi Xuan,Chen Ma
机构: Zhejiang University of Technology (浙江工业大学); Intel Corporation (英特尔公司); Binjiang Institute of Artificial Intelligence, ZJU (滨江人工智能研究院,浙大)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published in the Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM 2025)

点击查看摘要

Abstract:The rapid advancement of generative AI has raised concerns about the authenticity of digital images, as highly realistic fake images can now be generated at low cost, potentially increasing societal risks. In response, several datasets have been established to train detection models aimed at distinguishing AI-generated images from real ones. However, existing datasets suffer from limited generalization, low image quality, overly simple prompts, and insufficient image diversity. To address these limitations, we propose a high-quality, large-scale dataset comprising over 730,000 images across multiple categories, including both real and AI-generated images. The generated images are synthesized via state-of-the-art methods, including text-to-image generation (guided by over 10,000 carefully designed prompts), image inpainting, image refinement, and face swapping. Each generated image is annotated with its generation method and category. Inpainting images further include binary masks to indicate inpainted regions, providing rich metadata for analysis. Compared to existing datasets, detection models trained on our dataset demonstrate superior generalization capabilities. Our dataset not only serves as a strong benchmark for evaluating detection methods but also contributes to advancing the robustness of AI-generated image detection techniques. Building upon this, we propose a lightweight detection method based on image noise entropy, which transforms the original image into an entropy tensor of Non-Local Means (NLM) noise before classification. Extensive experiments demonstrate that models trained on our dataset achieve strong generalization, and our method delivers competitive performance, establishing a solid baseline for future research. The dataset and source code are publicly available at this https URL.

[CV-60] MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在地图推理任务中表现不足的问题,特别是其在整合空间关系、视觉线索与真实世界背景知识方面的能力有限,且现有基准数据集多依赖于人工生成内容,缺乏对真实地理空间推理能力的全面评估。解决方案的关键在于构建一个大规模、基于真实地图的人工标注问答数据集——MapVerse,该数据集包含1,025张真实地图和11,837个由人类编写的问答对,覆盖10类地图类型及多种问题类别,从而为地图阅读、解释与多模态推理提供更丰富的评估场景。通过在此基准上对10个前沿VLM进行评测,研究揭示了当前模型在复杂空间推理任务上的显著性能差距,凸显了真实世界地图理解的重要性与挑战性。

链接: https://arxiv.org/abs/2602.10518
作者: Sharat Bhat,Harshita Khandelwal,Tushar Kataria,Vivek Gupta
机构: University of Southern California (南加州大学); University of California Los Angeles (加州大学洛杉矶分校); University of Utah (犹他大学); Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise-capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.

[CV-61] 3DXTalker: Unifying Identity Lip Sync Emotion and Spatial Dynamics in Expressive 3D Talking Avatars

【速读】:该论文旨在解决3D说话头像生成中表达性不足的问题,具体包括身份保真度低、唇部动作与语音同步差、情感表达不细腻以及头部姿态动态自然性弱等挑战。解决方案的关键在于三个方面:一是通过2D到3D的数据整理管道和解耦表示实现可扩展的身份建模,缓解数据稀缺并提升身份泛化能力;二是引入逐帧幅度和情感线索,超越传统语音嵌入,增强唇同步精度和表情细微调节能力;三是基于流匹配(flow-matching)的Transformer架构统一上述多模态信号,生成连贯的面部动态,并支持通过提示词控制实现自然的头部姿态运动与风格化调节。

链接: https://arxiv.org/abs/2602.10516
作者: Zhongju Wang,Zhenhong Sun,Beier Wang,Yifu Wang,Daoyi Dong,Huadong Mo,Hongdong Li
机构: University of New South Wales (新南威尔士大学); Australian National University (澳大利亚国立大学); University of Technology Sydney (悉尼科技大学); Vertex Lab (顶点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Audio-driven 3D talking avatar generation is increasingly important in virtual communication, digital humans, and interactive media, where avatars must preserve identity, synchronize lip motion with speech, express emotion, and exhibit lifelike spatial dynamics, collectively defining a broader objective of expressivity. However, achieving this remains challenging due to insufficient training data with limited subject identities, narrow audio representations, and restricted explicit controllability. In this paper, we propose 3DXTalker, an expressive 3D talking avatar through data-curated identity modeling, audio-rich representations, and spatial dynamics controllability. 3DXTalker enables scalable identity modeling via 2D-to-3D data curation pipeline and disentangled representations, alleviating data scarcity and improving identity generalization. Then, we introduce frame-wise amplitude and emotional cues beyond standard speech embeddings, ensuring superior lip synchronization and nuanced expression modulation. These cues are unified by a flow-matching-based transformer for coherent facial dynamics. Moreover, 3DXTalker also enables natural head-pose motion generation while supporting stylized control via prompt-based conditioning. Extensive experiments show that 3DXTalker integrates lip synchronization, emotional expression, and head-pose dynamics within a unified framework, achieves superior performance in 3D talking avatar generation.

[CV-62] 1%100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization

【速读】:该论文旨在解决视觉基础模型(Vision Foundation Models)在部署过程中因传统全量微调(Full Fine-Tuning)导致的计算成本高、效率低的问题。现有delta-tuning方法虽在大语言模型(Large Language Models, LLMs)中表现出色,但难以直接迁移至视觉模型的微调流程。其解决方案的关键在于提出一种基于复数线性投影优化(Complex Linear Projection Optimization, CoLin)的低秩复杂适配器(Low-rank Complex Adapter),该结构仅引入约1%的参数量,并通过理论证明低秩复合矩阵在训练中存在严重收敛问题,进而设计了针对性损失函数以提升训练稳定性与性能。实验表明,CoLin在目标检测、分割、图像分类及遥感场景下的旋转目标检测任务中,均以极低参数开销超越全量微调和经典delta-tuning方法,首次实现了高效且高性能的视觉模型适应策略。

链接: https://arxiv.org/abs/2602.10513
作者: Dongshuo Yin,Xue Yang,Deng-Ping Fan,Shi-Min Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying vision foundation models typically relies on efficient adaptation strategies, whereas conventional full fine-tuning suffers from prohibitive costs and low efficiency. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages cannot be directly transferred to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that introduces only about 1% parameters to the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (remote sensing scenario) demonstrate that CoLin outperforms both full fine-tuning and classical delta-tuning approaches with merely 1% parameters for the first time, providing a novel and efficient solution for deployment of vision foundation models. We release the code on this https URL.

[CV-63] Med-SegLens: Latent-Level Model Diffing for Interpretable Medical Image Segmentation

【速读】:该论文旨在解决现代分割模型虽具备强大预测性能但缺乏可解释性的问题,这限制了对模型失败原因的诊断、数据集偏移的理解以及进行系统性干预的能力。其解决方案的关键在于提出Med-SegLens框架,通过在SegFormer和U-Net等模型上训练稀疏自编码器(sparse autoencoders),将模型激活分解为可解释的潜在特征(latent features),并实现跨架构与跨数据集的潜在空间对齐。研究发现,尽管存在数据集偏移,仍存在一个稳定的共享表示基础;而具体的数据集差异则源于不同人群特异性潜在特征的依赖程度变化。这些潜在特征作为分割失败的因果瓶颈,通过对它们进行针对性干预即可修正错误、提升跨数据集适应能力,且无需重新训练模型——在70%的失败案例中恢复性能,并使Dice分数从39.4%提升至74.2%。

链接: https://arxiv.org/abs/2602.10508
作者: Salma J. Ahmed,Emad A. Mohammed,Azam Asilian Bidgoli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern segmentation models achieve strong predictive performance but remain largely opaque, limiting our ability to diagnose failures, understand dataset shift, or intervene in a principled manner. We introduce Med-SegLens, a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders trained on SegFormer and U-Net. Through cross-architecture and cross-dataset latent alignment across healthy, adult, pediatric, and sub-Saharan African glioma cohorts, we identify a stable backbone of shared representations, while dataset shift is driven by differential reliance on population-specific latents. We show that these latents act as causal bottlenecks for segmentation failures, and that targeted latent-level interventions can correct errors and improve cross-dataset adaption without retraining, recovering performance in 70% of failure cases and improving Dice score from 39.4% to 74.2%. Our results demonstrate that latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.

[CV-64] he Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation

【速读】:该论文旨在解决自动化垃圾分拣中因数据稀缺与多样性不足导致的模型泛化能力差的问题,其核心解决方案是构建并公开发布一个名为Garbage Dataset (GD) 的高质量、多样化的图像数据集,涵盖10类常见家庭废弃物(如金属、玻璃、塑料等),共计13,348张标注图像。关键在于通过多源采集(包括移动应用和网络资源)、严格的异常值检测与校验机制,以及对类别不平衡、背景复杂度(熵与显著性分析)的量化评估,确保数据质量;同时结合多种先进深度学习模型(如EfficientNetV2S、ResNet系列)进行性能与碳足迹双维度基准测试,揭示了模型精度与环境成本之间的权衡关系,为后续可持续智能垃圾分类研究提供了可靠的数据基础与技术参考。

链接: https://arxiv.org/abs/2602.10500
作者: Suman Kunwar
机构: DWaste, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages 10 figures and 1 table

点击查看摘要

Abstract:This study introduces the Garbage Dataset (GD), a publicly available image dataset designed to advance automated waste segregation through machine learning and computer vision. It’s a diverse dataset covering 10 common household waste categories: metal, glass, biological, paper, battery, trash, cardboard, shoes, clothes, and plastic. The dataset comprises 13,348 labeled images collected through multiple methods, including DWaste mobile app and curated web sources. Methods included rigorous validation through checksums and outlier detection, analysis of class imbalance and visual separability via PCA/t-SNE, and assessment of background complexity using entropy and saliency measures. The dataset was benchmarked using state-of-the-art deep learning models (EfficientNetV2M, EfficientNetV2S, MobileNet, ResNet50, ResNet101) evaluated on performance metrics and operational carbon emissions. Experiment results indicate EfficientNetV2S achieved the highest performance with 96.19% accuracy and a 0.96 F1-score, though with a moderate carbon cost. Analysis revealed inherent dataset characteristics including class imbalance, a skew toward high-outlier classes (plastic, cardboard, paper), and brightness variations that require consideration. The main conclusion is that GD provides a valuable, real-world benchmark for waste classification research while highlighting important challenges such as class imbalance, background complexity, and environmental trade-offs in model selection that must be addressed for practical deployment. The dataset is publicly released to support further research in environmental sustainability applications.

[CV-65] Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings ICLR2026

【速读】:该论文旨在解决多分辨率哈希编码(Multi-Resolution Hash Encoding, MHE)在神经场参数化中缺乏物理系统层面的严谨理解问题,从而导致超参数选择依赖启发式方法。其核心解决方案是引入一种基于点扩散函数(Point Spread Function, PSF)的分析框架,将PSF类比为系统的格林函数,以量化MHE的空间分辨率和保真度。关键发现包括:推导出无碰撞PSF的闭式近似,揭示了由网格引起的各向异性及对数空间分布;指出理想空间带宽(FWHM)由平均分辨率 NavgN_\text{avg} 决定,而非最细分辨率 N_\max,且优化动态会引发展宽效应;同时阐明有限哈希容量导致碰撞并引入斑点噪声、降低信噪比(SNR)。基于此理论洞察,作者提出旋转多分辨率哈希编码(Rotated MHE, R-MHE),通过在每层分辨率下对输入坐标施加不同旋转,有效缓解各向异性,同时保持原始MHE的效率与参数量。该研究建立了一套基于物理原理的分析与优化方法,超越了传统启发式策略。

链接: https://arxiv.org/abs/2602.10495
作者: Tianxiang Dai,Jonathan Fan
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026 (Poster); LaTeX source; 11 figures; 7 tables

点击查看摘要

Abstract:Multi-Resolution Hash Encoding (MHE), the foundational technique behind Instant Neural Graphics Primitives, provides a powerful parameterization for neural fields. However, its spatial behavior lacks rigorous understanding from a physical systems perspective, leading to reliance on heuristics for hyperparameter selection. This work introduces a novel analytical approach that characterizes MHE by examining its Point Spread Function (PSF), which is analogous to the Green’s function of the system. This methodology enables a quantification of the encoding’s spatial resolution and fidelity. We derive a closed-form approximation for the collision-free PSF, uncovering inherent grid-induced anisotropy and a logarithmic spatial profile. We establish that the idealized spatial bandwidth, specifically the Full Width at Half Maximum (FWHM), is determined by the average resolution, N_\textavg . This leads to a counterintuitive finding: the effective resolution of the model is governed by the broadened empirical FWHM (and therefore N_\textavg ), rather than the finest resolution N_\max , a broadening effect we demonstrate arises from optimization dynamics. Furthermore, we analyze the impact of finite hash capacity, demonstrating how collisions introduce speckle noise and degrade the Signal-to-Noise Ratio (SNR). Leveraging these theoretical insights, we propose Rotated MHE (R-MHE), an architecture that applies distinct rotations to the input coordinates at each resolution level. R-MHE mitigates anisotropy while maintaining the efficiency and parameter count of the original MHE. This study establishes a methodology based on physical principles that moves beyond heuristics to characterize and optimize MHE.

[CV-66] End-to-End LiDAR optimization for 3D point cloud registration BMVC

【速读】:该论文旨在解决传统LiDAR(光探测与测距)传感器在3D感知任务中,尤其是点云配准(point cloud registration)过程中存在的数据采集效率低和计算资源浪费问题。传统方法通常在固定传感器参数下进行数据采集,无法根据下游任务需求动态调整,导致点云密度、噪声和稀疏性难以平衡,进而影响配准精度与效率。解决方案的关键在于提出一种自适应LiDAR传感框架,通过将配准反馈嵌入感知闭环,联合优化传感器参数与配准超参数,从而实现点云质量的动态调控,在保证泛化能力的同时显著提升配准性能。

链接: https://arxiv.org/abs/2602.10492
作者: Siddhant Katyan,Marc-André Gardner,Jean-François Lalonde
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025. Project page: this https URL

点击查看摘要

Abstract:LiDAR sensors are a key modality for 3D perception, yet they are typically designed independently of downstream tasks such as point cloud registration. Conventional registration operates on pre-acquired datasets with fixed LiDAR configurations, leading to suboptimal data collection and significant computational overhead for sampling, noise filtering, and parameter tuning. In this work, we propose an adaptive LiDAR sensing framework that dynamically adjusts sensor parameters, jointly optimizing LiDAR acquisition and registration hyperparameters. By integrating registration feedback into the sensing loop, our approach optimally balances point density, noise, and sparsity, improving registration accuracy and efficiency. Evaluations in the CARLA simulation demonstrate that our method outperforms fixed-parameter baselines while retaining generalization abilities, highlighting the potential of adaptive LiDAR for autonomous perception and robotic applications.

[CV-67] owards Remote Sensing Change Detection with Neural Memory

【速读】:该论文旨在解决遥感变化检测中长期依赖关系建模困难与计算效率难以兼顾的问题。现有方法在捕捉全局上下文时存在局限,而基于Transformer的方法虽能有效建模长程依赖,但其二次复杂度导致扩展性差;同时,现有的线性注意力机制往往无法充分建模复杂的时空关系。解决方案的关键在于提出ChangeTitans框架:首先设计VTitans作为首个基于Titans的视觉骨干网络,通过引入神经记忆(neural memory)与分段局部注意力(segmented local attention)协同机制,在保持较低计算开销的同时捕获长距离依赖;其次构建层级式VTitans-Adapter以优化多尺度特征提取;最后提出TS-CBAM双流融合模块,利用跨时域注意力抑制伪变化并提升检测精度。实验表明,该方法在多个基准数据集上达到当前最优性能,且具备良好的计算效率。

链接: https://arxiv.org/abs/2602.10491
作者: Zhenyu Yang,Gensheng Pei,Yazhou Yao,Tianfei Zhou,Lizhong Ding,Fumin Shen
机构: Nanjing University of Science and Technology (南京理工大学); Beijing Institute of Technology (北京理工大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by IEEE Transactions on Geoscience Remote Sensing

点击查看摘要

Abstract:Remote sensing change detection is essential for environmental monitoring, urban planning, and related applications. However, current methods often struggle to capture long-range dependencies while maintaining computational efficiency. Although Transformers can effectively model global context, their quadratic complexity poses scalability challenges, and existing linear attention approaches frequently fail to capture intricate spatiotemporal relationships. Drawing inspiration from the recent success of Titans in language tasks, we present ChangeTitans, the Titans-based framework for remote sensing change detection. Specifically, we propose VTitans, the first Titans-based vision backbone that integrates neural memory with segmented local attention, thereby capturing long-range dependencies while mitigating computational overhead. Next, we present a hierarchical VTitans-Adapter to refine multi-scale features across different network layers. Finally, we introduce TS-CBAM, a two-stream fusion module leveraging cross-temporal attention to suppress pseudo-changes and enhance detection accuracy. Experimental evaluations on four benchmark datasets (LEVIR-CD, WHU-CD, LEVIR-CD+, and SYSU-CD) demonstrate that ChangeTitans achieves state-of-the-art results, attaining \textbf84.36% IoU and \textbf91.52% F1-score on LEVIR-CD, while remaining computationally competitive.

[CV-68] HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, VLMs)中存在的由语言偏见引发的幻觉问题,特别是那些在缺乏视觉证据时仍倾向于生成与场景高度典型但实际不存在的对象描述的现象。其解决方案的关键在于设计了一种新颖的合成管道,用于精确生成“诱发幻觉图像”(Hallucination-Inducing Images, HIIs),并基于这些图像揭示出一种一致的、受场景条件约束的幻觉模式;进一步利用HIIs构建高质量偏好数据集以实现细粒度对齐,从而有效降低幻觉发生率,同时保持模型的通用能力,在标准幻觉基准测试上相较当前最优方法提升达38%。

链接: https://arxiv.org/abs/2602.10425
作者: Yilin Yang,Zhenghui Guo,Yuke Wang,Omprakash Gnawali,Sheng Di,Chengming Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) have achieved remarkable success across diverse multimodal tasks but remain vulnerable to hallucinations rooted in inherent language bias. Despite recent progress, existing hallucination mitigation methods often overlook the underlying hallucination patterns driven by language bias. In this work, we design a novel pipeline to accurately synthesize Hallucination-Inducing Images (HIIs). Using synthesized HIIs, we reveal a consistent scene-conditioned hallucination pattern: models tend to mention objects that are highly typical of the scene even when visual evidence is removed. To quantify the susceptibility of VLMs to this hallucination pattern, we establish the Masked-Object-Hallucination (MOH) benchmark to rigorously evaluate existing state-of-the-art alignment frameworks. Finally, we leverage HIIs to construct high-quality preference datasets for fine-grained alignment. Experimental results demonstrate that our approach effectively mitigates hallucinations while preserving general model capabilities. Specifically, our method achieves up to a 38% improvement over the current state-of-the-art on standard hallucination benchmarks.

[CV-69] Comp2Comp: Open-Source Software with FDA-Cleared Artificial Intelligence Algorithms for Computed Tomography Image Analysis ALT

【速读】:该论文旨在解决当前医学影像分析中开放源代码工具缺乏严格验证、而商业解决方案又存在透明度不足的问题,从而导致算法在临床部署时可能出现意外失败。其关键解决方案是开发并验证了两个首批完全开源且获得美国食品药品监督管理局(FDA)510(k)认证的深度学习流程——腹主动脉定量(Abdominal Aortic Quantification, AAQ)和骨密度(Bone Mineral Density, BMD)估计,均集成于Comp2Comp软件包中,用于对已有的CT扫描进行机会性分析。AAQ可自动分割腹主动脉以评估动脉瘤大小,BMD则通过分割椎体来估算骨小梁密度及骨质疏松风险,并在多个外部机构的数据集上分别实现了高精度的测量误差(平均绝对误差1.57 mm)与分类性能(敏感性81.0%,特异性78.4%),证明其具备临床可用性。该方案显著提升了算法透明度,使医疗机构可在正式临床试验前测试模型性能,同时为研究者提供业界领先的方法学基础。

链接: https://arxiv.org/abs/2602.10364
作者: Adrit Rao,Malte Jensen,Andrea T. Fisher,Louis Blankemeier,Pauline Berens,Arash Fereydooni,Seth Lirette,Eren Alkan,Felipe C. Kitamura,Juan M. Zambrano Chaves,Eduardo Reis,Arjun Desai,Marc H. Willis,Jason Hom,Andrew Johnston,Leon Lenchik,Robert D. Boutin,Eduardo M. J. M. Farina,Augusto S. Serpa,Marcelo S. Takahashi,Jordan Perchik,Steven A. Rothenberg,Jamie L. Schroeder,Ross Filice,Leonardo K. Bittencourt,Hari Trivedi,Marly van Assen,John Mongan,Kimberly Kallianos,Oliver Aalami,Akshay S. Chaudhari
机构: Stanford Mussallem Center for Biodesign, Stanford, CA, USA; Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford, CA, USA; Stanford Department of Radiology, Stanford, CA, USA; Stanford Division of Vascular Surgery, Department of Surgery, Stanford, CA, USA; University of Mississippi Medical Center Department of Data Science, Jackson, MS, USA; Bunkerhill Health, San Francisco, CA, USA; Departamento de Diagnóstico por Imagem, Universidade Federal de São Paulo, São Paulo, Brazil; Microsoft Research, Redmond, WA, USA; Cartesia AI, San Francisco, CA, USA; Stanford University Human-Centered AI, Stanford, CA, USA; Diagnósticos da América S.A., Barueri, São Paulo, Brazil; Departamento de Diagnóstico por Imagem, Universidade de São Paulo, São Paulo, Brazil; Department of Radiology, University of North Carolina, Chapel Hill, NC, USA; Department of Radiology, University of Alabama, Birmingham, AL, USA; Department of Radiology, MedStar Georgetown University, Washington, DC, USA; Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, OH, USA; Department of Radiology, Emory University, Atlanta, GA, USA; Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA, USA; Weill Cancer Hub West, Stanford, CA and San Francisco, CA, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Adrit Rao, Malte Jensen, Andrea T. Fisher, Louis Blankemeier: Co-first authors. Oliver Aalami, Akshay S. Chaudhari: Co-senior authors

点击查看摘要

Abstract:Artificial intelligence allows automatic extraction of imaging biomarkers from already-acquired radiologic images. This paradigm of opportunistic imaging adds value to medical imaging without additional imaging costs or patient radiation exposure. However, many open-source image analysis solutions lack rigorous validation while commercial solutions lack transparency, leading to unexpected failures when deployed. Here, we report development and validation for two of the first fully open-sourced, FDA-510(k)-cleared deep learning pipelines to mitigate both challenges: Abdominal Aortic Quantification (AAQ) and Bone Mineral Density (BMD) estimation are both offered within the Comp2Comp package for opportunistic analysis of computed tomography scans. AAQ segments the abdominal aorta to assess aneurysm size; BMD segments vertebral bodies to estimate trabecular bone density and osteoporosis risk. AAQ-derived maximal aortic diameters were compared against radiologist ground-truth measurements on 258 patient scans enriched for abdominal aortic aneurysms from four external institutions. BMD binary classifications (low vs. normal bone density) were compared against concurrent DXA scan ground truths obtained on 371 patient scans from four external institutions. AAQ had an overall mean absolute error of 1.57 mm (95% CI 1.38-1.80 mm). BMD had a sensitivity of 81.0% (95% CI 74.0-86.8%) and specificity of 78.4% (95% CI 72.3-83.7%). Comp2Comp AAQ and BMD demonstrated sufficient accuracy for clinical use. Open-sourcing these algorithms improves transparency of typically opaque FDA clearance processes, allows hospitals to test the algorithms before cumbersome clinical pilots, and provides researchers with best-in-class methods.

[CV-70] Monte Carlo Maximum Likelihood Reconstruction for Digital Holography with Speckle

【速读】:该论文旨在解决相干成像(coherent imaging)中由散斑噪声(speckle)引起的图像重建难题,特别是针对数字全息(digital holography)系统中有限孔径建模下的最大似然估计(MLE)计算成本过高问题。传统MLE方法因需进行高维矩阵求逆而难以在高分辨率下应用,且常依赖简化假设以保证可计算性。解决方案的关键在于提出一种基于随机线性代数的优化框架——投影梯度下降与蒙特卡洛估计(PGD-MC),其通过利用传感矩阵的结构特性并结合共轭梯度法计算似然梯度,避免了显式的矩阵求逆操作,从而实现了无需牺牲物理准确性的高效MLE重建,显著提升了重建质量与计算效率,并支持高分辨率数字全息场景下的可扩展应用。

链接: https://arxiv.org/abs/2602.10344
作者: Xi Chen,Arian Maleki,Shirin Jalali
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In coherent imaging, speckle is statistically modeled as multiplicative noise, posing a fundamental challenge for image reconstruction. While maximum likelihood estimation (MLE) provides a principled framework for speckle mitigation, its application to coherent imaging system such as digital holography with finite apertures is hindered by the prohibitive cost of high-dimensional matrix inversion, especially at high resolutions. This computational burden has prevented the use of MLE-based reconstruction with physically accurate aperture modeling. In this work, we propose a randomized linear algebra approach that enables scalable MLE optimization without explicit matrix inversions in gradient computation. By exploiting the structural properties of sensing matrix and using conjugate gradient for likelihood gradient evaluation, the proposed algorithm supports accurate aperture modeling without the simplifying assumptions commonly imposed for tractability. We term the resulting method projected gradient descent with Monte Carlo estimation (PGD-MC). The proposed PGD-MC framework (i) demonstrates robustness to diverse and physically accurate aperture models, (ii) achieves substantial improvements in reconstruction quality and computational efficiency, and (iii) scales effectively to high-resolution digital holography. Extensive experiments incorporating three representative denoisers as regularization show that PGD-MC provides a flexible and effective MLE-based reconstruction framework for digital holography with finite apertures, consistently outperforming prior Plug-and-Play model-based iterative reconstruction methods in both accuracy and speed. Our code is available at: this https URL.

[CV-71] Conditional Uncertainty-Aware Political Deepfake Detection with Stochastic Convolutional Neural Networks

【速读】:该论文旨在解决政治类深度伪造图像(political deepfakes)检测中现有自动化系统缺乏不确定性感知能力的问题,即多数检测模型仅提供点预测而无法标识其输出在高风险政治场景下的不可靠性,从而影响内容审核的可信度与决策安全性。解决方案的关键在于引入基于随机卷积神经网络(stochastic convolutional neural networks)的条件性、不确定性感知检测框架,并通过校准质量(calibration quality)、适当评分规则(proper scoring rules)及置信度条件下的误差一致性等可观测指标评估不确定性,而非单纯依赖贝叶斯解释。研究进一步比较了多种不确定性估计方法(如蒙特卡洛Dropout、温度缩放和集成学习代理),并构建了一个面向政治领域的二分类图像数据集,最终表明校准后的概率输出与不确定性估计能够支撑风险导向的内容 moderation 策略,在特定置信区间内显著提升检测系统的操作价值。

链接: https://arxiv.org/abs/2602.10343
作者: Rafael-Petruţ Gardoş
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages, 12 figures, 18 tables

点击查看摘要

Abstract:Recent advances in generative image models have enabled the creation of highly realistic political deepfakes, posing risks to information integrity, public trust, and democratic processes. While automated deepfake detectors are increasingly deployed in moderation and investigative pipelines, most existing systems provide only point predictions and fail to indicate when outputs are unreliable, being an operationally critical limitation in high-stakes political contexts. This work investigates conditional, uncertainty-aware political deepfake detection using stochastic convolutional neural networks within an empirical, decision-oriented reliability framework. Rather than treating uncertainty as a purely Bayesian construct, it is evaluated through observable criteria, including calibration quality, proper scoring rules, and its alignment with prediction errors under both global and confidence-conditioned analyses. A politically focused binary image dataset is constructed via deterministic metadata filtering from a large public real-synthetic corpus. Two pretrained CNN backbones (ResNet-18 and EfficientNet-B4) are fully fine-tuned for classification. Deterministic inference is compared with single-pass stochastic prediction, Monte Carlo dropout with multiple forward passes, temperature scaling, and ensemble-based uncertainty surrogates. Evaluation reports ROC-AUC, thresholded confusion matrices, calibration metrics, and generator-disjoint out-of-distribution performance. Results demonstrate that calibrated probabilistic outputs and uncertainty estimates enable risk-aware moderation policies. A systematic confidence-band analysis further clarifies when uncertainty provides operational value beyond predicted confidence, delineating both the benefits and limitations of uncertainty-aware deepfake detection in political settings.

[CV-72] Flow Matching with Uncertainty Quantification and Guidance

【速读】:该论文旨在解决采样型生成模型(如流匹配,Flow Matching)在生成样本时可能出现的质量不一致或退化问题。其解决方案的关键在于提出了一种轻量级扩展方法——不确定性感知流匹配(Uncertainty-aware Flow Matching, UA-Flow),该方法在预测速度场(velocity field)的同时,显式建模异方差不确定性(heteroscedastic uncertainty),并通过流动动力学传播速度不确定性来估计每个样本的不确定性。这种不确定性估计作为个体样本的可靠性信号,并进一步用于引导生成过程,例如通过不确定性感知的分类器指导(uncertainty-aware classifier guidance)和无分类器指导(classifier-free guidance),从而提升生成质量。实验表明,UA-Flow生成的不确定性信号与样本保真度的相关性显著优于基线方法,且不确定性引导采样可进一步改善生成效果。

链接: https://arxiv.org/abs/2602.10326
作者: Juyeop Han,Lukas Lao Beyer,Sertac Karaman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the remarkable success of sampling-based generative models such as flow matching, they can still produce samples of inconsistent or degraded quality. To assess sample reliability and generate higher-quality outputs, we propose uncertainty-aware flow matching (UA-Flow), a lightweight extension of flow matching that predicts the velocity field together with heteroscedastic uncertainty. UA-Flow estimates per-sample uncertainty by propagating velocity uncertainty through the flow dynamics. These uncertainty estimates act as a reliability signal for individual samples, and we further use them to steer generation via uncertainty-aware classifier guidance and classifier-free guidance. Experiments on image generation show that UA-Flow produces uncertainty signals more highly correlated with sample fidelity than baseline methods, and that uncertainty-guided sampling further improves generation quality.

[CV-73] A Low-Rank Defense Method for Adversarial Attack on Diffusion Models ICME2025

【速读】:该论文旨在解决扩散模型(Diffusion Models)在微调过程中易受对抗攻击(Adversarial Attacks)影响的问题,以防止这些攻击算法对实际应用造成干扰。解决方案的关键在于提出一种名为低秩防御(Low-Rank Defense, LoRD)的高效防御策略,其核心是引入合并思想(merging idea)与平衡参数,并结合低秩适配(Low-Rank Adaptation, LoRA)模块,实现对对抗样本的检测与防御。LoRD构建了一个防御流水线,通过在微调阶段同时使用对抗样本和干净样本训练,使潜在扩散模型(Latent Diffusion Models, LDMs)仍能生成高质量图像,且显著优于基线方法。

链接: https://arxiv.org/abs/2602.10319
作者: Jiaxuan Zhu,Siyu Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME2025

点击查看摘要

Abstract:Recently, adversarial attacks for diffusion models as well as their fine-tuning process have been developed rapidly. To prevent the abuse of these attack algorithms from affecting the practical application of diffusion models, it is critical to develop corresponding defensive strategies. In this work, we propose an efficient defensive strategy, named Low-Rank Defense (LoRD), to defend the adversarial attack on Latent Diffusion Models (LDMs). LoRD introduces the merging idea and a balance parameter, combined with the low-rank adaptation (LoRA) modules, to detect and defend the adversarial samples. Based on LoRD, we build up a defense pipeline that applies the learned LoRD modules to help diffusion models defend against attack algorithms. Our method ensures that the LDM fine-tuned on both adversarial and clean samples can still generate high-quality images. To demonstrate the effectiveness of our approach, we conduct extensive experiments on facial and landscape images, and our method shows significantly better defense performance compared to the baseline methods.

[CV-74] ERGO: Excess-Risk-Guided Optimization for High-Fidelity Monocular 3D Gaussian Splatting

【速读】:该论文旨在解决从单张图像生成3D内容时面临的几何与纹理信息缺失问题,尤其针对由合成视图提供的监督信号中存在的几何不一致性和纹理错位导致的重建误差放大问题。解决方案的关键在于提出一种基于过量风险分解(excess risk decomposition)的自适应优化框架ERGO,其将3D Gaussian splatting中的优化损失分解为可优化的过量风险(quantifies the suboptimality gap between current and optimal parameters)与贝叶斯误差(Bayes error,建模合成视图中固有不可约噪声),从而动态估计每视角的过量风险并自适应调整损失权重;同时引入几何感知和纹理感知的目标函数,构建全局-局部协同优化机制,显著提升了对监督噪声的鲁棒性,并同步改善了重建结果的几何保真度与纹理质量。

链接: https://arxiv.org/abs/2602.10278
作者: Zehua Ma,Hanhui Li,Zhenyu Xie,Xiaonan Luo,Michael Kampffmeyer,Feng Gao,Xiaodan Liang
机构: Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区); Peking University (北京大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Guilin University of Electronic Technology (桂林电子科技大学); UiT The Arctic University of Norway (挪威北极大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generating 3D content from a single image remains a fundamentally challenging and ill-posed problem due to the inherent absence of geometric and textural information in occluded regions. While state-of-the-art generative models can synthesize auxiliary views to provide additional supervision, these views inevitably contain geometric inconsistencies and textural misalignments that propagate and amplify artifacts during 3D reconstruction. To effectively harness these imperfect supervisory signals, we propose an adaptive optimization framework guided by excess risk decomposition, termed ERGO. Specifically, ERGO decomposes the optimization losses in 3D Gaussian splatting into two components, i.e., excess risk that quantifies the suboptimality gap between current and optimal parameters, and Bayes error that models the irreducible noise inherent in synthesized views. This decomposition enables ERGO to dynamically estimate the view-specific excess risk and adaptively adjust loss weights during optimization. Furthermore, we introduce geometry-aware and texture-aware objectives that complement the excess-risk-derived weighting mechanism, establishing a synergistic global-local optimization paradigm. Consequently, ERGO demonstrates robustness against supervision noise while consistently enhancing both geometric fidelity and textural quality of the reconstructed 3D content. Extensive experiments on the Google Scanned Objects dataset and the OmniObject3D dataset demonstrate the superiority of ERGO over existing state-of-the-art methods.

[CV-75] Colorimeter-Supervised Skin Tone Estimation from Dermatoscopic Images for Fairness Auditing

【速读】:该论文旨在解决当前基于神经网络的皮肤镜图像诊断模型在不同肤色群体中存在性能差异的问题,其核心挑战在于公共皮肤镜数据集中缺乏可靠的肤色标注。解决方案的关键在于开发了两个神经网络模型:一是通过序数回归(ordinal regression)预测Fitzpatrick皮肤类型,二是通过颜色回归(color regression)估计个体体型角度(Individual Typology Angle, ITA),并以现场采集的Fitzpatrick标签和色度计测量值作为监督信号进行训练。此外,模型还利用大量合成与真实皮肤镜及临床图像进行预训练,显著提升了肤色估计的准确性,且首次在皮肤镜图像上验证了ITA预测与色度计测量的一致性,为肤色相关的公平性审计提供了可量化、可复现的工具。

链接: https://arxiv.org/abs/2602.10265
作者: Marin Benčević,Krešimir Romić,Ivana Hartmann Tolić,Irena Galić
机构: University of Rijeka (里耶卡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint submitted to Computer Methods and Programs in Biomedicine

点击查看摘要

Abstract:Neural-network-based diagnosis from dermatoscopic images is increasingly used for clinical decision support, yet studies report performance disparities across skin tones. Fairness auditing of these models is limited by the lack of reliable skin-tone annotations in public dermatoscopy datasets. We address this gap with neural networks that predict Fitzpatrick skin type via ordinal regression and the Individual Typology Angle (ITA) via color regression, using in-person Fitzpatrick labels and colorimeter measurements as targets. We further leverage extensive pretraining on synthetic and real dermatoscopic and clinical images. The Fitzpatrick model achieves agreement comparable to human crowdsourced annotations, and ITA predictions show high concordance with colorimeter-derived ITA, substantially outperforming pixel-averaging approaches. Applying these estimators to ISIC 2020 and MILK10k, we find that fewer than 1% of subjects belong to Fitzpatrick types V and VI. We release code and pretrained models as an open-source tool for rapid skin-tone annotation and bias auditing. This is, to our knowledge, the first dermatoscopic skin-tone estimation neural network validated against colorimeter measurements, and it supports growing evidence of clinically relevant performance gaps across skin-tone groups.

[CV-76] PMMA: The Polytechnique Montreal Mobility Aids Dataset

【速读】:该论文旨在解决现有行人检测数据集缺乏对使用助行设备(如轮椅、拐杖和助行器)的行人类别覆盖的问题,从而提升自动驾驶系统或智能监控场景中对多样化行人行为的感知能力。解决方案的关键在于构建了一个名为PMMA的新颖行人检测数据集,该数据集在户外环境中采集,包含九类使用不同助行设备的行人(如推空轮椅者、推载人轮椅者等),并在此基础上评估了七种主流目标检测模型(如YOLOX、Deformable DETR和Faster R-CNN)及三种跟踪算法(ByteTrack、BOT-SORT、OC-SORT),结果表明YOLOX、Deformable DETR和Faster R-CNN在检测性能上表现最优,为后续研究提供了基准与工具支持。

链接: https://arxiv.org/abs/2602.10259
作者: Qingwu Liu,Nicolas Saunier,Guillaume-Alexandre Bilodeau
机构: Polytechnique Montréal (蒙特利尔工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to the journal IEEE Transactions on Intelligent Transportation Systems, under review

点击查看摘要

Abstract:This study introduces a new object detection dataset of pedestrians using mobility aids, named PMMA. The dataset was collected in an outdoor environment, where volunteers used wheelchairs, canes, and walkers, resulting in nine categories of pedestrians: pedestrians, cane users, two types of walker users, whether walking or resting, five types of wheelchair users, including wheelchair users, people pushing empty wheelchairs, and three types of users pushing occupied wheelchairs, including the entire pushing group, the pusher and the person seated on the wheelchair. To establish a benchmark, seven object detection models (Faster R-CNN, CenterNet, YOLOX, DETR, Deformable DETR, DINO, and RT-DETR) and three tracking algorithms (ByteTrack, BOT-SORT, and OC-SORT) were implemented under the MMDetection framework. Experimental results show that YOLOX, Deformable DETR, and Faster R-CNN achieve the best detection performance, while the differences among the three trackers are relatively small. The PMMA dataset is publicly available at this https URL, and the video processing and model training code is available at this https URL.

[CV-77] XSPLAIN: XAI-enabling Splat-based Prototype Learning for Attribute-aware INterpretability

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在分类任务中缺乏可解释性的问题,尤其是现有解释方法难以捕捉高斯原语(Gaussian primitives)的体积一致性,导致解释结果模糊且不可靠。其解决方案的关键在于提出 XSPLAIN——首个面向3DGS分类任务的前向(ante-hoc)原型驱动型可解释性框架。该框架采用体素聚合的 PointNet 主干网络,并引入一种可逆正交变换,实现特征通道解耦以增强可解释性,同时严格保留原始决策边界;解释基于代表性训练样本,支持直观的“this looks like that”推理方式,在不损失分类性能的前提下显著提升用户信任度。

链接: https://arxiv.org/abs/2602.10239
作者: Dominik Galus,Julia Farganus,Tymoteusz Zapala,Mikołaj Czachorowski,Piotr Borycki,Przemysław Spurek,Piotr Syga
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has rapidly become a standard for high-fidelity 3D reconstruction, yet its adoption in multiple critical domains is hindered by the lack of interpretability of the generation models as well as classification of the Splats. While explainability methods exist for other 3D representations, like point clouds, they typically rely on ambiguous saliency maps that fail to capture the volumetric coherence of Gaussian primitives. We introduce XSPLAIN, the first ante-hoc, prototype-based interpretability framework designed specifically for 3DGS classification. Our approach leverages a voxel-aggregated PointNet backbone and a novel, invertible orthogonal transformation that disentangles feature channels for interpretability while strictly preserving the original decision boundaries. Explanations are grounded in representative training examples, enabling intuitive ``this looks like that’’ reasoning without any degradation in classification performance. A rigorous user study (N=51) demonstrates a decisive preference for our approach: participants selected XSPLAIN explanations 48.4% of the time as the best, significantly outperforming baselines (p0.001) , showing that XSPLAIN provides transparency and user trust. The source code for this work is available at: this https URL

[CV-78] DEGMC: Denoising Diffusion Models Based on Riemannian Equivariant Group Morphological Convolutions

【速读】:该论文旨在解决当前去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPM)中的两个关键问题:一是几何关键特征提取能力不足,二是网络结构缺乏群等变性(equivariance)。针对U-Net架构仅具备平移等变性的局限,作者提出一种结合欧几里得群(Euclidean group)对称性的几何方法,该群包含旋转、反射和排列变换。解决方案的核心在于引入黎曼流形上的群形态卷积(group morphological convolutions),其基于一阶哈密顿-雅可比型偏微分方程(Hamilton-Jacobi-type PDEs)的粘性解,实现多尺度形态学膨胀与腐蚀操作;并通过在模型中加入对流项并采用特征线法求解,显著提升了非线性建模能力、细长几何结构的表达精度,并将对称性有效嵌入学习过程中,实验表明在MNIST、RotoMNIST和CIFAR-10数据集上相较基线DDPM有明显性能提升。

链接: https://arxiv.org/abs/2602.10221
作者: El Hadji S. Diop,Thierno Fall,Mohamed Daoudi
机构: University Iba Der Thiam (大学伊巴德雷西亚姆); Institut Mines-Telecom Nord Europe (欧洲电信矿业学院北欧中心); CNRS Centrale Lille (法国国家科学研究中心中央里尔); University of Lille (里尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we address two major issues in recent Denoising Diffusion Probabilistic Models (DDPM): \bf 1) geometric key feature extraction and \bf 2) network equivariance. Since the DDPM prediction network relies on the U-net architecture, which is theoretically only translation equivariant, we introduce a geometric approach combined with an equivariance property of the more general Euclidean group, which includes rotations, reflections, and permutations. We introduce the notion of group morphological convolutions in Riemannian manifolds, which are derived from the viscosity solutions of first-order Hamilton-Jacobi-type partial differential equations (PDEs) that act as morphological multiscale dilations and erosions. We add a convection term to the model and solve it using the method of characteristics. This helps us better capture nonlinearities, represent thin geometric structures, and incorporate symmetries into the learning process. Experimental results on the MNIST, RotoMNIST, and CIFAR-10 datasets show noticeable improvements compared to the baseline DDPM model.

[CV-79] When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

【速读】:该论文旨在解决视觉引导型图像编辑模型(vision-prompt editing models)中存在的新型安全漏洞问题,即攻击者可通过纯视觉输入(如标记、箭头或视觉提示)发起“视觉中心越狱攻击”(Vision-Centric Jailbreak Attack, VJA),从而诱导模型生成恶意内容。此类攻击将攻击面从文本扩展至视觉域,且目前尚无有效防御机制。解决方案的关键在于提出一种无需训练的防御方法——基于内省式多模态推理(introspective multimodal reasoning),该方法在不依赖额外保护模型的前提下,显著提升模型安全性,且计算开销可忽略不计,使性能较差的模型安全水平达到与商用系统相当的程度。

链接: https://arxiv.org/abs/2602.10179
作者: Jiacheng Hou,Yining Sun,Ruochong Jin,Haochen Han,Fangming Liu,Wai Kin Victor Chan,Alex Jinpeng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project homepage: this https URL

点击查看摘要

Abstract:Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.

[CV-80] ArtisanGS: Interactive Tools for Gaussian Splat Selection with AI and Human in the Loop

【速读】:该论文旨在解决从真实场景捕获的3D高斯斑点(3D Gaussian Splats, 3DGS)表示中难以提取可用物体以及缺乏可控编辑手段的问题。现有方法多聚焦于自动化或高层级编辑,而忽视了用户交互的灵活性与精确性。其解决方案的关键在于提出一套以灵活的高斯斑点选择与分割为核心的交互式工具集,其中包含一种快速的AI驱动方法,可将用户引导的2D选择掩码传播至3DGS空间,实现高效且可控的三维分割;同时结合手动选择与分割工具,使用户能够在未经优化的任意自然场景中实现任意二值分割,并进一步基于自定义视频扩散模型开发了用户引导的局部编辑流程,从而赋予用户对AI修改区域的直接控制权。

链接: https://arxiv.org/abs/2602.10173
作者: Clement Fuji Tsang,Anita Hu,Or Perel,Carsten Kolve,Maria Shugrina
机构: NVIDIA(英伟达); University of Toronto(多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, includes supplementary material

点击查看摘要

Abstract:Representation in the family of 3D Gaussian Splats (3DGS) are growing into a viable alternative to traditional graphics for an expanding number of application, including recent techniques that facilitate physics simulation and animation. However, extracting usable objects from in-the-wild captures remains challenging and controllable editing techniques for this representation are limited. Unlike the bulk of emerging techniques, focused on automatic solutions or high-level editing, we introduce an interactive suite of tools centered around versatile Gaussian Splat selection and segmentation. We propose a fast AI-driven method to propagate user-guided 2D selection masks to 3DGS selections. This technique allows for user intervention in the case of errors and is further coupled with flexible manual selection and segmentation tools. These allow a user to achieve virtually any binary segmentation of an unstructured 3DGS scene. We evaluate our toolset against the state-of-the-art for Gaussian Splat selection and demonstrate their utility for downstream applications by developing a user-guided local editing approach, leveraging a custom Video Diffusion Model. With flexible selection tools, users have direct control over the areas that the AI can modify. Our selection and editing tools can be used for any in-the-wild capture without additional optimization.

[CV-81] AD2: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems WACV2026

【速读】:该论文旨在解决端到端自动驾驶系统在面对黑盒对抗威胁时的鲁棒性不足问题,尤其关注视觉感知链路中的多种攻击向量对驾驶决策的影响。其关键解决方案是提出一种基于注意力机制的轻量级攻击检测模型AD²,该模型通过捕捉时空一致性来识别物理层(如声波引起的模糊)、电磁干扰和数字扰动等多类对抗攻击,从而提升自动驾驶系统的安全性与可靠性。

链接: https://arxiv.org/abs/2602.10160
作者: Ishan Sahu,Somnath Hazra,Somak Aditya,Soumyajit Dey
机构: Indian Institute of Technology Kharagpur (印度理工学院克哈格普尔分校); TCS Research, India (塔塔咨询服务公司研究部)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to WACV 2026

点击查看摘要

Abstract:End-to-end autonomous driving systems have achieved significant progress, yet their adversarial robustness remains largely underexplored. In this work, we conduct a closed-loop evaluation of state-of-the-art autonomous driving agents under black-box adversarial threat models in CARLA. Specifically, we consider three representative attack vectors on the visual perception pipeline: (i) a physics-based blur attack induced by acoustic waves, (ii) an electromagnetic interference attack that distorts captured images, and (iii) a digital attack that adds ghost objects as carefully crafted bounded perturbations on images. Our experiments on two advanced agents, Transfuser and Interfuser, reveal severe vulnerabilities to such attacks, with driving scores dropping by up to 99% in the worst case, raising valid safety concerns. To help mitigate such threats, we further propose a lightweight Attack Detection model for Autonomous Driving systems (AD ^2 ) based on attention mechanisms that capture spatial-temporal consistency. Comprehensive experiments across multi-camera inputs on CARLA show that our detector achieves superior detection capability and computational efficiency compared to existing approaches.

[CV-82] Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

【速读】:该论文旨在解决传统视频检索基准在开放网络环境下无法有效模拟真实世界中基于模糊、多维记忆进行视频搜索的问题。现有方法通常依赖于精确描述与封闭视频池的匹配,难以应对用户以不完整或非结构化线索(如情感印象、关键时刻、时间背景和听觉记忆)进行检索的实际场景。解决方案的关键在于提出RVMS-Bench这一综合性评估系统,其包含1,440个来自真实开放网络视频的样本,覆盖20类主题和4种时长组别,并采用分层描述框架(Global Impression, Key Moment, Temporal Context, Auditory Memory)来模拟多维度搜索线索;同时引入RACLO代理框架,通过溯因推理(abductive reasoning)模拟人类“回忆-搜索-验证”的认知过程,从而显著提升模型在模糊记忆驱动下的视频检索与片段定位能力。实验表明,当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在该任务上仍存在明显不足,凸显了本研究对推动视频检索在真实无结构场景中鲁棒性发展的价值。

链接: https://arxiv.org/abs/2602.10159
作者: Tao Yu,Yujia Yang,Haopeng Jin,Junhao Gong,Xinlong Chen,Yuxuan Zhou,Shanbin Zhang,Jiabing Yang,Xinming Wang,Hongzhu Yi,Ping Nie,Kai Zou,Zhang Zhang,Yan Huang,Liang Wang,Yeshani,Ruiwen Tao,Jin Ma,Haijin Liang,Jinwen Luo
机构: CASIA(中国科学院自动化研究所); UCAS(中国科学院大学); Tencent(腾讯); Peking University(北京大学); Tsinghua University(清华大学); Netmind.AI(Netmind.AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 49 pages, 9 figures

点击查看摘要

Abstract:Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present \textbfRVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of \textbf1,440 samples spanning \textbf20 diverse categories and \textbffour duration groups, sourced from \textbfreal-world open-web videos. RVMS-Bench utilizes a hierarchical description framework encompassing \textbfGlobal Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose \textbfRACLO, an agentic framework that employs abductive reasoning to simulate the human ``Recall-Search-Verify’’ cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.

[CV-83] MPA: Multimodal Prototype Augmentation for Few-Shot Learning AAAI2026

【速读】:该论文旨在解决当前少样本学习(Few-shot Learning, FSL)方法主要依赖单一视觉模态、仅从原始支持图像中直接计算原型而导致语义信息不足的问题。为提升模型在新类别识别中的泛化能力,作者提出了一种多模态原型增强框架(Multimodal Prototype Augmentation FSL, MPA),其核心创新在于三个模块:基于大语言模型的多变体语义增强(LLM-based Multi-Variant Semantic Enhancement, LMSE),用于生成多样化类别描述以丰富支持集语义;分层多视角增强(Hierarchical Multi-View Augmentation, HMA),通过自然和多视角数据增强提升特征多样性;以及自适应不确定类吸收器(Adaptive Uncertain Class Absorber, AUCA),通过插值与高斯采样引入不确定类来建模不确定性并吸收模糊样本。这一系列设计显著提升了FSL在单域与跨域场景下的性能表现。

链接: https://arxiv.org/abs/2602.10143
作者: Liwen Wu,Wei Wang,Lei Zhao,Zhan Gao,Qika Lin,Shaowen Yao,Zuozhu Liu,Bin Pu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by AAAI 2026

点击查看摘要

Abstract:Recently, few-shot learning (FSL) has become a popular task that aims to recognize new classes from only a few labeled examples and has been widely applied in fields such as natural science, remote sensing, and medical images. However, most existing methods focus only on the visual modality and compute prototypes directly from raw support images, which lack comprehensive and rich multimodal information. To address these limitations, we propose a novel Multimodal Prototype Augmentation FSL framework called MPA, including LLM-based Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and an Adaptive Uncertain Class Absorber (AUCA). LMSE leverages large language models to generate diverse paraphrased category descriptions, enriching the support set with additional semantic cues. HMA exploits both natural and multi-view augmentations to enhance feature diversity (e.g., changes in viewing distance, camera angles, and lighting conditions). AUCA models uncertainty by introducing uncertain classes via interpolation and Gaussian sampling, effectively absorbing uncertain samples. Extensive experiments on four single-domain and six cross-domain FSL benchmarks demonstrate that MPA achieves superior performance compared to existing state-of-the-art methods across most settings. Notably, MPA surpasses the second-best method by 12.29% and 24.56% in the single-domain and cross-domain setting, respectively, in the 5-way 1-shot setting.

[CV-84] Multi-encoder ConvNeXt Network with Smooth Attentional Feature Fusion for Multispectral Semantic Segmentation

【速读】:该论文旨在解决多光谱遥感影像中土地覆盖分割(land cover segmentation)的精度与效率问题,尤其针对不同波段组合(如RGB+NIR与加入NDVI/NDWI指数的6通道输入)下模型性能不稳定及特征融合不足的挑战。解决方案的关键在于提出一种多分支编码器-解码器架构MeCSAFNet,其核心创新包括:1)双ConvNeXt编码器分别处理可见光与非可见光通道,实现光谱信息的独立提取;2)专用融合解码器在多尺度上整合中间特征,结合细粒度空间信息与高层语义光谱表示;3)引入CBAM注意力机制增强特征融合能力,并采用ASAU激活函数提升优化稳定性与效率。该设计使模型在多种光谱配置下均表现出显著优于现有方法的分割性能,且紧凑版本具备低训练开销和推理成本,适用于资源受限场景。

链接: https://arxiv.org/abs/2602.10137
作者: Leo Thomas Ramos,Angel D. Sappa
机构: Universitat Autònoma de Barcelona (巴塞罗那自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This is an extended version of the study presented at IEEE SoutheastCon2025. It presents substantial new content and original contributions beyond the previous version, including an expanded and enhanced background, new architectural refinements, additional experiments conducted on a broader range of datasets and experimental scenarios, and a more comprehensive analysis of results

点击查看摘要

Abstract:This work proposes MeCSAFNet, a multi-branch encoder-decoder architecture for land cover segmentation in multispectral imagery. The model separately processes visible and non-visible channels through dual ConvNeXt encoders, followed by individual decoders that reconstruct spatial information. A dedicated fusion decoder integrates intermediate features at multiple scales, combining fine spatial cues with high-level spectral representations. The feature fusion is further enhanced with CBAM attention, and the ASAU activation function contributes to stable and efficient optimization. The model is designed to process different spectral configurations, including a 4-channel (4c) input combining RGB and NIR bands, as well as a 6-channel (6c) input incorporating NDVI and NDWI indices. Experiments on the Five-Billion-Pixels (FBP) and Potsdam datasets demonstrate significant performance gains. On FBP, MeCSAFNet-base (6c) surpasses U-Net (4c) by +19.21%, U-Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74% in mIoU. On Potsdam, MeCSAFNet-large (4c) improves over DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU. The model also achieves consistent gains over several recent state-of-the-art approaches. Moreover, compact variants of MeCSAFNet deliver notable performance with lower training time and reduced inference cost, supporting their deployment in resource-constrained environments.

[CV-85] SceneSmith: Agent ic Generation of Simulation-Ready Indoor Scenes

【速读】:该论文旨在解决现有仿真环境在训练和评估家用机器人时无法充分模拟真实室内空间多样性与物理复杂性的问题,特别是当前场景合成方法生成的房间布局稀疏、缺乏密集杂物、可动家具及物理属性,难以支持机器人操作任务的真实测试。其解决方案的关键在于提出一种分层代理框架 SceneSmith,通过视觉语言模型(VLM)驱动的设计者(designer)、批评者(critic)和协调者(orchestrator)三类智能体协同工作,依次完成从建筑布局到家具布置再到小物件填充的多阶段场景构建;同时深度融合文本到3D生成、数据集检索与物理属性估计技术,从而生成包含3–6倍更多物体、碰撞率仅2%且96%物体在物理模拟中保持稳定的高保真室内环境,显著提升机器人策略评估的真实性与自动化程度。

链接: https://arxiv.org/abs/2602.09153
作者: Nicholas Pfaff,Thomas Cohn,Sergey Zakharov,Rick Cory,Russ Tedrake
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL

点击查看摘要

Abstract:Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages \unicodex2013 from architectural layout to furniture placement to small object population \unicodex2013 each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with 2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.

[CV-86] Enhancing IMU-Based Online Handwriting Recognition via Contrastive Learning with Zero Inference Overhead

【速读】:该论文旨在解决在边缘硬件上实现高效且高精度的在线手写识别(Online Handwriting Recognition, OHR)所面临的挑战,尤其是受限于内存资源时如何提升特征表示能力和识别准确率而不增加推理开销。解决方案的关键在于提出一种名为Error-enhanced Contrastive Handwriting Recognition (ECHWR) 的训练框架,其核心创新是引入一个临时辅助分支,在训练阶段通过双对比损失机制对传感器信号与语义文本嵌入进行对齐:一是批次内对比损失以实现跨模态对齐,二是新颖的基于错误的对比损失,用于区分正确信号与合成难负样本(synthetic hard negatives)。该辅助分支在训练完成后被移除,从而保证部署模型保持原有轻量高效的架构,同时显著降低字符错误率(Character Error Rate, CER),尤其在未见过的书写风格场景下表现出更强的鲁棒性。

链接: https://arxiv.org/abs/2602.07049
作者: Jindong Li,Dario Zanca,Vincent Christlein,Tim Hamann,Jens Barth,Peter Kämpf,Björn Eskofier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Online handwriting recognition using inertial measurement units opens up handwriting on paper as input for digital devices. Doing it on edge hardware improves privacy and lowers latency, but entails memory constraints. To address this, we propose Error-enhanced Contrastive Handwriting Recognition (ECHWR), a training framework designed to improve feature representation and recognition accuracy without increasing inference costs. ECHWR utilizes a temporary auxiliary branch that aligns sensor signals with semantic text embeddings during the training phase. This alignment is maintained through a dual contrastive objective: an in-batch contrastive loss for general modality alignment and a novel error-based contrastive loss that distinguishes between correct signals and synthetic hard negatives. The auxiliary branch is discarded after training, which allows the deployed model to keep its original, efficient architecture. Evaluations on the OnHW-Words500 dataset show that ECHWR significantly outperforms state-of-the-art baselines, reducing character error rates by up to 7.4% on the writer-independent split and 10.4% on the writer-dependent split. Finally, although our ablation studies indicate that solving specific challenges require specific architectural and objective configurations, error-based contrastive loss shows its effectiveness for handling unseen writing styles.

[CV-87] Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT

【速读】:该论文旨在解决基础模型(foundation models)在临床实践中因复合分布偏移(compound distribution shift)导致性能下降的问题,尤其关注罕见但高致死率的创伤性肠损伤(traumatic bowel injury)场景下,模型特异性(specificity)不足的成因。其关键解决方案在于通过分层评估不同负类群体的特异性表现,发现基础模型的特异性缺陷主要源于负类内部异质性(如合并实质性器官损伤患者),而非仅由类别不平衡引起;同时表明,随着监督训练的引入(从零样本到线性探针再到任务特定模型),对负类异质性的敏感性逐步降低,提示临床部署前需进行针对性适应训练以提升鲁棒性。

链接: https://arxiv.org/abs/2602.10359
作者: Jineel H Raythatha,Shuchang Ye,Jeremy Hsu,Jinman Kim
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Purpose: Translating foundation models into clinical practice requires evaluating their performance under compound distribution shift, where severe class imbalance coexists with heterogeneous imaging appearances. This challenge is relevant for traumatic bowel injury, a rare but high-mortality diagnosis. We investigated whether specificity deficits in foundation models are associated with heterogeneity in the negative class. Methods: This retrospective study used the multi-institutional, RSNA Abdominal Traumatic Injury CT dataset (2019-2023), comprising scans from 23 centres. Two foundation models (MedCLIP, zero-shot; RadDINO, linear probe) were compared against three task-specific approaches (CNN, Transformer, Ensemble). Models were trained on 3,147 patients (2.3% bowel injury prevalence) and evaluated on an enriched 100-patient test set. To isolate negative-class effects, specificity was assessed in patients without bowel injury who had concurrent solid organ injury (n=58) versus no abdominal pathology (n=50). Results: Foundation models achieved equivalent discrimination to task-specific models (AUC, 0.64-0.68 versus 0.58-0.64) with higher sensitivity (79-91% vs 41-74%) but lower specificity (33-50% vs 50-88%). All models demonstrated high specificity in patients without abdominal pathology (84-100%). When solid organ injuries were present, specificity declined substantially for foundation models (50-51 percentage points) compared with smaller reductions of 12-41 percentage points for task-specific models. Conclusion: Foundation models matched task-specific discrimination without task-specific training, but their specificity deficits were driven primarily by confounding negative-class heterogeneity rather than prevalence alone. Susceptibility to negative-class heterogeneity decreased progressively with labelled training, suggesting adaptation is required before clinical implementation.

[CV-88] Uncertainty-Aware Ordinal Deep Learning for cross-Dataset Diabetic Retinopathy Grading

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)严重程度自动分级中的两个核心问题:一是如何准确建模疾病进展的有序性(ordinal nature),二是如何在模型预测中提供可解释且可靠的不确定性估计,以提升临床应用中的可信度与鲁棒性。解决方案的关键在于提出一种基于证据狄利克雷(evidential Dirichlet)的序数回归框架,结合卷积主干网络与病灶查询注意力池化机制,通过引入序数证据损失函数并采用退火正则化策略,在多域数据集(APTOS、Messidor-2 和 EyePACS 子集)上实现高精度分类和良好的跨数据集泛化能力,同时为低置信度样本提供有意义的不确定性量化,从而推动DR自动分级系统向临床可靠方向发展。

链接: https://arxiv.org/abs/2602.10315
作者: Ali El Bellaj,Aya Benradi,Salman El Youssoufi,Taha El Marzouki,Mohammed-Amine Cheddadi
机构: Mississippi State University (密西西比州立大学); International University of Rabat (拉巴特国际大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diabetes mellitus is a chronic metabolic disorder characterized by persistent hyperglycemia due to insufficient insulin production or impaired insulin utilization. One of its most severe complications is diabetic retinopathy (DR), a progressive retinal disease caused by microvascular damage, leading to hemorrhages, exudates, and potential vision loss. Early and reliable detection of DR is therefore critical for preventing irreversible blindness. In this work, we propose an uncertainty-aware deep learning framework for automated DR severity grading that explicitly models the ordinal nature of disease progression. Our approach combines a convolutional backbone with lesion-query attention pooling and an evidential Dirichlet-based ordinal regression head, enabling both accurate severity prediction and principled estimation of predictive uncertainty. The model is trained using an ordinal evidential loss with annealed regularization to encourage calibrated confidence under domain shift. We evaluate the proposed method on a multi-domain training setup combining APTOS, Messidor-2, and a subset of EyePACS fundus datasets. Experimental results demonstrate strong cross-dataset generalization, achieving competitive classification accuracy and high quadratic weighted kappa on held-out test sets, while providing meaningful uncertainty estimates for low-confidence cases. These results suggest that ordinal evidential learning is a promising direction for robust and clinically reliable diabetic retinopathy grading. Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.10315 [eess.IV] (or arXiv:2602.10315v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2602.10315 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-89] A Systematic Review on Data-Driven Brain Deformation Modeling for Image-Guided Neurosurgery

【速读】:该论文旨在解决神经外科术中脑组织变形补偿问题,即手术操作和肿瘤切除导致的组织位移会使得术前规划图像与术中解剖结构失配,从而影响图像引导神经外科的准确性。其解决方案的关键在于系统性地综述2020年1月至2025年4月间基于人工智能(AI)的脑变形建模与校正方法,涵盖深度学习图像配准、直接变形场回归、合成驱动的多模态对齐、考虑切除区域缺失对应关系的架构设计以及融合生物力学先验的混合模型等策略,并评估各类方法在数据集使用、评价指标、验证协议及不确定性与泛化能力方面的表现。

链接: https://arxiv.org/abs/2602.10155
作者: Tiago Assis,Colin P. Galvin,Joshua P. Castillo,Nazim Haouchine,Marta Kersten-Oertel,Zeyu Gao,Mireia Crispin-Ortuzar,Stephen J. Price,Thomas Santarius,Yangming Ou,Sarah Frisken,Nuno C. Garcia,Alexandra J. Golby,Reuben Dorent,Ines P. Machado
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 7 figures, 3 tables. Submitted to Medical Image Analysis

点击查看摘要

Abstract:Accurate compensation of brain deformation is a critical challenge for reliable image-guided neurosurgery, as surgical manipulation and tumor resection induce tissue motion that misaligns preoperative planning images with intraoperative anatomy and longitudinal studies. In this systematic review, we synthesize recent AI-driven approaches developed between January 2020 and April 2025 for modeling and correcting brain deformation. A comprehensive literature search was conducted in PubMed, IEEE Xplore, Scopus, and Web of Science, with predefined inclusion and exclusion criteria focused on computational methods applied to brain deformation compensation for neurosurgical imaging, resulting in 41 studies meeting these criteria. We provide a unified analysis of methodological strategies, including deep learning-based image registration, direct deformation field regression, synthesis-driven multimodal alignment, resection-aware architectures addressing missing correspondences, and hybrid models that integrate biomechanical priors. We also examine dataset utilization, reported evaluation metrics, validation protocols, and how uncertainty and generalization have been assessed across studies. While AI-based deformation models demonstrate promising performance and computational efficiency, current approaches exhibit limitations in out-of-distribution robustness, standardized benchmarking, interpretability, and readiness for clinical deployment. Our review highlights these gaps and outlines opportunities for future research aimed at achieving more robust, generalizable, and clinically translatable deformation compensation solutions for neurosurgical guidance. By organizing recent advances and critically evaluating evaluation practices, this work provides a comprehensive foundation for researchers and clinicians engaged in developing and applying AI-based brain deformation methods.

[CV-90] URBAN-SPIN: A street-level bikeability index to inform design implementations in historical city centres

【速读】:该论文旨在解决历史城市中骑行体验评估缺乏系统性框架的问题,尤其在空间受限、难以进行大规模基础设施改造的背景下,如何通过精细化的街道类型(street typology)识别与感知维度整合来提升自行车友好度。其解决方案的关键在于构建一个以感知为导向、基于街道类型的多源数据融合框架,结合计算机视觉提取的细粒度街道景观指标、建成环境变量及受试者主观评价(来自平衡不完全区组设计问卷),形成具有类型敏感性的“自行车可通行指数”(Bikeability Index),从而实现对路段级骑行体验的量化分析与优化建议,且证明了微调式视觉重构即可显著改善感知舒适度,无需结构性大改。

链接: https://arxiv.org/abs/2602.10124
作者: Haining Ding,Chenxi Wang,Michal Gath-Morad
机构: Cambridge Cognitive Architecture, Department of Architecture, University of Cambridge, UK; NeuroCivitas Lab for NeuroArchitecture, Centre for Research in the Arts, Social Sciences and Humanities (CRASSH), University of Cambridge, UK
类目: Physics and Society (physics.soc-ph); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 32 pages, 10 figures

点击查看摘要

Abstract:Cycling is reported by an average of 35% of adults at least once per week across 28 countries, and as vulnerable road users directly exposed to their surroundings, cyclists experience the street at an intensity unmatched by other modes. Yet the street-level features that shape this experience remain under-analysed, particularly in historical urban contexts where spatial constraints rule out large-scale infrastructural change and where typological context is often overlooked. This study develops a perception-led, typology-based, and data-integrated framework that explicitly models street typologies and their sub-classifications to evaluate how visual and spatial configurations shape cycling experience. Drawing on the Cambridge Cycling Experience Video Dataset (CCEVD), a first-person and handlebar-mounted corpus developed in this study, we extract fine-grained streetscape indicators with computer vision and pair them with built-environment variables and subjective ratings from a Balanced Incomplete Block Design (BIBD) survey, thereby constructing a typology-sensitive Bikeability Index that integrates subjective and perceived dimensions with physical metrics for segment-level comparison. Statistical analysis shows that perceived bikeability arises from cumulative, context-specific interactions among features. While greenness and openness consistently enhance comfort and pleasure, enclosure, imageability, and building continuity display threshold or divergent effects contingent on street type and subtype. AI-assisted visual redesigns further demonstrate that subtle, targeted changes can yield meaningful perceptual gains without large-scale structural interventions. The framework offers a transferable model for evaluating and improving cycling conditions in heritage cities through perceptually attuned, typology-aware design strategies.

人工智能

[AI-0] Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning via Normalizing Flows

【速读】:该论文旨在解决层次化目标条件强化学习(Hierarchical Goal-conditioned Reinforcement Learning, H-GCRL)在离线或数据稀缺场景下存在的数据效率低和策略表达能力弱的问题。其核心解决方案是提出基于归一化流(Normalizing Flow)的层次隐式Q学习框架(NF-HIQL),通过在高层和低层策略中引入表达能力强的归一化流策略,实现可 tractable 的对数似然计算、高效采样以及对复杂多模态行为的建模能力。该方法在理论上提供了显式的KL散度边界和PAC风格的数据效率保证,实验证明其在多种长程任务中显著优于现有基线,展现出更强的鲁棒性和可扩展性。

链接: https://arxiv.org/abs/2602.11142
作者: Shaswat Garg,Matin Moezzi,Brandon Da Silva
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 3 figures, IEEE International Conference on Robotics and Automation 2026

点击查看摘要

Abstract:Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. In this work, Normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal gaussian policies with expressive normalizing flow policies at both the high- and low-levels of the hierarchy is introduced. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for Real-valued non-volume preserving (RealNVP) policies and PAC-style sample efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluted across diverse long-horizon tasks in locomotion, ball-dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.

[AI-1] FormalJudge: A Neuro-Symbolic Paradigm for Agent ic Oversight

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)代理在高风险场景中行为安全的保障问题,尤其是针对当前主流监督范式“LLM-as-a-Judge”所面临的根本性困境:如何让一个概率性系统可靠地监督另一个概率性系统,而不继承其潜在失败模式。解决方案的关键在于引入形式化验证(formal verification),通过构建一个神经符号框架(neuro-symbolic framework),采用双向“形式化思维”(Formal-of-Thought)架构:首先由LLMs自顶向下将人类意图分解为可验证的原子约束,再利用Dafny形式规范与Z3 Satisfiability Modulo Theories(SMT)求解器自底向上证明合规性,从而提供数学保证而非概率评分。此方法有效提升了多场景下的行为安全性、跨规模代理的泛化能力及迭代优化过程中的安全增益。

链接: https://arxiv.org/abs/2602.11136
作者: Jiayi Zhou,Yang Sheng,Hantao Lou,Yaodong Yang,Jie Fu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages

点击查看摘要

Abstract:As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing , a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 Satisfiability modulo theories solving, which produces mathematical guarantees rather than probabilistic scores. We validate across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.

[AI-2] Direct Learning of Calibration-Aware Uncertainty for Neural PDE Surrogates

【速读】:该论文旨在解决神经微分方程(Neural PDE)代理模型在数据稀缺或部分观测场景下,如何有效获取可校准的不确定性估计问题。现有方法依赖于集成重复、固定随机噪声(如Dropout)或事后校准,难以适应不同应用场景且需手动调参。其解决方案的关键在于提出一种“交叉正则化不确定性”(Cross-regularized uncertainty)框架:在训练过程中通过梯度路由至保留的正则化子集,同时优化主预测器(在训练子集上最小化预测误差)和低维不确定性控制参数(在正则化子集上减少训练-测试分布偏移),从而实现无需按场景调整噪声的自适应不确定性建模。该方法可在输出头、隐藏特征或算子特定组件(如谱模式)中学习连续噪声水平,在APEBench基准上的多尺度实验表明,所学预测分布具有更好的校准性,并且不确定性场域能聚焦于高误差区域。

链接: https://arxiv.org/abs/2602.11090
作者: Carlos Stein Brito
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation (stat.CO)
备注: 13 pages, 11 figures

点击查看摘要

Abstract:Neural PDE surrogates are often deployed in data-limited or partially observed regimes where downstream decisions depend on calibrated uncertainty in addition to low prediction error. Existing approaches obtain uncertainty through ensemble replication, fixed stochastic noise such as dropout, or post hoc calibration. Cross-regularized uncertainty learns uncertainty parameters during training using gradients routed through a held-out regularization split. The predictor is optimized on the training split for fit, while low-dimensional uncertainty controls are optimized on the regularization split to reduce train-test mismatch, yielding regime-adaptive uncertainty without per-regime noise tuning. The framework can learn continuous noise levels at the output head, within hidden features, or within operator-specific components such as spectral modes. We instantiate the approach in Fourier Neural Operators and evaluate on APEBench sweeps over observed fraction and training-set size. Across these sweeps, the learned predictive distributions are better calibrated on held-out splits and the resulting uncertainty fields concentrate in high-error regions in one-step spatial diagnostics.

[AI-3] General Flexible f-divergence for Challenging Offline RL Datasets with Low Stochasticity and Diverse Behavior Policies AAMAS2026

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中因数据集多样性不足或来自多个不同专家水平的行为策略而导致的策略学习困难问题。具体而言,受限于有限探索的数据分布,传统方法在估计Q值或V值时易出现偏差,而单纯约束策略靠近行为策略又可能过于保守,难以有效提升性能。其解决方案的关键在于通过构建一个更通用的线性规划(Linear Programming, LP)形式,揭示f-散度(f-divergence)与贝尔曼残差(Bellman residual)优化约束之间的理论联系,并提出一种灵活的f-散度函数形式,能够根据离线训练数据自适应地调整约束强度,从而在RL目标与行为策略约束之间实现平衡。实验表明,该方法在MuJoCo、Fetch和AdroitHand等环境中显著提升了从挑战性数据集中学习的性能。

链接: https://arxiv.org/abs/2602.11087
作者: Jianxun Wang,Grant C. Forbes,Leonardo Villalobos-Arias,David L. Roberts
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Extended version of the full paper with the appendix accepted at AAMAS 2026

点击查看摘要

Abstract:Offline RL algorithms aim to improve upon the behavior policy that produces the collected data while constraining the learned policy to be within the support of the dataset. However, practical offline datasets often contain examples with little diversity or limited exploration of the environment, and from multiple behavior policies with diverse expertise levels. Limited exploration can impair the offline RL algorithm’s ability to estimate \textitQ or \textitV values, while constraining towards diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and behavior policy constraints. We first identify the connection between f -divergence and optimization constraint on the Bellman residual through a more general Linear Programming form for RL and the convex conjugate. Following this, we introduce the general flexible function formulation for the f -divergence to incorporate an adaptive constraint on algorithms’ learning objectives based on the offline training dataset. Results from experiments on the MuJoCo, Fetch, and AdroitHand environments show the correctness of the proposed LP form and the potential of the flexible f -divergence in improving performance for learning from a challenging dataset when applied to a compatible constrained optimization algorithm.

[AI-4] GRASP: group-Shapley feature selection for patients

【速读】:该论文旨在解决医学预测中特征选择(feature selection)的鲁棒性和可解释性不足问题,现有方法如LASSO常存在稳定性差、冗余特征多等缺陷。其解决方案的关键在于提出GRASP框架,该框架结合基于Shapley值(Shapley value)的归因分析与group L₂₁正则化,首先通过SHAP从预训练树模型中提取分组重要性评分,再利用group L₂₁正则化的逻辑回归强制结构稀疏性,从而获得紧凑、非冗余且稳定的特征子集,在保持或提升预测性能的同时显著增强模型的可解释性。

链接: https://arxiv.org/abs/2602.11084
作者: Yuheng Luo,Shuyan Li,Zhong Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Feature selection remains a major challenge in medical prediction, where existing approaches such as LASSO often lack robustness and interpretability. We introduce GRASP, a novel framework that couples Shapley value driven attribution with group L_21 regularization to extract compact and non-redundant feature sets. GRASP first distills group level importance scores from a pretrained tree model via SHAP, then enforces structured sparsity through group L_21 regularized logistic regression, yielding stable and interpretable selections. Extensive comparisons with LASSO, SHAP, and deep learning based methods show that GRASP consistently delivers comparable or superior predictive accuracy, while identifying fewer, less redundant, and more stable features.

[AI-5] In-the-Wild Model Organisms: Mitigating Undesirable Emergent Behaviors in Production LLM Post-Training via Data Attribution

【速读】:该论文旨在解决后训练语言模型中行为变化的来源追溯问题,即如何识别导致特定行为(如有害响应)的关键训练数据点。其核心挑战在于从海量、复杂的偏好对(preference pairs)中定位具体的数据贡献,以支持可解释性和可控性改进。解决方案的关键是提出基于激活差异的数据归因方法(activation-based data attribution),通过计算测试提示和偏好对在模型各层的激活差向量,并依据余弦相似度排序来定位责任数据点;进一步结合因果验证(通过修改数据重新训练)和聚类分析,实现对新兴行为的无监督发现与干预。此方法相较梯度基归因和LLM判官基线更具准确性且成本降低逾10倍,为真实场景下的模型安全评估提供了有效工具。

链接: https://arxiv.org/abs/2602.11079
作者: Frank Xiao,Santiago Aranguri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose activation-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, we identify datapoints that cause specific behaviors and validate these attributions causally by retraining with modified data. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors. Applying this to OLMo 2’s production DPO training, we surfaced distractor-triggered compliance: a harmful behavior where the model complies with dangerous requests when benign formatting instructions are appended. Filtering top-ranked datapoints reduces this behavior by 63% while switching their labels achieves 78%. Our method outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both. This in-the-wild model organism - emerging from contaminated preference data rather than deliberate injection - provides a realistic benchmark for safety techniques.

[AI-6] Interpretable Attention-Based Multi-Agent PPO for Latency Spike Resolution in 6G RAN Slicing

【速读】:该论文旨在解决第六代移动通信(6G)无线接入网(RAN)中异构网络切片(heterogeneous slices)在执行服务级别协议(SLA)时面临的挑战,尤其是传统深度强化学习(DRL)或可解释强化学习(XRL)难以诊断和应对突发延迟尖峰的问题。其解决方案的关键在于提出了一种注意力增强的多智能体近端策略优化框架(Attention-Enhanced Multi-Agent Proximal Policy Optimization, AE-MAPPO),该框架将六种专用注意力机制嵌入多智能体切片控制中,并将其作为零成本、忠实的解释机制显式输出;同时采用三阶段策略(预测、响应与跨切片优化),实现了对URLLC切片延迟尖峰的快速响应(18 ms内恢复至0.98 ms,可靠性达99.9999%),并显著降低故障排查时间93%,在保障eMBB和mMTC连续性的前提下实现SLA合规性与内在可解释性的统一,从而支撑6G RAN切片的可信实时自动化管理。

链接: https://arxiv.org/abs/2602.11076
作者: Kavan Fatehi,Mostafa Rahmani Ghourtani,Amir Sonee,Poonam Yadav,Alessandra M Russo,Hamed Ahmadi,Radu Calinescu
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: This work has been accepted to appear in the IEEE International Conference on Communications (ICC)

点击查看摘要

Abstract:Sixth-generation (6G) radio access networks (RANs) must enforce strict service-level agreements (SLAs) for heterogeneous slices, yet sudden latency spikes remain difficult to diagnose and resolve with conventional deep reinforcement learning (DRL) or explainable RL (XRL). We propose \emphAttention-Enhanced Multi-Agent Proximal Policy Optimization (AE-MAPPO), which integrates six specialized attention mechanisms into multi-agent slice control and surfaces them as zero-cost, faithful explanations. The framework operates across O-RAN timescales with a three-phase strategy: predictive, reactive, and inter-slice optimization. A URLLC case study shows AE-MAPPO resolves a latency spike in 18 ms, restores latency to 0.98 ms with 99.9999% reliability, and reduces troubleshooting time by 93% while maintaining eMBB and mMTC continuity. These results confirm AE-MAPPO’s ability to combine SLA compliance with inherent interpretability, enabling trustworthy and real-time automation for 6G RAN slicing. Comments: This work has been accepted to appear in the IEEE International Conference on Communications (ICC) Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Signal Processing (eess.SP) Cite as: arXiv:2602.11076 [eess.SY] (or arXiv:2602.11076v1 [eess.SY] for this version) https://doi.org/10.48550/arXiv.2602.11076 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-7] OSIL: Learning Offline Safe Imitation Policies with Safety Inferred from Non-preferred Trajectories AAMAS2026

【速读】:该论文旨在解决离线安全模仿学习(offline safe imitation learning, OSIL)问题,即在缺乏每时间步的安全成本或奖励信息的情况下,从演示轨迹中学习既安全又能最大化奖励的策略。在许多现实场景中,在环境中进行在线学习可能存在风险,且准确设定安全成本较为困难;但收集反映不良或不安全行为的非偏好轨迹(non-preferred trajectories)通常是可行的,这些轨迹隐含地指示了应避免的行为模式。解决方案的关键在于将安全策略学习建模为约束马尔可夫决策过程(Constrained Markov Decision Process, CMDP),并提出一种新颖的OSIL算法:通过推导奖励最大化目标的下界,并学习一个估计非偏好行为发生概率的成本模型,从而无需显式标注安全成本即可实现安全与高奖励策略的学习。实验表明,该方法能够在满足成本约束的前提下提升安全性,同时不损害奖励性能,优于多个基线方法。

链接: https://arxiv.org/abs/2602.11018
作者: Returaj Burnwal,Nirav Pravinbhai Bhatt,Balaraman Ravindran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 21 pages, Accepted at AAMAS 2026

点击查看摘要

Abstract:This work addresses the problem of offline safe imitation learning (IL), where the goal is to learn safe and reward-maximizing policies from demonstrations that do not have per-timestep safety cost or reward information. In many real-world domains, online learning in the environment can be risky, and specifying accurate safety costs can be difficult. However, it is often feasible to collect trajectories that reflect undesirable or unsafe behavior, implicitly conveying what the agent should avoid. We refer to these as non-preferred trajectories. We propose a novel offline safe IL algorithm, OSIL, that infers safety from non-preferred demonstrations. We formulate safe policy learning as a Constrained Markov Decision Process (CMDP). Instead of relying on explicit safety cost and reward annotations, OSIL reformulates the CMDP problem by deriving a lower bound on reward maximizing objective and learning a cost model that estimates the likelihood of non-preferred behavior. Our approach allows agents to learn safe and reward-maximizing behavior entirely from offline demonstrations. We empirically demonstrate that our approach can learn safer policies that satisfy cost constraints without degrading the reward performance, thus outperforming several baselines.

[AI-8] From Buffers to Registers: Unlocking Fine-Grained FlashAttention with Hybrid-Bonded 3D NPU Co-Design DATE2026

【速读】:该论文旨在解决Transformer模型在大规模部署时面临的内存瓶颈问题,尤其是随着现有加速器(如Groq和Cerebras)通过大容量片上缓存减少片外通信后,片上SRAM访问能耗占比超过60%,成为新的性能瓶颈。其解决方案的关键在于提出一种基于混合键合(hybrid-bonded)3D堆叠的空间加速架构——3D-Flow,该架构支持跨垂直划分的处理单元(PE)层级实现寄存器到寄存器的通信;并通过设计细粒度调度策略3D-FlashAttention,平衡各层级延迟,形成无气泡的垂直数据流,避免片上SRAM往返传输,从而显著降低能耗并提升吞吐量。

链接: https://arxiv.org/abs/2602.11016
作者: Jinxin Yu,Yudong Pan,Mengdi Wang,Huawei Li,Yinhe Han,Xiaowei Li,Ying Wang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: Accepted to DATE 2026

点击查看摘要

Abstract:Transformer-based models dominate modern AI workloads but exacerbate memory bottlenecks due to their quadratic attention complexity and ever-growing model sizes. Existing accelerators, such as Groq and Cerebras, mitigate off-chip traffic with large on-chip caches, while algorithmic innovations such as FlashAttention fuse operators to avoid materializing large attention matrices. However, as off-chip traffic decreases, our measurements show that on-chip SRAM accesses account for over 60% of energy in long-sequence workloads, making cache access the new bottleneck. We propose 3D-Flow, a hybrid-bonded, 3D-stacked spatial accelerator that enables register-to-register communication across vertically partitioned PE tiers. Unlike 2D multi-array architectures limited by NoC-based router-to-router transfers, 3D-Flow leverages sub-10 um vertical TSVs to sustain cycle-level operator pipelining with minimal overhead. On top of this architecture, we design 3D-FlashAttention, a fine-grained scheduling method that balances latency across tiers, forming a bubble-free vertical dataflow without on-chip SRAM roundtrips. Evaluations on Transformer workloads (OPT and QWEN models) show that our 3D spatial accelerator reduces 46-93% energy consumption and achieves 1.4x-7.6x speedups compared to state-of-the-art 2D and 3D designs.

[AI-9] CVPL: A Geometric Framework for Post-Hoc Linkage Risk Assessment in Protected Tabular Data

【速读】:该论文旨在解决现有形式化隐私度量(如k-匿名)在实际数据发布场景中无法准确量化原始数据与保护后数据之间真实链接风险的问题。传统方法往往仅提供二元合规判断,而忽略了攻击者策略和数据结构对链接可能性的连续影响。其解决方案的核心是提出CVPL(Cluster-Vector-Projection Linkage)框架,该框架将链接分析建模为阻塞(blocking)、向量化(vectorization)、潜在空间投影(latent projection)和相似性评估(similarity evaluation)组成的算子流水线,从而生成连续、依赖场景的风险估计,并通过阈值感知风险曲面 $ R(\lambda, \tau) $ 显式刻画保护强度与攻击者严格度的联合效应。该方法还引入具有单调性保证的渐进阻塞策略,支持任意时刻的风险下界估计,且能识别驱动链接可行性的关键特征,实现可解释的隐私影响评估与效用-风险权衡分析。

链接: https://arxiv.org/abs/2602.11015
作者: Valery Khvatov,Alexey Neyman
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 53 pages, 9 figures, 6 appendices. Code: this https URL

点击查看摘要

Abstract:Formal privacy metrics provide compliance-oriented guarantees but often fail to quantify actual linkability in released datasets. We introduce CVPL (Cluster-Vector-Projection Linkage), a geometric framework for post-hoc assessment of linkage risk between original and protected tabular data. CVPL represents linkage analysis as an operator pipeline comprising blocking, vectorization, latent projection, and similarity evaluation, yielding continuous, scenario-dependent risk estimates rather than binary compliance verdicts. We formally define CVPL under an explicit threat model and introduce threshold-aware risk surfaces, R(lambda, tau), that capture the joint effects of protection strength and attacker strictness. We establish a progressive blocking strategy with monotonicity guarantees, enabling anytime risk estimation with valid lower bounds. We demonstrate that the classical Fellegi-Sunter linkage emerges as a special case of CVPL under restrictive assumptions, and that violations of these assumptions can lead to systematic over-linking bias. Empirical validation on 10,000 records across 19 protection configurations demonstrates that formal k-anonymity compliance may coexist with substantial empirical linkability, with a significant portion arising from non-quasi-identifier behavioral patterns. CVPL provides interpretable diagnostics identifying which features drive linkage feasibility, supporting privacy impact assessment, protection mechanism comparison, and utility-risk trade-off analysis.

[AI-10] Fine-Tuning GPT -5 for GPU Kernel Generation

【速读】:该论文旨在解决生成式 AI (Generative AI) 在 GPU 内核代码生成任务中因高质量标注数据稀缺、编译器偏见及跨硬件代际泛化能力有限而导致的性能瓶颈问题。传统监督微调(Supervised Fine-Tuning, SFT)方法难以有效提升模型在该领域的表现,因此论文提出基于强化学习(Reinforcement Learning, RL)的后训练范式作为替代方案。其关键在于构建了一个专门用于前沿模型强化学习微调的环境与工具链(Makora),并设计了适配 GPU 编程特性的奖励机制和评估体系,从而实现对 GPT-5 模型在 Triton 语言上的高效优化。实验表明,该方法显著提升了内核正确率(从 43.7% 提升至 77.0%)和优于 TorchInductor 的比例(从 14.8% 提升至 21.8%),并在完整编码代理场景下实现了高达 97.4% 的问题求解率和平均 2.12x 的加速效果,验证了强化学习在数据受限的专业领域中的强大适应性和有效性。

链接: https://arxiv.org/abs/2602.11000
作者: Ali Tehrani,Yahya Emara,Essam Wissam,Wojciech Paluch,Waleed Atallah,Łukasz Dudziak,Mohamed S. Abdelfattah
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora’s environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.

[AI-11] CLI-Gym: Scalable CLI Task Generation via Agent ic Environment Inversion

【速读】:该论文旨在解决生成式 AI(Generative AI)在环境密集型任务(environment-intensive tasks)中能力不足的问题,特别是如何大规模获取这类任务以提升智能体(agent)的运行时交互能力。其核心挑战在于缺乏足够规模且高质量的环境交互数据来训练和评估代理系统,例如通过命令行界面(CLI)处理依赖冲突或系统故障等实际问题。解决方案的关键在于提出 CLI-Gym 方法:利用 Dockerfile 与 agentic 任务之间的类比关系,通过代理模拟并探索健康环境的历史状态,结合执行反馈逆向推导出存在运行时失败的状态,并将错误状态及其对应的错误信息打包为可学习的任务实例。此方法首次实现了环境密集型任务的规模化自动构建,共生成 1,655 个任务实例,并基于这些任务微调得到的 LiberCoder 模型在 Terminal-Bench 上取得 46.1% 的准确率,相比基线模型提升 21.1%。

链接: https://arxiv.org/abs/2602.10999
作者: Yusong Lin,Haiyang Wang,Shuzhe Wu,Lue Fan,Feiyang Pan,Sanyuan Zhao,Dandan Tu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic coding requires agents to effectively interact with runtime environments, e.g., command line interfaces (CLI), so as to complete tasks like resolving dependency issues, fixing system problems, etc. But it remains underexplored how such environment-intensive tasks can be obtained at scale to enhance agents’ capabilities. To address this, based on an analogy between the Dockerfile and the agentic task, we propose to employ agents to simulate and explore environment histories, guided by execution feedback. By tracing histories of a healthy environment, its state can be inverted to an earlier one with runtime failures, from which a task can be derived by packing the buggy state and the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, being the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves substantial absolute improvements of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.

[AI-12] RiemannGL: Riemannian Geometry Changes Graph Deep Learning

【速读】:该论文旨在解决当前图表示学习中缺乏统一几何基础的问题,指出传统方法多局限于特定流形(如双曲空间)且常采用外在嵌入形式,未能充分挖掘图神经网络的内在流形结构。其解决方案的关键在于将黎曼几何(Riemannian geometry)确立为图表示学习的原理性基础,并提出以“内蕴流形结构”为核心构建统一范式,通过明确三个维度——流形类型、神经架构与学习范式——来系统推进Riemannian图学习的研究,从而推动图神经网络从局部欧氏假设向更普适的非欧几里得几何建模演进。

链接: https://arxiv.org/abs/2602.10982
作者: Li Sun,Qiqi Wan,Suyang Zhou,Zhenhao Huang,Philip S. Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 34 pages, 11 figures, position paper

点击查看摘要

Abstract:Graphs are ubiquitous, and learning on graphs has become a cornerstone in artificial intelligence and data mining communities. Unlike pixel grids in images or sequential structures in language, graphs exhibit a typical non-Euclidean structure with complex interactions among the objects. This paper argues that Riemannian geometry provides a principled and necessary foundation for graph representation learning, and that Riemannian graph learning should be viewed as a unifying paradigm rather than a collection of isolated techniques. While recent studies have explored the integration of graph learning and Riemannian geometry, most existing approaches are limited to a narrow class of manifolds, particularly hyperbolic spaces, and often adopt extrinsic manifold formulations. We contend that the central mission of Riemannian graph learning is to endow graph neural networks with intrinsic manifold structures, which remains underexplored. To advance this perspective, we identify key conceptual and methodological gaps in existing approaches and outline a structured research agenda along three dimensions: manifold type, neural architecture, and learning paradigm. We further discuss open challenges, theoretical foundations, and promising directions that are critical for unlocking the full potential of Riemannian graph learning. This paper aims to provide a coherent viewpoint and to stimulate broader exploration of Riemannian geometry as a foundational framework for future graph learning research.

[AI-13] FeatureBench: Benchmarking Agent ic Coding for Complex Feature Development ICLR2026

【速读】:该论文旨在解决当前用于评估基于大语言模型(Large Language Models, LLMs)的智能体(Agent)在软件开发中编码能力的基准测试存在任务范围狭窄、评估方式非执行化以及缺乏自动化更新机制的问题。现有基准多局限于单个拉取请求(Pull Request, PR)内的错误修复,且难以持续扩展以覆盖更复杂的端到端功能开发场景。其解决方案的关键在于提出 FeatureBench,一个面向功能级(feature-oriented)软件开发的可执行评估框架:通过依赖图追踪单元测试路径,自动从代码仓库中提取跨多个提交和PR的特征级任务,并构建可验证的执行环境;同时采用测试驱动的方法实现任务的规模化与自动化采集,显著提升了评估的覆盖面与可扩展性,从而为评估和改进代理式编程能力提供了可靠、可持续的基准工具。

链接: https://arxiv.org/abs/2602.10975
作者: Qixing Zhou,Jiacheng Zhang,Haiyang Wang,Rui Hao,Jiahe Wang,Minghao Han,Yuxue Yang,Shuzhe Wu,Feiyang Pan,Lue Fan,Dandan Tu,Zhaoxiang Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g., bug fixing within a single pull request (PR), and often rely on non-executable evaluations or lack an automated approach for continually updating the evaluation coverage. To address such issues, we propose FeatureBench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. FeatureBench incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a dependency graph, our approach can identify feature-level coding tasks spanning multiple commits and PRs scattered across the development timeline, while ensuring the proper functioning of other features after the separation. Using this framework, we curated 200 challenging evaluation tasks and 3825 executable environments from 24 open-source repositories in the first version of our benchmark. Empirical evaluation reveals that the state-of-the-art agentic model, such as Claude 4.5 Opus, which achieves a 74.4% resolved rate on SWE-bench, succeeds on only 11.0% of tasks, opening new opportunities for advancing agentic coding. Moreover, benefiting from our automated task collection toolkit, FeatureBench can be easily scaled and updated over time to mitigate data leakage. The inherent verifiability of constructed environments also makes our method potentially valuable for agent training.

[AI-14] Can LLM s Cook Jamaican Couscous? A Study of Cultural Novelty in Recipe Generation

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在跨文化内容生成中是否能够实现对非主导文化的有意义适应,从而避免文化偏见、刻板印象和文化表达的同质化。其解决方案的关键在于通过烹饪食谱这一高度依赖文化传统与创造力的领域,构建了一个基于文化距离度量的跨国食谱配对数据集(GlobalFusion),并利用该数据集对比人类与LLMs在跨文化适应性生成中的行为差异。研究发现,LLMs无法根据文化距离产生具有代表性的适应性食谱,且其生成结果缺乏对文化要素(如食材、传统概念)的准确识别与锚定,这揭示了当前LLMs在文化敏感生成任务中的根本局限性。

链接: https://arxiv.org/abs/2602.10964
作者: F. Carichon,R. Rampa,G. Farnadi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 12 figures, conference

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used to generate and shape cultural content, ranging from narrative writing to artistic production. While these models demonstrate impressive fluency and generative capacity, prior work has shown that they also exhibit systematic cultural biases, raising concerns about stereotyping, homogenization, and the erasure of culturally specific forms of expression. Understanding whether LLMs can meaningfully align with diverse cultures beyond the dominant ones remains a critical challenge. In this paper, we study cultural adaptation in LLMs through the lens of cooking recipes, a domain in which culture, tradition, and creativity are tightly intertwined. We build on the \textitGlobalFusion dataset, which pairs human recipes from different countries according to established measures of cultural distance. Using the same country pairs, we generate culturally adapted recipes with multiple LLMs, enabling a direct comparison between human and LLM behavior in cross-cultural content creation. Our analysis shows that LLMs fail to produce culturally representative adaptations. Unlike humans, the divergence of their generated recipes does not correlate with cultural distance. We further provide explanations for this gap. We show that cultural information is weakly preserved in internal model representations, that models inflate novelty in their production by misunderstanding notions such as creativity and tradition, and that they fail to identify adaptation with its associated countries and to ground it in culturally salient elements such as ingredients. These findings highlight fundamental limitations of current LLMs for culturally oriented generation and have important implications for their use in culturally sensitive applications.

[AI-15] raceable Enforceable and Compensable Participation: A Participation Ledger for People-Centered AI Governance ICIP

【速读】:该论文旨在解决当前人工智能(AI)治理中参与机制形式化、缺乏可追溯性与问责制的问题,即社区在公共部门和公民AI系统中的贡献(如讨论、标注、提示词和事件报告)常被非正式记录,且与系统更新脱节,导致参与行为难以转化为持久影响力或可执行的权利与补偿。解决方案的关键在于提出“参与账本”(Participation Ledger),其核心是将参与行为结构化为可审计的三要素:一是基于标准化证据(Participation Evidence Standard)明确同意、隐私、补偿与再利用条款;二是通过影响追踪机制建立贡献物与系统变更(如数据集、提示词、策略等)之间的可复现因果链,支持长期监测承诺履行情况;三是编码权利与激励机制,包括能力凭证(Capability Vouchers)用于授权社区管理者在限定范围内请求或限制功能,以及参与积分(Participation Credits)对持续产生价值的测试贡献提供持续认可与补偿。

链接: https://arxiv.org/abs/2602.10916
作者: Rashid Mushkani
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Presented at PAIRS: Participatory AI Research Practice Symposium

点击查看摘要

Abstract:Participatory approaches are widely invoked in AI governance, yet participation rarely translates into durable influence. In public sector and civic AI systems, community contributions such as deliberations, annotations, prompts, and incident reports are often recorded informally, weakly linked to system updates, and disconnected from enforceable rights or sustained compensation. As a result, participation is frequently symbolic rather than accountable. We introduce the Participation Ledger, a machine readable and auditable framework that operationalizes participation as traceable influence, enforceable authority, and compensable labor. The ledger represents participation as an influence graph that links contributed artifacts to verified changes in AI systems, including datasets, prompts, adapters, policies, guardrails, and evaluation suites. It integrates three elements: a Participation Evidence Standard documenting consent, privacy, compensation, and reuse terms; an influence tracing mechanism that connects system updates to replayable before and after tests, enabling longitudinal monitoring of commitments; and encoded rights and incentives. Capability Vouchers allow authorized community stewards to request or constrain specific system capabilities within defined boundaries, while Participation Credits support ongoing recognition and compensation when contributed tests continue to provide value. We ground the framework in four urban AI and public space governance deployments and provide a machine readable schema, templates, and an evaluation plan for assessing traceability, enforceability, and compensation in practice.

[AI-16] Blind Gods and Broken Screens: Architecting a Secure Intent-Centric Mobile Agent Operating System

【速读】:该论文旨在解决当前基于大型语言模型(Large Language Models, LLMs)的移动智能代理(Mobile Agents)在“屏幕即接口”(Screen-as-Interface)范式下所面临的安全缺陷问题,这些问题包括虚假应用身份、视觉欺骗、间接提示注入和未经授权的权限提升等,根源在于对非结构化视觉数据的依赖。其解决方案的关键是提出Aura——一种面向安全代理操作系统的通用运行时架构(Agent Universal Runtime Architecture),通过引入结构化的代理原生交互模型替代脆弱的GUI抓取机制,并采用中心辐射拓扑(Hub-and-Spoke)设计:由特权系统代理(System Agent)协调意图、沙箱化应用代理(App Agent)执行特定任务、代理内核(Agent Kernel)统一管控通信;同时,代理内核实施四大防御支柱:(i)基于全局代理注册表的加密身份绑定;(ii)多层语义防火墙实现输入净化;(iii)基于污点感知内存与计划轨迹对齐的认知完整性保障;(iv)细粒度访问控制与不可否认审计。实验证明,相比Doubao Mobile Assistant,Aura显著提升了低风险任务成功率(75% → 94.3%),大幅降低高风险攻击成功率(40% → 4.4%),并获得近一个数量级的延迟优化。

链接: https://arxiv.org/abs/2602.10915
作者: Zhenhua Zou,Sheng Guo,Qiuyang Zhan,Lepeng Zhao,Shuo Li,Qi Li,Ke Xu,Mingwei Xu,Zhuotao Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 35 pages, 15 figures

点击查看摘要

Abstract:The evolution of Large Language Models (LLMs) has shifted mobile computing from App-centric interactions to system-level autonomous agents. Current implementations predominantly rely on a “Screen-as-Interface” paradigm, which inherits structural vulnerabilities and conflicts with the mobile ecosystem’s economic foundations. In this paper, we conduct a systematic security analysis of state-of-the-art mobile agents using Doubao Mobile Assistant as a representative case. We decompose the threat landscape into four dimensions - Agent Identity, External Interface, Internal Reasoning, and Action Execution - revealing critical flaws such as fake App identity, visual spoofing, indirect prompt injection, and unauthorized privilege escalation stemming from a reliance on unstructured visual data. To address these challenges, we propose Aura, an Agent Universal Runtime Architecture for a clean-slate secure agent OS. Aura replaces brittle GUI scraping with a structured, agent-native interaction model. It adopts a Hub-and-Spoke topology where a privileged System Agent orchestrates intent, sandboxed App Agents execute domain-specific tasks, and the Agent Kernel mediates all communication. The Agent Kernel enforces four defense pillars: (i) cryptographic identity binding via a Global Agent Registry; (ii) semantic input sanitization through a multilayer Semantic Firewall; (iii) cognitive integrity via taint-aware memory and plan-trajectory alignment; and (iv) granular access control with non-deniable auditing. Evaluation on MobileSafetyBench shows that, compared to Doubao, Aura improves low-risk Task Success Rate from roughly 75% to 94.3%, reduces high-risk Attack Success Rate from roughly 40% to 4.4%, and achieves near-order-of-magnitude latency gains. These results demonstrate Aura as a viable, secure alternative to the “Screen-as-Interface” paradigm. Comments: 35 pages, 15 figures Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) MSC classes: 68T01, 68M25, 68N25 ACMclasses: I.2.11; D.4.6 Cite as: arXiv:2602.10915 [cs.CR] (or arXiv:2602.10915v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2602.10915 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-17] Resource-Efficient Model-Free Reinforcement Learning for Board Games

【速读】:该论文旨在解决基于搜索的强化学习方法(如AlphaZero)在棋类游戏中因计算资源需求过高而导致可复现性受限的问题。其解决方案的关键在于提出一种无需模型的强化学习算法,通过优化学习效率,在不依赖复杂搜索机制的前提下实现对多种棋类游戏(包括Animal Shogi、Gardner Chess、Go、Hex和Othello)的有效训练,实验表明该方法在多个环境中的学习效率优于现有方法,且核心组件的重要性经由详尽的消融研究得到验证。

链接: https://arxiv.org/abs/2602.10894
作者: Kazuki Ota,Takayuki Osa,Motoki Omura,Tatsuya Harada
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Board games have long served as complex decision-making benchmarks in artificial intelligence. In this field, search-based reinforcement learning methods such as AlphaZero have achieved remarkable success. However, their significant computational demands have been pointed out as barriers to their reproducibility. In this study, we propose a model-free reinforcement learning algorithm designed for board games to achieve more efficient learning. To validate the efficiency of the proposed method, we conducted comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello. The results demonstrate that the proposed method achieves more efficient learning than existing methods across these environments. In addition, our extensive ablation study shows the importance of core techniques used in the proposed method. We believe that our efficient algorithm shows the potential of model-free reinforcement learning in domains traditionally dominated by search-based methods.

[AI-18] Interactive LLM -assisted Curriculum Learning for Multi-Task Evolutionary Policy Search

【速读】:该论文旨在解决多任务策略搜索(multi-task policy search)中策略泛化能力不足的问题,尤其是在训练案例复杂度逐步提升的背景下,传统静态课程设计方法依赖人工干预且缺乏实时反馈机制。其解决方案的关键在于提出一种交互式大语言模型(LLM)辅助的在线课程生成框架,其中LLM根据进化优化过程中的实时反馈动态调整训练案例,从而实现自适应课程演化;实验表明,结合数值指标、进度图和行为可视化等多模态反馈时,该方法性能可媲美专家设计的课程,显著优于静态LLM生成方案。

链接: https://arxiv.org/abs/2602.10891
作者: Berfin Sakallioglu,Giorgia Nadizar,Eric Medvet
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, with Appendix

点击查看摘要

Abstract:Multi-task policy search is a challenging problem because policies are required to generalize beyond training cases. Curriculum learning has proven to be effective in this setting, as it introduces complexity progressively. However, designing effective curricula is labor-intensive and requires extensive domain expertise. LLM-based curriculum generation has only recently emerged as a potential solution, but was limited to operate in static, offline modes without leveraging real-time feedback from the optimizer. Here we propose an interactive LLM-assisted framework for online curriculum generation, where the LLM adaptively designs training cases based on real-time feedback from the evolutionary optimization process. We investigate how different feedback modalities, ranging from numeric metrics alone to combinations with plots and behavior visualizations, influence the LLM ability to generate meaningful curricula. Through a 2D robot navigation case study, tackled with genetic programming as optimizer, we evaluate our approach against static LLM-generated curricula and expert-designed baselines. We show that interactive curriculum generation outperforms static approaches, with multimodal feedback incorporating both progression plots and behavior visualizations yielding performance competitive with expert-designed curricula. This work contributes to understanding how LLMs can serve as interactive curriculum designers for embodied AI systems, with potential extensions to broader evolutionary robotics applications.

[AI-19] Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在链式思维(Chain-of-Thought, CoT)推理过程中,因依赖人工标注的奖励模型(Reward Model, RM)而导致训练成本高、静态RM难以适应CoT分布演化及易受奖励黑客(reward hacking)影响的问题。解决方案的关键在于提出RLCER(Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics),通过自生成且自演化的评分标准(rubrics)对CoT进行监督奖励,从而实现无需人工标注的自主式CoT奖励机制,并在训练中逐步优化,显著优于以结果为中心的强化学习方法(outcome-centric RLVR)。

链接: https://arxiv.org/abs/2602.10885
作者: Leheng Sheng,Wenchang Ma,Ruixin Hong,Xiang Wang,An Zhang,Tat-Seng Chua
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages

点击查看摘要

Abstract:Despite chain-of-thought (CoT) playing crucial roles in LLM reasoning, directly rewarding it is difficult: training a reward model demands heavy human labeling efforts, and static RMs struggle with evolving CoT distributions and reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human annotation efforts and can evolve gradually. Inspired by recent self-evolving training methods, we propose \textbfRLCER (\textbfReinforcement \textbfLearning with \textbfCoT Supervision via Self-\textbfEvolving \textbfRubrics), which enhances the outcome-centric RLVR by rewarding CoTs with self-proposed and self-evolving rubrics. We show that self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. Moreover, when used as in-prompt hints, these self-proposed rubrics further improve inference-time performance.

[AI-20] FedPS: Federated data Preprocessing via aggregated Statistics

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中数据预处理阶段的关键挑战,即在不共享原始数据的前提下,如何高效、一致地完成特征缩放、编码、离散化和缺失值插补等操作。现有研究普遍忽视了这一环节对模型性能的影响,而实际应用中隐私约束与通信效率又限制了分布式预处理的可行性。其解决方案的核心在于提出FedPS框架,该框架基于聚合统计信息进行联邦数据预处理,利用数据摘要(data-sketching)技术在本地高效总结数据特征并保留关键统计量,进而设计出适用于水平和垂直联邦学习场景的分布式算法,实现了灵活、低通信开销且结果一致的预处理流程。

链接: https://arxiv.org/abs/2602.10870
作者: Xuefeng Xu,Graham Cormode
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.

[AI-21] ICA: Information-Aware Credit Assignment for Visually Grounded Long-Horizon Information-Seeking Agents

【速读】:该论文旨在解决开放网络环境中信息检索代理因低信噪比反馈而导致的学习效率低下问题,尤其是在文本解析器忽略布局语义并引入非结构化噪声、以及长时程训练依赖稀疏结果奖励难以明确具体检索动作贡献的情况下。解决方案的关键在于提出一种视觉原生(visual-native)搜索框架,将网页表示为视觉快照(visual snapshots),使智能体能够利用布局线索快速定位关键证据并抑制干扰项;同时引入信息感知的信用分配机制(Information-Aware Credit Assignment, ICA),通过后验分析估计每个检索快照对最终结果的贡献,并将密集的学习信号回传至关键搜索步骤,从而缓解开放网络环境中的信用分配瓶颈。

链接: https://arxiv.org/abs/2602.10863
作者: Cong Pang,Xuyu Feng,Yujie Yi,Zixuan Chen,Jiawei Hong,Tiankuo Yao,Nang Yuan,Jiapeng Luo,Lewei Lu,Xin Lou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the strong performance achieved by reinforcement learning-trained information-seeking agents, learning in open-ended web environments remains severely constrained by low signal-to-noise feedback. Text-based parsers often discard layout semantics and introduce unstructured noise, while long-horizon training typically relies on sparse outcome rewards that obscure which retrieval actions actually matter. We propose a visual-native search framework that represents webpages as visual snapshots, allowing agents to leverage layout cues to quickly localize salient evidence and suppress distractors. To learn effectively from these high-dimensional observations, we introduce Information-Aware Credit Assignment (ICA), a post-hoc method that estimates each retrieved snapshot’s contribution to the final outcome via posterior analysis and propagates dense learning signals back to key search turns. Integrated with a GRPO-based training pipeline, our approach consistently outperforms text-based baselines on diverse information-seeking benchmarks, providing evidence that visual snapshot grounding with information-level credit assignment alleviates the credit-assignment bottleneck in open-ended web environments. The code and datasets will be released in this https URL.

[AI-22] me Series Foundation Models for Energy Load Forecasting on Consumer Hardware: A Multi-Dimensional Zero-Shot Benchmark

【速读】:该论文旨在解决时间序列基础模型(Time Series Foundation Models, TSFMs)在电力负荷预测这一任务关键场景下的适用性问题,尤其是其零样本预测能力是否能有效转化为高精度、强鲁棒性和良好校准的实操性能。核心挑战在于现有TSFMs虽具备通用建模潜力,但缺乏针对电网运行等严苛环境的系统评估,且未明确其相对于传统统计方法(如SARIMA、Seasonal Naive)和行业标准模型(如Prophet)的优势边界。解决方案的关键在于构建一个多维基准框架,涵盖上下文长度敏感性、概率预测校准度、分布偏移鲁棒性及运营决策支持能力四个维度,在真实ERCOT小时级负荷数据(2020–2024)上对四种TSFMs(Chronos-Bolt、Chronos-2、Moirai-2、TinyTimeMixer)与经典基线进行对比实验,并在消费级硬件条件下验证其实际效能——结果表明,顶级TSFM在长上下文(2048小时)下可实现MASE≈0.31,较Seasonal Naive提升47%,且其预训练中学习到的时间模式识别机制使其在短上下文下仍保持稳定性能,显著优于依赖局部拟合的Prophet模型。

链接: https://arxiv.org/abs/2602.10848
作者: Luigi Simeone
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 13 figures

点击查看摘要

Abstract:Time Series Foundation Models (TSFMs) have introduced zero-shot prediction capabilities that bypass the need for task-specific training. Whether these capabilities translate to mission-critical applications such as electricity demand forecasting–where accuracy, calibration, and robustness directly affect grid operations–remains an open question. We present a multi-dimensional benchmark evaluating four TSFMs (Chronos-Bolt, Chronos-2, Moirai-2, and TinyTimeMixer) alongside Prophet as an industry-standard baseline and two statistical references (SARIMA and Seasonal Naive), using ERCOT hourly load data from 2020 to 2024. All experiments run on consumer-grade hardware (AMD Ryzen 7, 16GB RAM, no GPU). The evaluation spans four axes: (1) context length sensitivity from 24 to 2048 hours, (2) probabilistic forecast calibration, (3) robustness under distribution shifts including COVID-19 lockdowns and Winter Storm Uri, and (4) prescriptive analytics for operational decision support. The top-performing foundation models achieve MASE values near 0.31 at long context lengths (C = 2048h, day-ahead horizon), a 47% reduction over the Seasonal Naive baseline. The inclusion of Prophet exposes a structural advantage of pre-trained models: Prophet fails when the fitting window is shorter than its seasonality period (MASE 74 at 24-hour context), while TSFMs maintain stable accuracy even with minimal context because they recognise temporal patterns learned during pre-training rather than estimating them from scratch. Calibration varies substantially across models–Chronos-2 produces well-calibrated prediction intervals (95% empirical coverage at 90% nominal level) while both Moirai-2 and Prophet exhibit overconfidence (~70% coverage). We provide practical model selection guidelines and release the complete benchmark framework for reproducibility.

[AI-23] Enhancing Multivariate Time Series Forecasting with Global Temporal Retrieval ICLR2026

【速读】:该论文旨在解决多变量时间序列预测(Multivariate Time Series Forecasting, MTSF)中因模型仅依赖有限历史上下文而导致无法有效捕捉长周期全局规律的问题。现有方法如简单扩展历史窗口,会引发过拟合、计算成本高及冗余信息处理等弊端。其解决方案的关键在于提出一种轻量级且可插拔的模块——全局时序检索器(Global Temporal Retriever, GTR),该模块通过维护一个自适应的全局时序嵌入,并动态检索与输入序列对齐的相关全局片段,结合二维卷积与残差融合机制,联合建模局部与全局依赖关系,从而在不修改原模型架构的前提下,显著增强模型对长周期模式的感知能力。

链接: https://arxiv.org/abs/2602.10847
作者: Fanpu Cao,Lu Dai,Jindong Han,Hui Xiong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR 2026

点击查看摘要

Abstract:Multivariate time series forecasting (MTSF) plays a vital role in numerous real-world applications, yet existing models remain constrained by their reliance on a limited historical context. This limitation prevents them from effectively capturing global periodic patterns that often span cycles significantly longer than the input horizon - despite such patterns carrying strong predictive signals. Naive solutions, such as extending the historical window, lead to severe drawbacks, including overfitting, prohibitive computational costs, and redundant information processing. To address these challenges, we introduce the Global Temporal Retriever (GTR), a lightweight and plug-and-play module designed to extend any forecasting model’s temporal awareness beyond the immediate historical context. GTR maintains an adaptive global temporal embedding of the entire cycle and dynamically retrieves and aligns relevant global segments with the input sequence. By jointly modeling local and global dependencies through a 2D convolution and residual fusion, GTR effectively bridges short-term observations with long-term periodicity without altering the host model architecture. Extensive experiments on six real-world datasets demonstrate that GTR consistently delivers state-of-the-art performance across both short-term and long-term forecasting scenarios, while incurring minimal parameter and computational overhead. These results highlight GTR as an efficient and general solution for enhancing global periodicity modeling in MTSF tasks. Code is available at this repository: this https URL.

[AI-24] SynergyKGC: Reconciling Topological Heterogeneity in Knowledge Graph Completion via Topology-Aware Synergy

【速读】:该论文旨在解决知识图谱补全(Knowledge Graph Completion, KGC)中因图结构异质性导致的“结构分辨率不匹配”问题,即在不同密度的子图区域中,传统方法难以兼顾密集簇中的结构噪声干扰与稀疏区域的表征崩溃现象。解决方案的关键在于提出SynergyKGC框架,其核心创新包括:通过关系感知的跨模态协同专家(relation-aware cross-attention and semantic-intent-driven gating)实现动态邻域融合;结合密度依赖的身份锚定策略(density-dependent Identity Anchoring)与双塔一致性架构(Double-tower Coherent Consistency),从而有效缓解拓扑异质性并保障训练与推理阶段的表征稳定性。

链接: https://arxiv.org/abs/2602.10845
作者: Xuecheng Zou,Yu Tang,Bingbing Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 5 tables, 7 figures. This work introduces the Active Synergy mechanism and Identity Anchoring for Knowledge Graph Completion. Code: this https URL

点击查看摘要

Abstract:Knowledge Graph Completion (KGC) fundamentally hinges on the coherent fusion of pre-trained entity semantics with heterogeneous topological structures to facilitate robust relational reasoning. However, existing paradigms encounter a critical “structural resolution mismatch,” failing to reconcile divergent representational demands across varying graph densities, which precipitates structural noise interference in dense clusters and catastrophic representation collapse in sparse regions. We present SynergyKGC, an adaptive framework that advances traditional neighbor aggregation to an active Cross-Modal Synergy Expert via relation-aware cross-attention and semantic-intent-driven gating. By coupling a density-dependent Identity Anchoring strategy with a Double-tower Coherent Consistency architecture, SynergyKGC effectively reconciles topological heterogeneity while ensuring representational stability across training and inference phases. Systematic evaluations on two public benchmarks validate the superiority of our method in significantly boosting KGC hit rates, providing empirical evidence for a generalized principle of resilient information integration in non-homogeneous structured data.

[AI-25] See Plan Snap: Evaluating Multimodal GUI Agents in Scratch

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在图形用户界面(GUI)环境中进行程序构建任务时的评估难题,尤其是针对低代码教育平台 Scratch 的可视化编程能力缺乏系统性评测基准的问题。其解决方案的关键在于提出 ScratchWorld 基准测试集,该基准基于“使用-修改-创造”教学框架,涵盖 83 个结构化任务,分为创建、调试、扩展和计算四类问题,并设计两种互补的交互模式:原始模式(primitive mode)用于精细拖拽操作以评估视觉运动控制能力,复合模式(composite mode)则通过高层语义 API 分离程序推理与 GUI 执行;同时引入基于运行时测试的执行评估协议,在浏览器环境中验证所构建 Scratch 程序的功能正确性,从而精准诊断 AI 代理在推理与动作之间的性能差距。

链接: https://arxiv.org/abs/2602.10814
作者: Xingyi Zhang,Yulei Ye,Kaifeng Huang,Wenhao Li,Xiangfeng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning–acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.

[AI-26] PELLI: Framework to effectively integrate LLM s for quality software generation

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在代码生成任务中评估标准单一、比较对象有限,且缺乏对非功能性需求(如可维护性、性能和可靠性)系统性量化分析的问题。其解决方案的关键在于提出一个名为“基于LLM迭代的程序卓越性”(Programmatic Excellence via LLM Iteration, PELLI)的综合性代码质量评估框架,该框架通过迭代式分析过程确保高质量代码变更,并首次在三个应用领域内对五种主流LLMs进行多维度定量评估,从而为开发者提供可落地的实践指导,同时揭示不同提示设计对代码质量的影响以及各模型在具体场景下的表现差异。

链接: https://arxiv.org/abs/2602.10808
作者: Rasmus Krebs,Somnath Mazumdar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 15 pages

点击查看摘要

Abstract:Recent studies have revealed that when LLMs are appropriately prompted and configured, they demonstrate mixed results. Such results often meet or exceed the baseline performance. However, these comparisons have two primary issues. First, they mostly considered only reliability as a comparison metric and selected a few LLMs (such as Codex and ChatGPT) for comparision. This paper proposes a comprehensive code quality assessment framework called Programmatic Excellence via LLM Iteration (PELLI). PELLI is an iterative analysis-based process that upholds high-quality code changes. We extended the state-of-the-art by performing a comprehensive evaluation that generates quantitative metrics for analyzing three primary nonfunctional requirements (such as maintainability, performance, and reliability) while selecting five popular LLMs. For PELLI’s applicability, we selected three application domains while following Python coding standards. Following this framework, practitioners can ensure harmonious integration between LLMs and human developers, ensuring that their potential is fully realized. PELLI can serve as a practical guide for developers aiming to leverage LLMs while adhering to recognized quality standards. This study’s outcomes are crucial for advancing LLM technologies in real-world applications, providing stakeholders with a clear understanding of where these LLMs excel and where they require further refinement. Overall, based on three nonfunctional requirements, we have found that GPT-4T and Gemini performed slightly better. We also found that prompt design can influence the overall code quality. In addition, each application domain demonstrated high and low scores across various metrics, and even within the same metrics across different prompts.

[AI-27] Integrating Generative AI-enhanced Cognitive Systems in Higher Education: From Stakeholder Perceptions to a Conceptual Framework considering the EU AI Act

【速读】:该论文旨在解决高等教育机构在引入生成式 AI(Generative AI)过程中面临的多重挑战,包括利益相关者对 GenAI 的认知分歧、跨学科差异以及欧盟《人工智能法案》(EU AI Act)所要求的合规性问题。解决方案的关键在于通过混合方法调研(问卷调查与定性分析相结合),识别出信息与电气工程(ITEE)领域师生对 GenAI 的共性与差异化需求,并据此提炼出一套高层级责任型整合要求及概念框架。该框架强调以利益相关者参与为核心,确保 GenAI 在提升编程支持等教学效能的同时,兼顾响应质量、隐私保护和学术诚信等关键关切,从而为高校提供可操作的实施路径,实现技术赋能与合规治理的协同推进。

链接: https://arxiv.org/abs/2602.10802
作者: Da-Lun Chen,Prasasthy Balasubramanian,Lauri Lovén,Susanna Pirttikangas,Jaakko Sauvola,Panagiotis Kostakos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many staff and students in higher education have adopted generative artificial intelligence (GenAI) tools in their work and study. GenAI is expected to enhance cognitive systems by enabling personalized learning and streamlining educational services. However, stakeholders perceptions of GenAI in higher education remain divided, shaped by cultural, disciplinary, and institutional contexts. In addition, the EU AI Act requires universities to ensure regulatory compliance when deploying cognitive systems. These developments highlight the need for institutions to engage stakeholders and tailor GenAI integration to their needs while addressing concerns. This study investigates how GenAI is perceived within the disciplines of Information Technology and Electrical Engineering (ITEE). Using a mixed-method approach, we surveyed 61 staff and 37 students at the Faculty of ITEE, University of Oulu. The results reveal both shared and discipline-specific themes, including strong interest in programming support from GenAI and concerns over response quality, privacy, and academic integrity. Drawing from these insights, the study identifies a set of high-level requirements and proposes a conceptual framework for responsible GenAI integration. Disciplinary-specific requirements reinforce the importance of stakeholder engagement when integrating GenAI into higher education. The high-level requirements and the framework provide practical guidance for universities aiming to harness GenAI while addressing stakeholder concerns and ensuring regulatory compliance.

[AI-28] ransport Dont Generate: Deterministic Geometric Flows for Combinatorial Optimization

【速读】:该论文旨在解决神经组合优化(Neural Combinatorial Optimization, NCO)中求解旅行商问题(Traveling Salesman Problem, TSP)时计算效率低的问题,尤其是传统扩散模型在处理欧几里得TSP时需进行迭代边去噪,导致时间复杂度为二次方级(O(N²))的瓶颈。其解决方案的关键在于提出CycFlow框架,通过引入一个实例相关的向量场(instance-conditioned vector field),将输入的二维坐标点直接映射到一个标准圆形排列(canonical circular arrangement),从而以确定性点传输替代随机边去噪过程;最优路径则从该2N维表示中通过角度排序恢复。该方法利用数据依赖的流匹配(data-dependent flow matching)机制,将计算复杂度从二次降低至线性,使求解速度相较最先进的扩散基线提升高达三个数量级,同时保持了良好的最优性差距。

链接: https://arxiv.org/abs/2602.10794
作者: Benjy Friedmann,Nadav Dym
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint. 10 pages

点击查看摘要

Abstract:Recent advances in Neural Combinatorial Optimization (NCO) have been dominated by diffusion models that treat the Euclidean Traveling Salesman Problem (TSP) as a stochastic N \times N heatmap generation task. In this paper, we propose CycFlow, a framework that replaces iterative edge denoising with deterministic point transport. CycFlow learns an instance-conditioned vector field that continuously transports input 2D coordinates to a canonical circular arrangement, where the optimal tour is recovered from this 2N dimensional representation via angular sorting. By leveraging data-dependent flow matching, we bypass the quadratic bottleneck of edge scoring in favor of linear coordinate dynamics. This paradigm shift accelerates solving speed by up to three orders of magnitude compared to state-of-the-art diffusion baselines, while maintaining competitive optimality gaps.

[AI-29] LOREN: Low Rank-Based Code-Rate Adaptation in Neural Receivers

【速读】:该论文旨在解决神经网络接收机在实际应用中因需为每个码率(code rate)单独存储权重而带来的高内存和功耗问题。解决方案的关键在于提出LOREN(Low Rank-Based Code-Rate Adaptation Neural Receiver),其通过在卷积层中集成轻量级低秩适配器(LOREN adapters),冻结共享的基础网络,仅对每种码率训练小型适配器,从而实现码率自适应且开销极低。该方法在3GPP CDL信道环境下进行端到端训练,确保了在真实无线环境中的鲁棒性,并在22nm工艺下实现了超过65%的硅面积节省和最高15%的功耗降低。

链接: https://arxiv.org/abs/2602.10770
作者: Bram Van Bolderik,Vlado Menkovski(Technische Universiteit Eindhoven, The Netherlands),Sonia Heemstra de Groot(Eindhoven Technical University, The Netherlands),Manil Dev Gomony(Eindhoven University of Technology, The Netherlands)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Signal Processing (eess.SP)
备注: Accepted to / To appear IEEE Wireless Communications and Networking Conference Kuala Lumpur, Malaysia 13 - 16 April 2026

点击查看摘要

Abstract:Neural network based receivers have recently demonstrated superior system-level performance compared to traditional receivers. However, their practicality is limited by high memory and power requirements, as separate weight sets must be stored for each code rate. To address this challenge, we propose LOREN, a Low Rank-Based Code-Rate Adaptation Neural Receiver that achieves adaptability with minimal overhead. LOREN integrates lightweight low rank adaptation adapters (LOREN adapters) into convolutional layers, freezing a shared base network while training only small adapters per code rate. An end-to-end training framework over 3GPP CDL channels ensures robustness across realistic wireless environments. LOREN achieves comparable or superior performance relative to fully retrained base neural receivers. The hardware implementation of LOREN in 22nm technology shows more than 65% savings in silicon area and up to 15% power reduction when supporting three code rates.

[AI-30] Exploring the impact of adaptive rewiring in Graph Neural Networks

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在大规模图应用中面临的高内存占用和计算成本问题。其核心解决方案是通过稀疏化(sparsification)作为正则化手段,结合网络科学与机器学习技术(如Erdős-Rényi模型进行图结构稀疏化),提升GNN在实际场景中的效率。关键创新在于提出一种自适应重布线(adaptive rewiring)策略,并结合早停(early stopping)机制,使模型能在训练过程中动态调整图的连接结构,从而在保持模型表达能力的同时优化泛化性能和可扩展性。实验表明,适当控制稀疏度对提升GNN性能至关重要,过度稀疏会损害复杂模式的学习能力。

链接: https://arxiv.org/abs/2602.10754
作者: Charlotte Cambier van Nooten,Christos Aronis,Yuliya Shapovalova,Lucia Cavallaro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Machine Learning (stat.ML)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:This paper explores sparsification methods as a form of regularization in Graph Neural Networks (GNNs) to address high memory usage and computational costs in large-scale graph applications. Using techniques from Network Science and Machine Learning, including Erdős-Rényi for model sparsification, we enhance the efficiency of GNNs for real-world applications. We demonstrate our approach on N-1 contingency assessment in electrical grids, a critical task for ensuring grid reliability. We apply our methods to three datasets of varying sizes, exploring Graph Convolutional Networks (GCN) and Graph Isomorphism Networks (GIN) with different degrees of sparsification and rewiring. Comparison across sparsification levels shows the potential of combining insights from both research fields to improve GNN performance and scalability. Our experiments highlight the importance of tuning sparsity parameters: while sparsity can improve generalization, excessive sparsity may hinder learning of complex patterns. Our adaptive rewiring approach, particularly when combined with early stopping, proves promising by allowing the model to adapt its connectivity structure during training. This research contributes to understanding how sparsity can be effectively leveraged in GNNs for critical applications like power grid reliability analysis.

[AI-31] Cross-Sectional Asset Retrieval via Future-Aligned Soft Contrastive Learning

【速读】:该论文旨在解决资产检索(Asset Retrieval)中传统方法依赖历史价格模式或行业分类定义相似性所带来的局限性,这类方法无法保证未来行为的一致性。为实现更有效的资产检索,论文提出未来对齐的表示学习框架——未来对齐软对比学习(Future-Aligned Soft Contrastive Learning, FASCL),其核心创新在于使用成对资产的未来收益相关性作为连续监督信号来优化软对比损失函数,从而引导模型学习到能预测未来协同行为的资产表示。这一设计使得检索出的资产在后续时间段内更可能表现出相似的收益轨迹,显著优于13种基线方法。

链接: https://arxiv.org/abs/2602.10711
作者: Hyeongmin Lee,Chanyeol Choi,Jihoon Kwon,Yoon Kim,Alejandro Lopez-Lira,Wonbin Ahn,Yongjae Lee
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Asset retrieval–finding similar assets in a financial universe–is central to quantitative investment decision-making. Existing approaches define similarity through historical price patterns or sector classifications, but such backward-looking criteria provide no guarantee about future behavior. We argue that effective asset retrieval should be future-aligned: the retrieved assets should be those most likely to exhibit correlated future returns. To this end, we propose Future-Aligned Soft Contrastive Learning (FASCL), a representation learning framework whose soft contrastive loss uses pairwise future return correlations as continuous supervision targets. We further introduce an evaluation protocol designed to directly assess whether retrieved assets share similar future trajectories. Experiments on 4,229 US equities demonstrate that FASCL consistently outperforms 13 baselines across all future-behavior metrics. The source code will be available soon.

[AI-32] Interpretable Graph-Level Anomaly Detection via Contrast with Normal Prototypes

【速读】:该论文旨在解决图级异常检测(Graph-Level Anomaly Detection, GLAD)中现有深度方法因黑箱特性导致可解释性差、难以在实际场景中部署的问题。具体而言,现有解释方法要么缺乏对正常图的参照,要么依赖抽象的潜在向量作为原型而非真实数据中的具体图结构。解决方案的关键在于提出一种基于原型的无监督框架ProtoGLAD,其通过点集核(point-set kernel)迭代发现多个来自数据集的正常原型图及其对应的聚类,并将远离所有已发现正常簇的图识别为异常;同时,每个异常结果均可通过与最近的正常原型图进行显式对比来提供人类可理解的解释,从而兼顾检测性能与可解释性。

链接: https://arxiv.org/abs/2602.10708
作者: Qiuran Zhao,Kai Ming Ting,Xinpeng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The task of graph-level anomaly detection (GLAD) is to identify anomalous graphs that deviate significantly from the majority of graphs in a dataset. While deep GLAD methods have shown promising performance, their black-box nature limits their reliability and deployment in real-world applications. Although some recent methods have made attempts to provide explanations for anomaly detection results, they either provide explanations without referencing normal graphs, or rely on abstract latent vectors as prototypes rather than concrete graphs from the dataset. To address these limitations, we propose Prototype-based Graph-Level Anomaly Detection (ProtoGLAD), an interpretable unsupervised framework that provides explanation for each detected anomaly by explicitly contrasting with its nearest normal prototype graph. It employs a point-set kernel to iteratively discover multiple normal prototype graphs and their associated clusters from the dataset, then identifying graphs distant from all discovered normal clusters as anomalies. Extensive experiments on multiple real-world datasets demonstrate that ProtoGLAD achieves competitive anomaly detection performance compared to state-of-the-art GLAD methods while providing better human-interpretable prototype-based explanations.

[AI-33] Spend Search Where It Pays: Value-Guided Structured Sampling and Optimization for Generative Recommendation

【速读】:该论文旨在解决生成式推荐(Generative Recommendation)中基于自回归模型的强化学习(Reinforcement Learning, RL)训练所面临的概率-奖励不匹配问题。具体而言,传统以似然主导的解码策略(如束搜索)会因局部高概率前缀的短视偏差导致两个关键缺陷:一是探索不足,即高奖励但低概率的候选项被过早剪枝;二是优势压缩,即共享高概率前缀的轨迹获得高度相关且方差小的奖励信号,削弱了RL的学习效果。解决方案的核心在于提出V-STAR框架,其关键创新包括两部分:一是价值引导的高效解码(Value-Guided Efficient Decoding, VED),通过识别决策节点并选择性扩展高潜力前缀来提升探索效率;二是基于兄弟节点相对优势的GRPO算法(Sibling-GRPO),利用树结构拓扑计算兄弟节点间的相对优势,将学习信号集中于关键分支决策点,从而增强RL的信号强度与有效性。

链接: https://arxiv.org/abs/2602.10699
作者: Jie Jiang,Yangru Huang,Zeyu Wang,Changping Wang,Yuling Xiong,Jun Zhang,Huan Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, a Value-Guided Efficient Decoding (VED) is developed to identify decisive nodes and selectively deepen high-potential prefixes. This improves exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.

[AI-34] VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)训练中因策略僵化(policy staleness)、异步训练以及训练与推理引擎不匹配所导致的行为策略偏离当前策略的问题,从而避免训练崩溃。其解决方案的关键在于提出变分序列级软策略优化(Variational sEquence-level Soft Policy Optimization, VESPO),通过将方差缩减引入对提议分布的变分公式,推导出一个直接作用于序列级重要性权重的闭式重塑核(reshaping kernel),无需进行长度归一化,从而实现对分布偏移的稳定修正,并在高僵化比(最高达64倍)和全异步执行场景下保持训练稳定性,同时在密集模型和专家混合(Mixture-of-Experts)模型上均取得一致性能提升。

链接: https://arxiv.org/abs/2602.10693
作者: Guobin Shen,Chenxiao Zhao,Xiang Cheng,Lei Huang,Xing Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at this https URL

[AI-35] OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization

【速读】:该论文旨在解决当前社会智能AI模型在处理多维人类行为(如情感、认知或社交属性)时,因任务特定建模导致训练成本高且跨场景泛化能力受限的问题。现有强化学习(Reinforcement Learning, RL)方法虽能训练统一模型以应对多种行为任务,但未充分考虑异质性行为数据的学习问题。其解决方案的关键在于提出一种异质感知的相对策略优化方法(Heterogeneity-Aware Relative Policy Optimization, HARPO),通过调节优势值(advantages)来平衡不同任务与样本对策略优化的影响,从而避免单一任务或样本占据主导地位,提升模型在多样化行为数据上的鲁棒性和泛化性能。基于此方法,作者构建了Omnisapiens-7B 2.0这一社会行为处理基础模型,在多任务和保留测试设置中均取得显著性能提升,同时生成更清晰、稳定的推理轨迹。

链接: https://arxiv.org/abs/2602.10635
作者: Keane Ong,Sabri Boughorbel,Luwei Xiao,Chanakya Ekbote,Wei Dai,Ao Qu,Jingyao Wu,Rui Mao,Ehsan Hoque,Erik Cambria,Gianmarco Mengaldo,Paul Pu Liang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To develop socially intelligent AI, existing approaches typically model human behavioral dimensions (e.g., affective, cognitive, or social attributes) in isolation. Although useful, task-specific modeling often increases training costs and limits generalization across behavioral settings. Recent reasoning RL methods facilitate training a single unified model across multiple behavioral tasks, but do not explicitly address learning across different heterogeneous behavioral data. To address this gap, we introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), an RL method that balances leaning across heterogeneous tasks and samples. This is achieved by modulating advantages to ensure that no single task or sample carries disproportionate influence during policy optimization. Using HARPO, we develop and release Omnisapiens-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, Omnisapiens-7B 2.0 achieves the strongest performance across behavioral tasks, with gains of up to +16.85% and +9.37% on multitask and held-out settings respectively, while producing more explicit and robust reasoning traces. We also validate HARPO against recent RL methods, where it achieves the most consistently strong performance across behavioral tasks.

[AI-36] he Neurosymbolic Frontier of Nonuniform Ellipticity: Formalizing Sharp Schauder Theory via Topos-Theoretic Reasoning Models

【速读】:该论文旨在解决非均匀椭圆正则性理论中长期悬而未决的尖锐增长速率猜想(sharp growth rate conjecture),即确定梯度 Hölder 连续性的精确阈值 $ q/p = 1 + \alpha/n $,其中 $ \alpha $ 表示 Hölder 指数,$ n $ 为空间维度。解决方案的关键在于引入“幽灵方程”(ghost equation)方法——一种精巧的辅助推导技术,能够绕过经典 Euler-Lagrange 系统因不可微性带来的障碍,从而实现对复杂非线性偏微分方程解的正则性分析。这一数学突破为将纯分析工具与神经符号大推理模型(neurosymbolic large reasoning models, LRMs)融合提供了基础,推动了基于拓扑范畴论和形式验证框架(如 Safe and Typed Chain-of-Thought, PC-CoT)的自动推理系统在物理多相系统中的应用。

链接: https://arxiv.org/abs/2602.10632
作者: Suyash Mishra
机构: 未知
类目: ymbolic Computation (cs.SC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This white paper presents a critical synthesis of the recent breakthrough in nonuniformly elliptic regularity theory and the burgeoning field of neurosymbolic large reasoning models (LRMs). We explore the resolution of the long-standing sharp growth rate conjecture in Schauder theory, achieved by Cristiana De Filippis and Giuseppe Mingione, which identifies the exact threshold q/p 1 + \alpha/n for gradient Hölder continuity. Central to this mathematical achievement is the ghost equation'' methodology, a sophisticated auxiliary derivation that bypasses the non-differentiability of classical Euler-Lagrange systems. We propose that the next era of mathematical discovery lies in the integration of these pure analytical constructs with LRMs grounded in topos theory and formal verification frameworks such as Safe and Typed Chain-of-Thought (PC-CoT). By modeling the reasoning process as a categorical colimit in a slice topos, we demonstrate how LRMs can autonomously navigate the Dark Side’’ of the calculus of variations, providing machine-checkable proofs for regularity bounds in complex, multi-phase physical systems.

[AI-37] Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

【速读】:该论文旨在解决基于人类偏好学习的奖励模型在强化学习中因标注噪声和系统性偏差(如响应长度或风格)导致的奖励黑客(reward hacking)问题。其解决方案的关键在于提出贝叶斯非负奖励模型(Bayesian Non-Negative Reward Model, BNRM),该模型将非负因子分析与Bradley-Terry(BT)偏好模型相结合,通过两个互补层次的潜在变量结构实现鲁棒的不确定性感知奖励学习:实例特定的潜在变量生成解耦的奖励表示,而全局潜在因子的稀疏性则作为隐式去偏机制抑制虚假相关性,从而提升奖励建模的稳定性与可解释性。

链接: https://arxiv.org/abs/2602.10623
作者: Zhibin Duan,Guowei Rong,Zhuo Li,Bo Chen,Mingyuan Zhou,Dandan Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.

[AI-38] Hierarchical Zero-Order Optimization for Deep Neural Networks

【速读】:该论文旨在解决零阶(Zeroth-Order, ZO)优化在深度神经网络中因查询复杂度高而难以应用的问题。传统ZO优化方法的查询复杂度为 $ O(ML^2) $,其中 $ M $ 为网络宽度,$ L $ 为网络深度,这限制了其在大规模模型中的实用性。论文提出了一种分层零阶(Hierarchical Zeroth-Order, HZO)优化策略,其关键在于将网络的深度维度进行分解,采用一种分治思想来降低整体查询复杂度至 $ O(ML \log L) $,显著优于现有ZO方法。此外,HZO通过在单位极限附近操作(即 Lipschitz 常数 $ L_{\text{lip}} \approx 1 $)保证了数值稳定性,并在 CIFAR-10 和 ImageNet 数据集上验证了其与反向传播相当的性能表现。

链接: https://arxiv.org/abs/2602.10607
作者: Sansheng Cao,Zhengyu Ma,Yonghong Tian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Corresponding author: Zhengyu Ma (mazhy@pcl. this http URL )

点击查看摘要

Abstract:Zeroth-order (ZO) optimization has long been favored for its biological plausibility and its capacity to handle non-differentiable objectives, yet its computational complexity has historically limited its application in deep neural networks. Challenging the conventional paradigm that gradients propagate layer-by-layer, we propose Hierarchical Zeroth-Order (HZO) optimization, a novel divide-and-conquer strategy that decomposes the depth dimension of the network. We prove that HZO reduces the query complexity from O(ML^2) to O(ML \log L) for a network of width M and depth L , representing a significant leap over existing ZO methodologies. Furthermore, we provide a detailed error analysis showing that HZO maintains numerical stability by operating near the unitary limit ( L_lip \approx 1 ). Extensive evaluations on CIFAR-10 and ImageNet demonstrate that HZO achieves competitive accuracy compared to backpropagation.

[AI-39] Neuro-symbolic Action Masking for Deep Reinforcement Learning

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在训练和执行过程中可能探索不可行动作的问题。现有方法依赖于符号接地函数(symbol grounding function)将高维状态映射为一致的符号表示,并采用人工指定的动作掩码(action masking)技术来约束动作空间。其解决方案的关键在于提出神经符号动作掩码(Neuro-symbolic Action Masking, NSAM)框架,该框架能够在DRL过程中以最小监督方式自动学习与给定状态域约束一致的符号模型,并基于此符号模型自动学习排除不可行动作的动作掩码。NSAM实现了符号推理与深度策略优化的端到端融合,使得符号接地质量与策略学习相互增强,从而显著提升样本效率并减少约束违反。

链接: https://arxiv.org/abs/2602.10598
作者: Shuai Han,Mehdi Dastani,Shihan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep reinforcement learning (DRL) may explore infeasible actions during training and execution. Existing approaches assume a symbol grounding function that maps high-dimensional states to consistent symbolic representations and a manually specified action masking techniques to constrain actions. In this paper, we propose Neuro-symbolic Action Masking (NSAM), a novel framework that automatically learn symbolic models, which are consistent with given domain constraints of high-dimensional states, in a minimally supervised manner during the DRL process. Based on the learned symbolic model of states, NSAM learns action masks that rules out infeasible actions. NSAM enables end-to-end integration of symbolic reasoning and deep policy optimization, where improvements in symbolic grounding and policy learning mutually reinforce each other. We evaluate NSAM on multiple domains with constraints, and experimental results demonstrate that NSAM significantly improves sample efficiency of DRL agent while substantially reducing constraint violations.

[AI-40] Neural Additive Experts: Context-Gated Experts for Controllable Model Additivity AISTATS2026

【速读】:该论文旨在解决机器学习中解释性(interpretability)与准确性(accuracy)之间的权衡问题。标准广义加性模型(Generalized Additive Models, GAMs)虽能提供清晰的特征归因,但其严格的加性结构限制了预测性能;引入特征交互虽可提升准确率,却可能削弱单个特征贡献的可解释性。解决方案的关键在于提出神经加性专家模型(Neural Additive Experts, NAEs),其核心是采用专家混合(mixture of experts)框架,在每个特征上学习多个专用网络,并通过动态门控机制整合跨特征信息,从而在不牺牲解释性的前提下放宽加性约束。此外,设计针对性正则化技术以降低专家预测方差,实现从纯加性模型到包含复杂特征交互的平滑过渡,同时保持特征级解释的透明性。

链接: https://arxiv.org/abs/2602.10585
作者: Guangzhi Xiong,Sanchit Sinha,Aidong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AISTATS 2026

点击查看摘要

Abstract:The trade-off between interpretability and accuracy remains a core challenge in machine learning. Standard Generalized Additive Models (GAMs) offer clear feature attributions but are often constrained by their strictly additive nature, which can limit predictive performance. Introducing feature interactions can boost accuracy yet may obscure individual feature contributions. To address these issues, we propose Neural Additive Experts (NAEs), a novel framework that seamlessly balances interpretability and accuracy. NAEs employ a mixture of experts framework, learning multiple specialized networks per feature, while a dynamic gating mechanism integrates information across features, thereby relaxing rigid additive constraints. Furthermore, we propose targeted regularization techniques to mitigate variance among expert predictions, facilitating a smooth transition from an exclusively additive model to one that captures intricate feature interactions while maintaining clarity in feature attributions. Our theoretical analysis and experiments on synthetic data illustrate the model’s flexibility, and extensive evaluations on real-world datasets confirm that NAEs achieve an optimal balance between predictive accuracy and transparent, feature-level explanations. The code is available at this https URL.

[AI-41] Flow of Spans: Generalizing Language Models to Dynamic Span-Vocabulary via GFlowNets ICLR2026

【速读】:该论文旨在解决传统自回归语言模型在文本生成过程中受限于固定词汇表所导致的树状状态空间结构问题,这一限制降低了生成过程的灵活性与表达能力;同时,现有基于动态词汇的方法虽引入了检索文本片段,但未显式建模有向无环图(Directed Acyclic Graph, DAG)状态空间,从而限制了组合路径的探索并引入路径偏置。其解决方案的关键在于提出生成流网络(Generative Flow Networks, GFlowNets)框架下的Span生成方法——Flow of SpanS (FOSS),通过灵活分割检索到的文本片段构建动态跨度词汇,显式构建DAG结构的状态空间,使GFlowNets能够高效探索多样化的组合路径,显著提升文本生成质量与知识密集型任务表现。

链接: https://arxiv.org/abs/2602.10583
作者: Bo Xue,Yunchong Song,Fanghao Shao,Xuekai Zhu,Lin Chen,Luoyi Fu,Xinbing Wang,Zhouhan Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ICLR 2026

点击查看摘要

Abstract:Standard autoregressive language models generate text token-by-token from a fixed vocabulary, inducing a tree-structured state space when viewing token sampling as an action, which limits flexibility and expressiveness. Recent work introduces dynamic vocabulary by sampling retrieved text spans but overlooks that the same sentence can be composed of spans of varying lengths, lacking explicit modeling of the directed acyclic graph (DAG) state space. This leads to restricted exploration of compositional paths and is biased toward the chosen path. Generative Flow Networks (GFlowNets) are powerful for efficient exploring and generalizing over state spaces, particularly those with a DAG structure. However, prior GFlowNets-based language models operate at the token level and remain confined to tree-structured spaces, limiting their potential. In this work, we propose Flow of SpanS (FOSS), a principled GFlowNets framework for span generation. FoSS constructs a dynamic span vocabulary by segmenting the retrieved text flexibly, ensuring a DAG-structured state space, which allows GFlowNets to explore diverse compositional paths and improve generalization. With specialized reward models, FoSS generates diverse, high-quality text. Empirically, FoSS improves MAUVE scores by up to 12.5% over Transformer on text generation and achieves 3.5% gains on knowledge-intensive tasks, consistently outperforming state-of-the-art methods. Scaling experiments further demonstrate FoSS benefits from larger models, more data, and richer retrieval corpora, retaining its advantage over strong baselines.

[AI-42] LLM -Based Scientific Equation Discovery via Physics-Informed Token-Regularized Policy Optimization

【速读】:该论文旨在解决现有符号回归(Symbolic Regression)方法中,基于大语言模型(LLM)的方程生成框架因缺乏反馈驱动的自适应机制而导致物理不一致或结构冗余的问题。其核心解决方案是提出PiT-PO(Physics-informed Token-regularized Policy Optimization)框架,关键在于引入一种双约束机制:一方面通过层次化物理有效性约束确保生成方程的科学合理性,另一方面在token级别施加细粒度惩罚以抑制冗余结构,从而实现LLM从静态生成器向具备强化学习驱动的自适应生成器的演进。

链接: https://arxiv.org/abs/2602.10576
作者: Boxiao Wang,Kai Li,Tianyi Liu,Chen Li,Junzhe Wang,Yifan Zhang,Jian Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Symbolic regression aims to distill mathematical equations from observational data. Recent approaches have successfully leveraged Large Language Models (LLMs) to generate equation hypotheses, capitalizing on their vast pre-trained scientific priors. However, existing frameworks predominantly treat the LLM as a static generator, relying on prompt-level guidance to steer exploration. This paradigm fails to update the model’s internal representations based on search feedback, often yielding physically inconsistent or mathematically redundant expressions. In this work, we propose PiT-PO (Physics-informed Token-regularized Policy Optimization), a unified framework that evolves the LLM into an adaptive generator via reinforcement learning. Central to PiT-PO is a dual-constraint mechanism that rigorously enforces hierarchical physical validity while simultaneously applying fine-grained, token-level penalties to suppress redundant structures. Consequently, PiT-PO aligns LLM to produce equations that are both scientifically consistent and structurally parsimonious. Empirically, PiT-PO achieves state-of-the-art performance on standard benchmarks and successfully discovers novel turbulence models for challenging fluid dynamics problems. We also demonstrate that PiT-PO empowers small-scale models to outperform closed-source giants, democratizing access to high-performance scientific discovery.

[AI-43] LAP: Language-Action Pre-Training Enables Zero-shot Cross-Embodiment Transfer

【速读】:该论文旨在解决当前视觉-语言-动作模型(Vision-Language-Action models, VLAs)在新机器人本体(robot embodiment)上缺乏零样本迁移能力的问题,即现有模型通常高度依赖训练时的特定机器人结构,需进行昂贵的微调才能部署。其解决方案的关键在于提出一种名为语言-动作预训练(Language-Action Pre-training, LAP)的新范式:通过将低级机器人动作直接以自然语言形式表示,使动作监督信号与预训练视觉-语言模型的输入-输出分布对齐,从而实现无需特定本体架构设计、无需学习分词器或人工标注的通用策略学习。基于此方法构建的LAP-3B模型首次实现了在未见过的机器人本体上的显著零样本迁移性能(平均成功率超50%),较最强基线提升约2倍。

链接: https://arxiv.org/abs/2602.10556
作者: Lihan Zha,Asher J. Hancock,Mingtong Zhang,Tenny Yin,Yixuan Huang,Dhruv Shah,Allen Z. Ren,Anirudha Majumdar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project website: this https URL

点击查看摘要

Abstract:A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training, existing Vision-Language-Action models (VLAs) remain tightly coupled to their training embodiments and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple recipe that represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model’s input-output distribution. LAP requires no learned tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across multiple novel robots and manipulation tasks, LAP-3B attains over 50% average zero-shot success, delivering roughly a 2x improvement over the strongest prior VLAs. We further show that LAP enables efficient adaptation and favorable scaling, while unifying action prediction and VQA in a shared language-action format that yields additional gains through co-training.

[AI-44] Contrastive Learning for Multi Label ECG Classification with Jaccard Score Based Sigmoid Loss

【速读】:该论文旨在解决当前多模态医学人工智能模型在心电图(Electrocardiogram, ECG)分析任务中表现受限的问题,尤其是现有模型如MedGemma缺乏对ECG数据的支持,而其他模型在ECG多标签分类上的准确性不足。其关键解决方案在于构建一个鲁棒的ECG编码器用于多模态预训练:首先采用基于CLIP架构的SigLIP模型并引入针对ECG多标签特性的改进损失函数,以提升多标签分类性能;其次通过增加嵌入维度和随机裁剪策略缓解数据漂移问题;最终结合医学知识增强语言模型,并通过逐标签分析明确不同ECG诊断结果的预测难度,为后续医疗AI系统整合ECG数据提供可扩展的基础框架。

链接: https://arxiv.org/abs/2602.10553
作者: Junichiro Takahashi,Masataka Sato,Satoshi Kodeta,Norihiko Takeda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled the development of multimodal medical AI. While models such as MedGemini achieve high accuracy on VQA tasks like USMLE MM, their performance on ECG based tasks remains limited, and some models, such as MedGemma, do not support ECG data at all. Interpreting ECGs is inherently challenging, and diagnostic accuracy can vary depending on the interpreter’s experience. Although echocardiography provides rich diagnostic information, it requires specialized equipment and personnel, limiting its availability. In this study, we focus on constructing a robust ECG encoder for multimodal pretraining using real world hospital data. We employ SigLIP, a CLIP based model with a sigmoid based loss function enabling multi label prediction, and introduce a modified loss function tailored to the multi label nature of ECG data. Experiments demonstrate that incorporating medical knowledge in the language model and applying the modified loss significantly improve multi label ECG classification. To further enhance performance, we increase the embedding dimensionality and apply random cropping to mitigate data drift. Finally, per label analysis reveals which ECG findings are easier or harder to predict. Our study provides a foundational framework for developing medical models that utilize ECG data.

[AI-45] μpscaling small models: Principled warm starts and hyperparameter transfer

【速读】:该论文旨在解决模型上采样(model upscaling)过程中超参数调优成本高且有效性不明确的问题,尤其是在从较小模型初始化较大模型时,直接在目标尺寸上调参代价高昂,而现有基于小模型调参并按缩放定律外推的方法是否仍有效尚不清晰。其解决方案的关键在于:首先提出一种通用的模型上采样方法,适用于多种架构和优化器,并通过理论保证模型与等效宽化版本一致,从而支持对无限宽度极限的严谨分析;其次,将μ Transfer理论扩展为超参数迁移技术,使得在上采样模型中可高效调参,实验证明该方法在真实数据集和架构上具有有效性。

链接: https://arxiv.org/abs/2602.10545
作者: Yuxin Ma,Nan Chen,Mateo Díaz,Soufiane Hayou,Dmitriy Kunisky,Soledad Villar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 61 pages, 6 figures

点击查看摘要

Abstract:Modern large-scale neural networks are often trained and released in multiple sizes to accommodate diverse inference budgets. To improve efficiency, recent work has explored model upscaling: initializing larger models from trained smaller ones in order to transfer knowledge and accelerate convergence. However, this method can be sensitive to hyperparameters that need to be tuned at the target upscaled model size, which is prohibitively costly to do directly. It remains unclear whether the most common workaround – tuning on smaller models and extrapolating via hyperparameter scaling laws – is still sound when using upscaling. We address this with principled approaches to upscaling with respect to model widths and efficiently tuning hyperparameters in this setting. First, motivated by \mu P and any-dimensional architectures, we introduce a general upscaling method applicable to a broad range of architectures and optimizers, backed by theory guaranteeing that models are equivalent to their widened versions and allowing for rigorous analysis of infinite-width limits. Second, we extend the theory of \mu Transfer to a hyperparameter transfer technique for models upscaled using our method and empirically demonstrate that this method is effective on realistic datasets and architectures.

[AI-46] A Swap-Adversarial Framework for Improving Domain Generalization in Electroencephalography-Based Parkinsons Disease Prediction

【速读】:该论文旨在解决帕金森病(Parkinson’s disease, PD)早期预测中因人类研究伦理限制和缺乏公开基准数据集而导致的可复现性差的问题,同时应对电皮层图(ECoG)数据中存在的高个体差异性和高维小样本(High-Dimensional Low-Sample-Size, HDLSS)难题。解决方案的关键在于提出一种Swap-Adversarial Framework (SAF),其核心包括:(1)鲁棒预处理;(2)跨主体平衡通道交换(Inter-Subject Balanced Channel Swap, ISBCS),通过随机交换不同受试者的通道来降低个体间差异;(3)域对抗训练以抑制个体特异性偏差并促进任务相关共享特征的学习。该框架在跨主体、跨会话及跨数据集设置下均显著优于现有基线方法,尤其在高变异性环境中表现突出,并展现出从ECoG到EEG数据的良好泛化能力。

链接: https://arxiv.org/abs/2602.10528
作者: Seongwon Jin,Hanseul Choi,Sunggu Yang,Sungho Park,Jibum Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Electroencephalography (ECoG) offers a promising alternative to conventional electrocorticography (EEG) for the early prediction of Parkinson’s disease (PD), providing higher spatial resolution and a broader frequency range. However, reproducible comparisons has been limited by ethical constraints in human studies and the lack of open benchmark datasets. To address this gap, we introduce a new dataset, the first reproducible benchmark for PD prediction. It is constructed from long-term ECoG recordings of 6-hydroxydopamine (6-OHDA)-induced rat models and annotated with neural responses measured before and after electrical stimulation. In addition, we propose a Swap-Adversarial Framework (SAF) that mitigates high inter-subject variability and the high-dimensional low-sample-size (HDLSS) problem in ECoG data, while achieving robust domain generalization across ECoG and EEG-based Brain-Computer Interface (BCI) datasets. The framework integrates (1) robust preprocessing, (2) Inter-Subject Balanced Channel Swap (ISBCS) for cross-subject augmentation, and (3) domain-adversarial training to suppress subject-specific bias. ISBCS randomly swaps channels between subjects to reduce inter-subject variability, and domain-adversarial training jointly encourages the model to learn task-relevant shared features. We validated the effectiveness of the proposed method through extensive experiments under cross-subject, cross-session, and cross-dataset settings. Our method consistently outperformed all baselines across all settings, showing the most significant improvements in highly variable environments. Furthermore, the proposed method achieved superior cross-dataset performance between public EEG benchmarks, demonstrating strong generalization capability not only within ECoG but to EEG data. The new dataset and source code will be made publicly available upon publication.

[AI-47] AI-PACE: A Framework for Integrating AI into Medical Education

【速读】:该论文试图解决的问题是:随着人工智能(Artificial Intelligence, AI)在医疗领域的加速应用,医学教育未能同步跟进这一技术变革,导致医学生和未来医师缺乏必要的AI素养与能力。解决方案的关键在于构建一个结构化的AI教育框架,强调在医学学习全过程中进行纵向整合(longitudinal integration),推动跨学科协作(interdisciplinary collaboration),并平衡技术基础与临床应用场景的教学重点,从而为培养适应AI增强型医疗环境的未来医生提供系统性指导。

链接: https://arxiv.org/abs/2602.10527
作者: Scott P. McGrath,Katherine K. Kim,Karnjit Johl,Haibo Wang,Nick Anderson
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures,

点击查看摘要

Abstract:The integration of artificial intelligence (AI) into healthcare is accelerating, yet medical education has not kept pace with these technological advancements. This paper synthesizes current knowledge on AI in medical education through a comprehensive analysis of the literature, identifying key competencies, curricular approaches, and implementation strategies. The aim is highlighting the critical need for structured AI education across the medical learning continuum and offer a framework for curriculum development. The findings presented suggest that effective AI education requires longitudinal integration throughout medical training, interdisciplinary collaboration, and balanced attention to both technical fundamentals and clinical applications. This paper serves as a foundation for medical educators seeking to prepare future physicians for an AI-enhanced healthcare environment.

[AI-48] Co-jump: Cooperative Jumping with Quadrupedal Robots via Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决双足或四足机器人在单体跳跃能力受限的问题,即个体机器人由于物理执行器性能瓶颈难以实现高跳躍动作。为突破这一限制,作者提出了一种名为Co-jump的协作任务,通过两只四足机器人协同同步跳跃,实现远超单体能力的跳跃高度。解决方案的关键在于采用基于多智能体近端策略优化(Multi-Agent Proximal Policy Optimization, MAPPO)的强化学习框架,并引入渐进式课程训练策略,有效应对机械耦合系统中稀疏奖励导致的探索困难问题。该方法无需显式通信或预设运动基元,仅依赖本体感觉反馈即可实现精确同步,从而在仿真和真实硬件上成功完成多方向跳跃至高达1.5米平台的任务,其中一只机器人脚端提升至1.1米,相较独立机器人(0.45米)提升144%,验证了其优越的垂直跳跃性能与无通信协作的可行性。

链接: https://arxiv.org/abs/2602.10514
作者: Shihao Dong,Yeke Chen,Zeren Luo,Jiahui Zhang,Bowen Xu,Jinghan Lin,Yimin Han,Ji Ma,Zhiyou Yu,Yudong Zhao,Peng Lu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:While single-agent legged locomotion has witnessed remarkable progress, individual robots remain fundamentally constrained by physical actuation limits. To transcend these boundaries, we introduce Co-jump, a cooperative task where two quadrupedal robots synchronize to execute jumps far beyond their solo capabilities. We tackle the high-impulse contact dynamics of this task under a decentralized setting, achieving synchronization without explicit communication or pre-specified motion primitives. Our framework leverages Multi-Agent Proximal Policy Optimization (MAPPO) enhanced by a progressive curriculum strategy, which effectively overcomes the sparse-reward exploration challenges inherent in mechanically coupled systems. We demonstrate robust performance in simulation and successful transfer to physical hardware, executing multi-directional jumps onto platforms up to 1.5 m in height. Specifically, one of the robots achieves a foot-end elevation of 1.1 m, which represents a 144% improvement over the 0.45 m jump height of a standalone quadrupedal robot, demonstrating superior vertical performance. Notably, this precise coordination is achieved solely through proprioceptive feedback, establishing a foundation for communication-free collaborative locomotion in constrained environments.

[AI-49] Learning Structure-Semantic Evolution Trajectories for Graph Domain Adaptation ICLR2026

【速读】:该论文旨在解决图域适应(Graph Domain Adaptation, GDA)中因图结构连续且非线性演化导致的离散对齐策略失效问题。现有方法通常依赖固定步长的中间图构造或分步对齐,难以准确建模真实场景下源域到目标域的动态变化过程。其解决方案的关键在于提出DiffGDA,一种基于扩散过程的连续时间生成式图域适应方法,通过随机微分方程(SDEs)建模从源图到目标图的结构与语义联合演化过程,并引入领域感知网络引导扩散轨迹沿最优适应路径前进,从而实现更精确的域间知识迁移。理论分析表明该扩散过程在潜在空间中收敛至最优域适配解,实验验证了其在多个真实数据集上的优越性能。

链接: https://arxiv.org/abs/2602.10506
作者: Wei Chen,Xingyu Guo,Shuang Li,Yan Zhong,Zhao Zhang,Fuzhen Zhuang,Hongrui Liu,Libang Zhang,Guo Ye,Huimei He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted by ICLR 2026, 21 pages

点击查看摘要

Abstract:Graph Domain Adaptation (GDA) aims to bridge distribution shifts between domains by transferring knowledge from well-labeled source graphs to given unlabeled target graphs. One promising recent approach addresses graph transfer by discretizing the adaptation process, typically through the construction of intermediate graphs or stepwise alignment procedures. However, such discrete strategies often fail in real-world scenarios, where graph structures evolve continuously and nonlinearly, making it difficult for fixed-step alignment to approximate the actual transformation process. To address these limitations, we propose \textbfDiffGDA, a \textbfDiffusion-based \textbfGDA method that models the domain adaptation process as a continuous-time generative process. We formulate the evolution from source to target graphs using stochastic differential equations (SDEs), enabling the joint modeling of structural and semantic transitions. To guide this evolution, a domain-aware network is introduced to steer the generative process toward the target domain, encouraging the diffusion trajectory to follow an optimal adaptation path. We theoretically show that the diffusion process converges to the optimal solution bridging the source and target domains in the latent space. Extensive experiments on 14 graph transfer tasks across 8 real-world datasets demonstrate DiffGDA consistently outperforms state-of-the-art baselines.

[AI-50] Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks

【速读】:该论文试图解决的问题是:在过参数化的Transformer模型中,学习动态的几何结构如何影响模型的训练行为与计算机制。解决方案的关键在于发现训练轨迹在高维参数空间(d=128)中迅速坍缩至一个低维执行流形(execution manifold,维度为3–4),这一几何结构解释了多个经验现象,包括注意力集中、随机梯度下降(SGD)在投影到该子空间时近似可积的动力学特性,以及稀疏自编码器无法分离核心执行机制而仅能捕捉辅助路由结构的事实。研究揭示,绝大多数参数主要用于吸收优化干扰,而核心计算集中在低维流形上,从而提供了一个统一的几何框架来理解Transformer的学习过程。

链接: https://arxiv.org/abs/2602.10496
作者: Yongzhong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:We investigate the geometric structure of learning dynamics in overparameterized transformer models through carefully controlled modular arithmetic tasks. Our primary finding is that despite operating in high-dimensional parameter spaces ( d=128 ), transformer training trajectories rapidly collapse onto low-dimensional execution manifolds of dimension 3 – 4 . This dimensional collapse is robust across random seeds and moderate task difficulties, though the orientation of the manifold in parameter space varies between runs. We demonstrate that this geometric structure underlies several empirically observed phenomena: (1) sharp attention concentration emerges as saturation along routing coordinates within the execution manifold, (2) stochastic gradient descent (SGD) exhibits approximately integrable dynamics when projected onto the execution subspace, with non-integrability confined to orthogonal staging directions, and (3) sparse autoencoders capture auxiliary routing structure but fail to isolate execution itself, which remains distributed across the low-dimensional manifold. Our results suggest a unifying geometric framework for understanding transformer learning, where the vast majority of parameters serve to absorb optimization interference while core computation occurs in a dramatically reduced subspace. These findings have implications for interpretability, training curriculum design, and understanding the role of overparameterization in neural network learning.

[AI-51] Learning Adaptive Distribution Alignment with Neural Characteristic Function for Graph Domain Adaptation ICLR2026

【速读】:该论文旨在解决图域适应(Graph Domain Adaptation, GDA)中因复杂且多维度的分布偏移(distributional shifts)导致的知识迁移效果不佳的问题。现有方法通常依赖人工设计的图元素(如节点属性或结构统计量)进行对齐,但这类方法灵活性差,难以应对不同场景下主导差异的变化。其解决方案的关键在于提出一种自适应分布对齐框架ADAlign,该框架无需人工指定对齐标准,能够自动识别并联合对齐每种迁移任务中最相关的分布差异,从而捕捉属性、结构及其依赖关系之间的相互作用。为实现这一目标,作者引入了神经谱距离(Neural Spectral Discrepancy, NSD),这是一种理论严谨的参数化距离度量,通过谱域中的神经特征函数编码任意阶次的特征-结构依赖关系,并结合可学习的频率采样器,基于极小极大范式动态聚焦于每个任务中最信息丰富的频谱成分,从而实现高效且鲁棒的跨图分布对齐。

链接: https://arxiv.org/abs/2602.10489
作者: Wei Chen,Xingyu Guo,Shuang Li,Zhao Zhang,Yan Zhong,Fuzhen Zhuang,Deqing wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026, 24 pages

点击查看摘要

Abstract:Graph Domain Adaptation (GDA) transfers knowledge from labeled source graphs to unlabeled target graphs but is challenged by complex, multi-faceted distributional shifts. Existing methods attempt to reduce distributional shifts by aligning manually selected graph elements (e.g., node attributes or structural statistics), which typically require manually designed graph filters to extract relevant features before alignment. However, such approaches are inflexible: they rely on scenario-specific heuristics, and struggle when dominant discrepancies vary across transfer scenarios. To address these limitations, we propose \textbfADAlign, an Adaptive Distribution Alignment framework for GDA. Unlike heuristic methods, ADAlign requires no manual specification of alignment criteria. It automatically identifies the most relevant discrepancies in each transfer and aligns them jointly, capturing the interplay between attributes, structures, and their dependencies. This makes ADAlign flexible, scenario-aware, and robust to diverse and dynamically evolving shifts. To enable this adaptivity, we introduce the Neural Spectral Discrepancy (NSD), a theoretically principled parametric distance that provides a unified view of cross-graph shifts. NSD leverages neural characteristic function in the spectral domain to encode feature-structure dependencies of all orders, while a learnable frequency sampler adaptively emphasizes the most informative spectral components for each task via minimax paradigm. Extensive experiments on 10 datasets and 16 transfer tasks show that ADAlign not only outperforms state-of-the-art baselines but also achieves efficiency gains with lower memory usage and faster training.

[AI-52] Abstraction Generation for Generalized Planning with Pretrained Large Language Models

【速读】:该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)自动生成定性数值规划(Qualitative Numerical Planning, QNP)抽象,从而支持广义规划(Generalized Planning, GP)问题的求解,并通过自动化调试方法修正抽象过程中的错误。其解决方案的关键在于提出了一种提示协议(prompt protocol),将GP领域和训练任务输入LLMs,引导其生成抽象特征并构建QNP问题结构(包括初始状态、动作集和目标的抽象);同时设计了自动化调试机制以检测并修复抽象错误,从而提升LLMs生成有效QNP抽象的能力。

链接: https://arxiv.org/abs/2602.10485
作者: Zhenhe Cui,Huaxiang Xia,Hangjun Shen,Kailun Luo,Yong He,Wei Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Qualitative Numerical Planning (QNP) serves as an important abstraction model for generalized planning (GP), which aims to compute general plans that solve multiple instances at once. Recent works show that large language models (LLMs) can function as generalized planners. This work investigates whether LLMs can serve as QNP abstraction generators for GP problems and how to fix abstractions via automated debugging. We propose a prompt protocol: input a GP domain and training tasks to LLMs, prompting them to generate abstract features and further abstract the initial state, action set, and goal into QNP problems. An automated debugging method is designed to detect abstraction errors, guiding LLMs to fix abstractions. Experiments demonstrate that under properly guided by automated debugging, some LLMs can generate useful QNP abstractions.

[AI-53] Driving Reaction Trajectories via Latent Flow Matching

【速读】:该论文旨在解决当前反应预测模型在准确性接近饱和的同时,缺乏对反应过程内在机制的可解释性与诊断能力的问题。现有方法多采用“一次性映射”(one-shot mapping)策略,难以揭示反应路径细节;而部分分步生成模型则依赖机制特异性标注或离散符号操作,限制了通用性和效率。其解决方案的关键在于提出LatentRxnFlow——一种基于条件流匹配(Conditional Flow Matching)的连续潜空间轨迹建模方法,将反应视为从反应物到产物的连续潜变量演化过程,无需机制注释或中间体标签即可学习时间依赖的潜动态。该框架不仅在USPTO基准上达到最先进性能,更重要的是通过暴露完整的生成轨迹,实现了轨迹级别的诊断分析,如定位失败模式、通过门控推理修正错误,并利用轨迹几何特性提供内生的认知不确定性信号,从而提升模型的可解释性、可诊断性和不确定性感知能力,推动反应预测在高通量发现流程中的可信部署。

链接: https://arxiv.org/abs/2602.10476
作者: Yili Shen,Xiangliang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in reaction prediction have achieved near-saturated accuracy on standard benchmarks (e.g., USPTO), yet most state-of-the-art models formulate the task as a one-shot mapping from reactants to products, offering limited insight into the underlying reaction process. Procedural alternatives introduce stepwise generation but often rely on mechanism-specific supervision, discrete symbolic edits, and computationally expensive inference. In this work, we propose LatentRxnFlow, a new reaction prediction paradigm that models reactions as continuous latent trajectories anchored at the thermodynamic product state. Built on Conditional Flow Matching, our approach learns time-dependent latent dynamics directly from standard reactant-product pairs, without requiring mechanistic annotations or curated intermediate labels. While LatentRxnFlow achieves state-of-the-art performance on USPTO benchmarks, more importantly, the continuous formulation exposes the full generative trajectory, enabling trajectory-level diagnostics that are difficult to realize with discrete or one-shot models. We show that latent trajectory analysis allows us to localize and characterize failure modes and to mitigate certain errors via gated inference. Furthermore, geometric properties of the learned trajectories provide an intrinsic signal of epistemic uncertainty, helping prioritize reliably predictable reaction outcomes and flag ambiguous cases for additional validation. Overall, LatentRxnFlow combines strong predictive accuracy with improved transparency, diagnosability, and uncertainty awareness, moving reaction prediction toward more trustworthy deployment in high-throughput discovery workflows.

[AI-54] MERIT Feedback Elicits Better Bargaining in LLM Negotiators

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在谈判场景中策略深度不足、难以适应复杂人类因素的问题,这些问题导致其在实际应用中与人类偏好存在显著偏差。现有基准测试也未能充分刻画这一局限性。解决方案的关键在于提出一个以效用反馈为核心的框架,包含三个核心组成部分:(i) AgoraBench 基准,涵盖九种具有挑战性的谈判情境(如欺骗、垄断),支持多样化策略建模;(ii) 基于效用理论的人类对齐指标体系,通过代理效用(agent utility)、议价能力(negotiation power)和获取比例(acquisition ratio)隐式衡量谈判结果与人类偏好的一致性;(iii) 一个基于人类偏好的数据集及学习流程,结合提示(prompting)与微调(fine-tuning)增强LLMs的谈判能力。实证结果表明,该机制显著提升了谈判表现,使模型展现出更深层次的战略行为和更强的对手感知能力。

链接: https://arxiv.org/abs/2602.10467
作者: Jihwan Oh,Murad Aghazada,Yooju Shin,Se-Young Yun,Taehyeon Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint. arXiv admin note: substantial text overlap with arXiv:2505.22998

点击查看摘要

Abstract:Bargaining is often regarded as a logical arena rather than an art or a matter of intuition, yet Large Language Models (LLMs) still struggle to navigate it due to limited strategic depth and difficulty adapting to complex human factors. Current benchmarks rarely capture this limitation. To bridge this gap, we present an utility feedback centric framework. Our contributions are: (i) AgoraBench, a new benchmark spanning nine challenging settings (e.g., deception, monopoly) that supports diverse strategy modeling; (ii) human-aligned, economically grounded metrics derived from utility theory. This is operationalized via agent utility, negotiation power, and acquisition ratio that implicitly measure how well the negotiation aligns with human preference and (iii) a human preference grounded dataset with learning pipeline that strengthens LLMs’ bargaining ability through both prompting and finetuning. Empirical results indicate that baseline LLM strategies often diverge from human preferences, while our mechanism substantially improves negotiation performance, yielding deeper strategic behavior and stronger opponent awareness.

[AI-55] Found-RL: foundation model-enhanced reinforcement learning for autonomous driving

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在端到端自动驾驶(Autonomous Driving, AD)中面临的样本效率低和复杂场景下语义可解释性不足的问题,同时克服基础模型(尤其是视觉-语言模型 Vision-Language Models, VLMs)因推理延迟高而难以部署于高频RL训练循环的挑战。解决方案的关键在于提出Found-RL平台,其核心创新是异步批量推理框架,通过将VLM的重型推理任务与仿真循环解耦,有效缓解延迟瓶颈以支持实时学习;此外,引入价值边际正则化(Value-Margin Regularization, VMR)和优势加权动作引导(Advantage-Weighted Action Guidance, AWAG)机制,实现专家级VLM动作建议向RL策略的有效蒸馏,并采用高吞吐量CLIP进行密集奖励塑造,结合条件对比动作对齐(Conditional Contrastive Action Alignment)缓解CLIP动态盲区问题,从而在保持约500 FPS实时推理能力的前提下,使轻量RL模型逼近百亿参数VLM的性能表现。

链接: https://arxiv.org/abs/2602.10458
作者: Yansong Qu,Zihao Sheng,Zilin Huang,Jiancong Chen,Yuhao Luo,Tianyi Wang,Yiheng Feng,Samuel Labi,Sikai Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 39 pages

点击查看摘要

Abstract:Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation Models, particularly Vision-Language Models (VLMs), can mitigate this by offering rich, context-aware knowledge, yet their high inference latency hinders deployment in high-frequency RL training loops. To bridge this gap, we present Found-RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is the asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop, effectively resolving latency bottlenecks to support real-time learning. We introduce diverse supervision mechanisms: Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG) to effectively distill expert-like VLM action suggestions into the RL policy. Additionally, we adopt high-throughput CLIP for dense reward shaping. We address CLIP’s dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin-based bonus from context-specific action-anchor scoring. Found-RL provides an end-to-end pipeline for fine-tuned VLM integration and shows that a lightweight RL model can achieve near-VLM performance compared with billion-parameter VLMs while sustaining real-time inference (approx. 500 FPS). Code, data, and models will be publicly available at this https URL.

[AI-56] Constructing Industrial-Scale Optimization Modeling Benchmark

【速读】:该论文旨在解决将自然语言描述转化为正确优化模型和可执行求解器代码这一任务在工业场景中面临的挑战,尤其是现有评估基准多为小型或合成数据,无法反映真实世界中变量与约束数量达 10310^310610^6 级别的复杂优化问题。其解决方案的关键在于构建了 MIPLIB-NL 基准数据集,通过结构感知的逆向构造方法从 MIPLIB 2017 中的真实混合整数线性规划(Mixed-Integer Linear Programming, MILP)实例出发,依次完成:(i) 从扁平化的求解器公式中恢复紧凑且可复用的模型结构;(ii) 在统一的“模型-数据分离”格式下生成与该结构明确绑定的自然语言规格说明;(iii) 通过专家评审与人-大语言模型(Large Language Model, LLM)交互式语义验证迭代完善,最终实现 223 个一对一重建实例,既保留原始数学内容又支持现实场景下的自然语言到优化建模评估。实验表明,该基准能揭示当前系统在玩具规模基准上表现良好但在真实问题中显著退化的性能缺陷。

链接: https://arxiv.org/abs/2602.10450
作者: Zhong Li,Hongliang Lu,Tao Wei,Wenyu Liu,Yuxuan Chen,Yuan Lan,Fan Zhang,Zaiwen Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with 10^3 – 10^6 (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. To fill in this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB~2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model–data separation format, and (iii) performs iterative semantic validation through expert review and human–LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.

[AI-57] A Unified Theory of Random Projection for Influence Functions

【速读】:该论文旨在解决在现代过参数化模型中,如何高效且可靠地计算影响函数(influence function)的问题。影响函数通常表示为 $ g^\top F^{-1} g’ $,其中 $ F \succeq 0 $ 是曲率算子(curvature operator),但在高维场景下直接构造或求逆 $ F $ 计算开销过大。为此,论文提出通过随机投影(random projection)构造低维 sketch $ P \in \mathbb{R}^{m \times d} $ 来近似保留影响函数的数值性质。其核心贡献在于建立了一个统一理论框架,明确刻画了投影在何种条件下能严格保持影响函数的准确性:(1) 对于未正则化情形,当且仅当投影矩阵 $ P $ 在 $ F $ 的列空间(range)上是单射时,可实现精确保留,此时需满足 $ m \geq \text{rank}(F) $;(2) 引入岭正则化后,投影的保真性由 $ F $ 在正则化尺度下的有效维度决定,从而显著放宽了对 sketch 大小的要求;(3) 对于 Kronecker 结构的曲率 $ F = A \otimes E $,即使使用非独立同分布的分块投影 $ P = P_A \otimes P_E $,仍可保证理论有效性。此外,论文还分析了测试梯度不在 $ \text{range}(F) $ 中时的“泄漏项”(leakage term),并给出了适用于任意测试点的影响函数查询保证。这一理论为实际选择 sketch 大小提供了原则性指导,填补了现有随机投影方法缺乏理论保障的空白。

链接: https://arxiv.org/abs/2602.10449
作者: Pingbang Hu,Yuzheng Hu,Jiaqi W. Ma,Han Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 46 pages, 4 figures

点击查看摘要

Abstract:Influence functions and related data attribution scores take the form of g^\topF^-1g^\prime , where F\succeq 0 is a curvature operator. In modern overparameterized models, forming or inverting F\in\mathbbR^d\times d is prohibitive, motivating scalable influence computation via random projection with a sketch P \in \mathbbR^m\times d . This practice is commonly justified via the Johnson–Lindenstrauss (JL) lemma, which ensures approximate preservation of Euclidean geometry for a fixed dataset. However, JL does not address how sketching behaves under inversion. Furthermore, there is no existing theory that explains how sketching interacts with other widely-used techniques, such as ridge regularization and structured curvature approximations. We develop a unified theory characterizing when projection provably preserves influence functions. When g,g^\prime\in\textrange(F) , we show that: 1) Unregularized projection: exact preservation holds iff P is injective on \textrange(F) , which necessitates m\geq \textrank(F) ; 2) Regularized projection: ridge regularization fundamentally alters the sketching barrier, with approximation guarantees governed by the effective dimension of F at the regularization scale; 3) Factorized influence: for Kronecker-factored curvatures F=A\otimes E , the guarantees continue to hold for decoupled sketches P=P_A\otimes P_E , even though such sketches exhibit row correlations that violate i.i.d. assumptions. Beyond this range-restricted setting, we analyze out-of-range test gradients and quantify a \emphleakage term that arises when test gradients have components in \ker(F) . This yields guarantees for influence queries on general test points. Overall, this work develops a novel theory that characterizes when projection provably preserves influence and provides principled guidance for choosing the sketch size in practice. Comments: 46 pages, 4 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.10449 [cs.LG] (or arXiv:2602.10449v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.10449 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-58] LakeMLB: Data Lake Machine Learning Benchmark

【速读】:该论文旨在解决当前数据湖(Data Lake)环境中缺乏标准化机器学习性能评估基准的问题。针对这一挑战,作者提出了LakeMLB(Data Lake Machine Learning Benchmark),其关键在于聚焦于数据湖中最常见的多源、多表场景——Union和Join,并提供涵盖政府开放数据、金融、维基百科及在线市场等领域的三组真实世界数据集,同时支持预训练、数据增强和特征增强三种代表性集成策略,从而为复杂数据湖场景下的表格学习方法提供系统化、可复现的评测框架。

链接: https://arxiv.org/abs/2602.10441
作者: Feiyu Pan,Tianbin Zhang,Aoqian Zhang,Yu Sun,Zheng Wang,Lixing Chen,Li Pan,Jianhua Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures. Preprint

点击查看摘要

Abstract:Modern data lakes have emerged as foundational platforms for large-scale machine learning, enabling flexible storage of heterogeneous data and structured analytics through table-oriented abstractions. Despite their growing importance, standardized benchmarks for evaluating machine learning performance in data lake environments remain scarce. To address this gap, we present LakeMLB (Data Lake Machine Learning Benchmark), designed for the most common multi-source, multi-table scenarios in data lakes. LakeMLB focuses on two representative multi-table scenarios, Union and Join, and provides three real-world datasets for each scenario, covering government open data, finance, Wikipedia, and online marketplaces. The benchmark supports three representative integration strategies: pre-training-based, data augmentation-based, and feature augmentation-based approaches. We conduct extensive experiments with state-of-the-art tabular learning methods, offering insights into their performance under complex data lake scenarios. We release both datasets and code to facilitate rigorous research on machine learning in data lake ecosystems; the benchmark is available at this https URL.

[AI-59] AudioRouter: Data Efficient Audio Understanding via RL based Dual Reasoning

【速读】:该论文旨在解决大型音频语言模型(Large Audio Language Models, LALMs)在细粒度听觉感知任务上表现不稳定的问题,尤其是现有方法高度依赖大量训练数据来内化感知能力,导致效率低下。解决方案的关键在于提出AudioRouter——一个基于强化学习的框架,通过显式建模工具使用决策过程,使LALMs能够自主判断何时以及如何调用外部音频工具,而非将工具使用与音频推理紧密耦合;该方法优化一个轻量级路由策略(routing policy),同时保持底层推理模型冻结,从而在仅需传统训练数据量约1/600的情况下显著提升音频理解性能,验证了学习有效工具使用是一种数据高效且可扩展的替代方案。

链接: https://arxiv.org/abs/2602.10439
作者: Liyang Chen,Hongkai Chen,Yujun Cai,Sifan Li,Qingwen Ye,Yiwei Wang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine grained auditory perception remains unreliable, and existing approaches largely rely on data intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage compared with conventional training paradigms. These findings suggest that learning effective tool usage offers a data efficient and scalable alternative to internalizing perceptual abilities in LALMs.

[AI-60] A Dual-Stream Physics-Augmented Unsupervised Architecture for Runtime Embedded Vehicle Health Monitoring

【速读】:该论文旨在解决传统车辆运行强度量化方法(如里程数)无法准确反映机械负荷,以及现有无监督深度学习模型在检测统计异常时易将高负载稳态工况(如重载爬坡)误判为“机械静止”的问题,这一盲点导致对传动系统疲劳损伤的评估严重不足。解决方案的关键在于提出一种双流架构(Dual-Stream Architecture),通过融合无监督学习用于表面异常检测与基于宏观物理代理(macroscopic physics proxies)的累积载荷估计,利用低频传感器数据生成多维健康向量,从而有效区分动态危害与持续机械努力,实现在资源受限ECU上的边缘化、低延迟健康监测。

链接: https://arxiv.org/abs/2602.10432
作者: Enzo Nicolas Spotorno,Antonio Augusto Medeiros Frohlich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, submitted to IEEE ISIE 2026

点击查看摘要

Abstract:Runtime quantification of vehicle operational intensity is essential for predictive maintenance and condition monitoring in commercial and heavy-duty fleets. Traditional metrics like mileage fail to capture mechanical burden, while unsupervised deep learning models detect statistical anomalies, typically transient surface shocks, but often conflate statistical stability with mechanical rest. We identify this as a critical blind spot: high-load steady states, such as hill climbing with heavy payloads, appear statistically normal yet impose significant drivetrain fatigue. To resolve this, we propose a Dual-Stream Architecture that fuses unsupervised learning for surface anomaly detection with macroscopic physics proxies for cumulative load estimation. This approach leverages low-frequency sensor data to generate a multi-dimensional health vector, distinguishing between dynamic hazards and sustained mechanical effort. Validated on a RISC-V embedded platform, the architecture demonstrates low computational overhead, enabling comprehensive, edge-based health monitoring on resource-constrained ECUs without the latency or bandwidth costs of cloud-based monitoring.

[AI-61] Breaking the Curse of Repulsion: Optimistic Distributionally Robust Policy Optimization for Off-Policy Generative Recommendation

【速读】:该论文旨在解决基于策略的强化学习(Policy-based Reinforcement Learning, RL)在离线推荐场景中因低质量数据主导而导致模型崩溃的问题。现有方法在离线历史日志上训练时,由于负梯度更新引发指数级强度爆炸(exponential intensity explosion),导致难以同时实现方差缩减与噪声模仿的平衡。解决方案的关键在于提出一种分布鲁棒优化(Distributionally Robust Optimization, DRO)框架下的乐观策略优化方法(Distributionally Robust Policy Optimization, DRPO),其核心思想是通过精确识别嵌套在噪声行为策略中的潜在高质量分布,将问题重构为一个硬过滤(hard filtering)最优解,从而在严格剔除诱发发散的噪声的同时,最优恢复高质量行为模式。

链接: https://arxiv.org/abs/2602.10430
作者: Jie Jiang,Yusen Huo,Xiangxin Zhan,Changping Wang,Jun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures

点击查看摘要

Abstract:Policy-based Reinforcement Learning (RL) has established itself as the dominant paradigm in generative recommendation for optimizing sequential user interactions. However, when applied to offline historical logs, these methods suffer a critical failure: the dominance of low-quality data induces severe model collapse. We first establish the Divergence Theory of Repulsive Optimization, revealing that negative gradient updates inherently trigger exponential intensity explosion during off-policy training. This theory elucidates the inherent dilemma of existing methods, exposing their inability to reconcile variance reduction and noise imitation. To break this curse, we argue that the solution lies in rigorously identifying the latent high-quality distribution entangled within the noisy behavior policy. Accordingly, we reformulate the objective as an Optimistic Distributionally Robust Optimization (DRO) problem. Guided by this formulation, we propose Distributionally Robust Policy Optimization (DRPO). We prove that hard filtering is the exact solution to this DRO objective, enabling DRPO to optimally recover high-quality behaviors while strictly discarding divergence-inducing noise. Extensive experiments demonstrate that DRPO achieves state-of-the-art performance on mixed-quality recommendation benchmarks.

[AI-62] Equivariant Evidential Deep Learning for Interatomic Potentials

【速读】:该论文旨在解决机器学习原子势能(MLIP)在分子动力学(MD)模拟中不确定性量化(UQ)的可靠性问题,特别是现有方法普遍存在计算成本高或性能不佳的局限性。其核心挑战在于如何将证据深度学习(EDL)从标量输出扩展到矢量值物理量(如原子力),同时保持旋转下的统计自洽性。解决方案的关键在于提出了一种等变证据深度学习框架(Equivariant Evidential Deep Learning for Interatomic Potentials, \texte^2 IP),通过将原子力及其不确定性联合建模为一个3×3对称正定协方差张量,并确保该张量在旋转下满足等变变换性质,从而实现高效、准确且物理一致的不确定性估计。该方法在多个分子基准测试中展现出优于非等变证据基线和广泛使用的集成方法的精度-效率-可靠性平衡,同时通过完全等变架构提升了数据效率并保留了单模型推理优势。

链接: https://arxiv.org/abs/2602.10419
作者: Zhongyao Wang,Taoyong Cui,Jiawen Zou,Shufei Zhang,Bo Yan,Wanli Ouyang,Weimin Tan,Mao Su
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Uncertainty quantification (UQ) is critical for assessing the reliability of machine learning interatomic potentials (MLIPs) in molecular dynamics (MD) simulations, identifying extrapolation regimes and enabling uncertainty-aware workflows such as active learning for training dataset construction. Existing UQ approaches for MLIPs are often limited by high computational cost or suboptimal performance. Evidential deep learning (EDL) provides a theoretically grounded single-model alternative that determines both aleatoric and epistemic uncertainty in a single forward pass. However, extending evidential formulations from scalar targets to vector-valued quantities such as atomic forces introduces substantial challenges, particularly in maintaining statistical self-consistency under rotational transformations. To address this, we propose \textitEquivariant Evidential Deep Learning for Interatomic Potentials ( \texte^2 IP), a backbone-agnostic framework that models atomic forces and their uncertainty jointly by representing uncertainty as a full 3\times3 symmetric positive definite covariance tensor that transforms equivariantly under rotations. Experiments on diverse molecular benchmarks show that \texte^2 IP provides a stronger accuracy-efficiency-reliability balance than the non-equivariant evidential baseline and the widely used ensemble method. It also achieves better data efficiency through the fully equivariant architecture while retaining single-model inference efficiency.

[AI-63] Affordances Enable Partial World Modeling with LLM s

【速读】:该论文旨在解决如何高效且准确地利用大规模预训练模型(如生成式 AI)在复杂任务中进行决策与搜索的问题。传统方法直接使用全量模型进行搜索效率低且精度不足,而局部模型虽能提升预测质量但缺乏系统性构建方法。其核心解决方案是提出“部分世界模型”(partial world model)的概念,并证明:任何实现任务无关、语言条件驱动意图的智能体必然具备由具身 affordance(可操作性)所引导的预测性局部模型。论文进一步引入分布鲁棒的 affordance 机制,在多任务场景下提取出高效的局部模型,显著降低搜索分支因子并提升奖励表现,实验证明其优于全量世界模型。

链接: https://arxiv.org/abs/2602.10390
作者: Khimya Khetarpal,Gheorghe Comanici,Jonathan Richens,Jeremy Shar,Fei Xia,Laurent Orseau,Aleksandra Faust,Doina Precup
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures, 2 Tables

点击查看摘要

Abstract:Full models of the world require complex knowledge of immense detail. While pre-trained large models have been hypothesized to contain similar knowledge due to extensive pre-training on vast amounts of internet scale data, using them directly in a search procedure is inefficient and inaccurate. Conversely, partial models focus on making high quality predictions for a subset of state and actions: those linked through affordances that achieve user intents~\citepkhetarpal2020can. Can we posit large models as partial world models? We provide a formal answer to this question, proving that agents achieving task-agnostic, language-conditioned intents necessarily possess predictive partial-world models informed by affordances. In the multi-task setting, we introduce distribution-robust affordances and show that partial models can be extracted to significantly improve search efficiency. Empirical evaluations in tabletop robotics tasks demonstrate that our affordance-aware partial models reduce the search branching factor and achieve higher rewards compared to full world models.

[AI-64] Making Databases Faster with LLM Evolutionary Sampling

【速读】:该论文旨在解决传统基于成本的查询优化器在处理复杂语义关联时效率不足的问题,尤其是在面对查询和模式之间的语义相关性时,现有启发式策略难以生成最优物理执行计划。其解决方案的关键在于利用大语言模型(Large Language Model, LLM)对查询语义的理解能力,通过DBPlanBench框架将DataFusion引擎的物理计划以紧凑的序列化形式暴露出来,使LLM能够提出局部优化建议(如调整连接顺序以最小化中间结果基数),并结合进化搜索算法迭代优化候选方案。实验表明,该方法可在某些查询上实现最高4.78倍的性能提升,并验证了从小型数据库获得的优化策略可有效迁移至大型数据库场景。

链接: https://arxiv.org/abs/2602.10387
作者: Mehmet Hamza Erol,Xiangpeng Hao,Federico Bianchi,Ciro Greco,Jacopo Tagliabue,James Zou
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional query optimization relies on cost-based optimizers that estimate execution cost (e.g., runtime, memory, and I/O) using predefined heuristics and statistical models. Improving these heuristics requires substantial engineering effort, and even when implemented, these heuristics often cannot take into account semantic correlations in queries and schemas that could enable better physical plans. Using our DBPlanBench harness for the DataFusion engine, we expose the physical plan through a compact serialized representation and let the LLM propose localized edits that can be applied and executed. We then apply an evolutionary search over these edits to refine candidates across iterations. Our key insight is that LLMs can leverage semantic knowledge to identify and apply non-obvious optimizations, such as join orderings that minimize intermediate cardinalities. We obtain up to 4.78 \times speedups on some queries and we demonstrate a small-to-large workflow in which optimizations found on small databases transfer effectively to larger databases.

[AI-65] me-to-Event Transformer to Capture Timing Attention of Events in EHR Time Series

【速读】:该论文旨在解决从大规模时间序列数据中自动发现个性化顺序事件的问题,以支持临床研究中的精准医疗(Precision Medicine),尤其在预测乳腺癌患者心脏毒性诱发的心脏病发病时间方面。当前主流AI模型(如Transformer)虽能捕捉事件间的复杂关联,但对事件的时间顺序和时序信息不敏感,难以进行因果推理。其解决方案的关键在于提出一种新型的Timing-Transformer架构——LITT,该架构将时间视为可计算维度,通过构建虚拟的“相对时间轴”实现事件在时间上的对齐,从而引入事件定时导向的关注机制(event-timing-focused attention),使模型能够为候选事件分配相对时间戳,并识别出个体轨迹中具有一致性的重要事件模式,进而提升对临床轨迹的个性化解释能力与预测精度。

链接: https://arxiv.org/abs/2602.10385
作者: Jia Li,Yu Hou,Rui Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages of body text

点击查看摘要

Abstract:Automatically discovering personalized sequential events from large-scale time-series data is crucial for enabling precision medicine in clinical research, yet it remains a formidable challenge even for contemporary AI models. For example, while transformers capture rich associations, they are mostly agnostic to event timing and ordering, thereby bypassing potential causal reasoning. Intuitively, we need a method capable of evaluating the “degree of alignment” among patient-specific trajectories and identifying their shared patterns, i.e., the significant events in a consistent sequence. This necessitates treating timing as a true \emphcomputable dimension, allowing models to assign relative timestamps'' to candidate events beyond their observed physical times. In this work, we introduce LITT, a novel Timing-Transformer architecture that enables temporary alignment of sequential events on a virtual relative timeline’', thereby enabling \emphevent-timing-focused attention and personalized interpretations of clinical trajectories. Its interpretability and effectiveness are validated on real-world longitudinal EHR data from 3,276 breast cancer patients to predict the onset timing of cardiotoxicity-induced heart disease. Furthermore, LITT outperforms both the benchmark and state-of-the-art survival analysis methods on public datasets, positioning it as a significant step forward for precision medicine in clinical AI. Comments: 7 pages of body text Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.10385 [cs.LG] (or arXiv:2602.10385v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.10385 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-66] LiveMedBench: A Contamination-Free Medical Benchmark for LLM s with Automated Rubric Evaluation

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在高风险临床场景中部署时面临的评估可靠性问题,具体包括两个核心挑战:一是医疗基准测试数据存在污染(data contamination),即测试集无意中泄露至训练语料中,导致性能估计虚高;二是基准缺乏时效性(temporal misalignment),无法反映医学知识的快速演进。此外,现有开放式临床推理评价指标依赖浅层词法重叠(如ROUGE)或LLM作为评判者(LLM-as-a-Judge)评分,难以验证临床准确性。解决方案的关键在于提出LiveMedBench——一个持续更新、无数据污染且基于评分量表(rubric-based)的基准平台,其每周从在线医疗社区采集真实世界病例以确保与模型训练数据的时间隔离;同时引入多智能体临床整理框架(Multi-Agent Clinical Curation Framework)过滤原始噪声并依据循证医学原则验证临床完整性,并设计自动化评分量表评估框架(Automated Rubric-based Evaluation Framework),将医生回答分解为细粒度、案例特定的标准,显著优于LLM-as-a-Judge方法与专家医师的一致性。

链接: https://arxiv.org/abs/2602.10367
作者: Zhiling Yan,Dingjie Song,Zhe Fang,Yisheng Ji,Xiang Li,Quanzheng Li,Lichao Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2) temporal misalignment, failing to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, and rubric-based benchmark that weekly harvests real-world clinical cases from online medical communities, ensuring strict temporal separation from model training data. We propose a Multi-Agent Clinical Curation Framework that filters raw data noise and validates clinical integrity against evidence-based medical principles. For evaluation, we develop an Automated Rubric-based Evaluation Framework that decomposes physician responses into granular, case-specific criteria, achieving substantially stronger alignment with expert physicians than LLM-as-a-Judge. To date, LiveMedBench comprises 2,756 real-world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best-performing model achieves only 39.2%, and 84% of models exhibit performance degradation on post-cutoff cases, confirming pervasive data contamination risks. Error analysis further identifies contextual application-not factual knowledge-as the dominant bottleneck, with 35-48% of failures stemming from the inability to tailor medical knowledge to patient-specific constraints.

[AI-67] Confounding Robust Continuous Control via Automatic Reward Shaping AAMAS2026

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中奖励塑形(Reward Shaping)函数设计缺乏理论指导的问题,尤其是在复杂连续控制任务中,如何在存在未观测混杂变量(unobserved confounding variables)的情况下仍能有效加速训练。其解决方案的关键在于:基于最近提出的因果贝尔曼方程(causal Bellman equation),从离线数据集中自动学习一个紧致的最优状态值上界,并将其作为潜在势函数(potential)引入到基于势的奖励塑形(Potential-Based Reward Shaping, PBRS)框架中,从而在不依赖显式环境模型的前提下,提升策略学习对混杂因素的鲁棒性。

链接: https://arxiv.org/abs/2602.10305
作者: Mateo Juliani,Mingxuan Li,Elias Bareinboim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Mateo Juliani and Mingxuan Li contributed equally to this work; accepted in AAMAS 2026

点击查看摘要

Abstract:Reward shaping has been applied widely to accelerate Reinforcement Learning (RL) agents’ training. However, a principled way of designing effective reward shaping functions, especially for complex continuous control problems, remains largely under-explained. In this work, we propose to automatically learn a reward shaping function for continuous control problems from offline datasets, potentially contaminated by unobserved confounding variables. Specifically, our method builds upon the recently proposed causal Bellman equation to learn a tight upper bound on the optimal state values, which is then used as the potentials in the Potential-Based Reward Shaping (PBRS) framework. Our proposed reward shaping algorithm is tested with Soft-Actor-Critic (SAC) on multiple commonly used continuous control benchmarks and exhibits strong performance guarantees under unobserved confounders. More broadly, our work marks a solid first step towards confounding robust continuous control from a causal perspective. Code for training our reward shaping functions can be found at this https URL.

[AI-68] From Classical to Topological Neural Networks Under Uncertainty

【速读】:该论文旨在解决军事领域中人工智能模型在处理复杂数据(如图像、时间序列和图结构数据)时面临的鲁棒性、可解释性和泛化能力不足的问题。其解决方案的关键在于融合拓扑数据分析(Topological Data Analysis, TDA)与深度学习技术,构建拓扑感知(topology-aware)模型,并结合统计贝叶斯方法实现不确定性建模(uncertainty-aware modeling),从而提升模型对噪声和分布外数据的适应能力,同时增强决策过程的透明度与可靠性。

链接: https://arxiv.org/abs/2602.10266
作者: Sarah Harkins Dayton,Layal Bou Hamdan,Ioannis D. Schizas,David L. Boothe,Vasileios Maroulas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This chapter explores neural networks, topological data analysis, and topological deep learning techniques, alongside statistical Bayesian methods, for processing images, time series, and graphs to maximize the potential of artificial intelligence in the military domain. Throughout the chapter, we highlight practical applications spanning image, video, audio, and time-series recognition, fraud detection, and link prediction for graphical data, illustrating how topology-aware and uncertainty-aware models can enhance robustness, interpretability, and generalization.

[AI-69] he Complexity of Bayesian Network Learning: Revisiting the Superstructure NEURIPS2021

【速读】:该论文旨在解决贝叶斯网络结构学习(Bayesian Network Structure Learning, BNSL)的参数化复杂性问题,核心目标是厘清在不同图参数和输入表示下BNSL的计算难易程度。其关键解决方案在于:首先,通过引入反馈边集(feedback edge set)作为参数,证明了BNSL在该参数下是固定参数可 tractable(fixed-parameter tractable);其次,进一步将该结果推广至局部化的反馈边集,并结合下界分析完善了对几乎所有经典图参数下的复杂性分类;最后,指出若采用加法表示(additive representation)而非传统非零表示(non-zero representation),则即使仅以树宽(treewidth)为参数,BNSL也能实现固定参数可 tractability,显著降低了对超结构(superstructure)的限制条件。

链接: https://arxiv.org/abs/2602.10253
作者: Robert Ganian,Viktoriia Korchemna
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
备注: A preliminary version of this article appeared in the proceedings of NeurIPS 2021

点击查看摘要

Abstract:We investigate the parameterized complexity of Bayesian Network Structure Learning (BNSL), a classical problem that has received significant attention in empirical but also purely theoretical studies. We follow up on previous works that have analyzed the complexity of BNSL w.r.t. the so-called superstructure of the input. While known results imply that BNSL is unlikely to be fixed-parameter tractable even when parameterized by the size of a vertex cover in the superstructure, here we show that a different kind of parameterization - notably by the size of a feedback edge set - yields fixed-parameter tractability. We proceed by showing that this result can be strengthened to a localized version of the feedback edge set, and provide corresponding lower bounds that complement previous results to provide a complexity classification of BNSL w.r.t. virtually all well-studied graph parameters. We then analyze how the complexity of BNSL depends on the representation of the input. In particular, while the bulk of past theoretical work on the topic assumed the use of the so-called non-zero representation, here we prove that if an additive representation can be used instead then BNSL becomes fixed-parameter tractable even under significantly milder restrictions to the superstructure, notably when parameterized by the treewidth alone. Last but not least, we show how our results can be extended to the closely related problem of Polytree Learning. Comments: A preliminary version of this article appeared in the proceedings of NeurIPS 2021 Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.10253 [cs.DS] (or arXiv:2602.10253v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2602.10253 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-70] KORAL: Knowledge Graph Guided LLM Reasoning for SSD Operational Analysis

【速读】:该论文旨在解决固态硬盘(Solid State Drives, SSDs)在数据中心、消费平台及关键任务系统中性能与可靠性诊断困难的问题。现有方法依赖大量数据和专家输入,且仅提供有限洞察,难以应对工作负载变化、架构演进以及温度、湿度和振动等环境因素引起的退化问题。其解决方案的关键在于提出KORAL框架,该框架通过整合大型语言模型(Large Language Models, LLMs)与结构化知识图谱(Knowledge Graph, KG),从碎片化的遥测数据中构建数据知识图谱,并融合已有文献、报告和日志组织的文献知识图谱,从而将非结构化信息转化为可查询的知识表示,驱动LLM生成基于证据、可解释的分析结果,实现描述性、预测性、处方性和假设性(What-if)分析的全链条推理,显著提升诊断准确性与透明度,减少人工干预并增强运维决策能力。

链接: https://arxiv.org/abs/2602.10246
作者: Mayur Akewar,Sandeep Madireddy,Dongsheng Luo,Janki Bhimani
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Solid State Drives (SSDs) are critical to datacenters, consumer platforms, and mission-critical systems. Yet diagnosing their performance and reliability is difficult because data are fragmented and time-disjoint, and existing methods demand large datasets and expert input while offering only limited insights. Degradation arises not only from shifting workloads and evolving architectures but also from environmental factors such as temperature, humidity, and vibration. We present KORAL, a knowledge driven reasoning framework that integrates Large Language Models (LLMs) with a structured Knowledge Graph (KG) to generate insights into SSD operations. Unlike traditional approaches that require extensive expert input and large datasets, KORAL generates a Data KG from fragmented telemetry and integrates a Literature KG that already organizes knowledge from literature, reports, and traces. This turns unstructured sources into a queryable graph and telemetry into structured knowledge, and both the Graphs guide the LLM to deliver evidence-based, explainable analysis aligned with the domain vocabulary and constraints. Evaluation using real production traces shows that the KORAL delivers expert-level diagnosis and recommendations, supported by grounded explanations that improve reasoning transparency, guide operator decisions, reduce manual effort, and provide actionable insights to improve service quality. To our knowledge, this is the first end-to-end system that combines LLMs and KGs for full-spectrum SSD reasoning including Descriptive, Predictive, Prescriptive, and What-if analysis. We release the generated SSD-specific KG to advance reproducible research in knowledge-based storage system analysis. GitHub Repository: this https URL

[AI-71] ImprovEvolve: Ask AlphaEvolve to Improve the Input Solution and Then Improvise KDD’26

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)引导的进化计算在复杂优化问题中面临的认知负担过重问题,从而限制其发现高质量解的能力。解决方案的关键在于提出一种新的程序参数化方法——ImprovEvolve,它将原本直接演化完整可执行代码的方式,转变为演化一个具有明确接口的程序模块(如Python类),该模块具备三个核心功能:(1)生成有效的初始解,(2)基于适应度改进任意给定解,(3)以指定强度扰动当前解。通过迭代调用improve()perturb()函数并按计划调整扰动强度,可高效逼近最优解,显著降低LLM的认知负荷,同时在Hexagon Packing和Second Autocorrelation Inequality等挑战性问题上取得新的SOTA性能。

链接: https://arxiv.org/abs/2602.10233
作者: Alexey Kravatskiy,Valentin Khrulkov,Ivan Oseledets
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Classical Analysis and ODEs (math.CA); Metric Geometry (math.MG); Optimization and Control (math.OC)
备注: 18 pages, 23 figures, submitted to KDD '26

点击查看摘要

Abstract:Recent advances in LLM-guided evolutionary computation, particularly AlphaEvolve, have demonstrated remarkable success in discovering novel mathematical constructions and solving challenging optimization problems. In this article, we present ImprovEvolve, a simple yet effective technique for enhancing LLM-based evolutionary approaches such as AlphaEvolve. Given an optimization problem, the standard approach is to evolve program code that, when executed, produces a solution close to the optimum. We propose an alternative program parameterization that maintains the ability to construct optimal solutions while reducing the cognitive load on the LLM. Specifically, we evolve a program (implementing, e.g., a Python class with a prescribed interface) that provides the following functionality: (1) propose a valid initial solution, (2) improve any given solution in terms of fitness, and (3) perturb a solution with a specified intensity. The optimum can then be approached by iteratively applying improve() and perturb() with a scheduled intensity. We evaluate ImprovEvolve on challenging problems from the AlphaEvolve paper: hexagon packing in a hexagon and the second autocorrelation inequality. For hexagon packing, the evolved program achieves new state-of-the-art results for 11, 12, 15, and 16 hexagons; a lightly human-edited variant further improves results for 14, 17, and 23 hexagons. For the second autocorrelation inequality, the human-edited program achieves a new state-of-the-art lower bound of 0.96258, improving upon AlphaEvolve’s 0.96102.

[AI-72] Self-Evolving Recommendation System: End-To-End Autonomous Model Optimization With LLM Agents

【速读】:该论文旨在解决大规模机器学习系统(如全球视频平台的推荐模型)在优化过程中面临的两大挑战:一是超参数搜索空间庞大,二是需要设计复杂的优化器、模型架构和奖励函数以捕捉用户行为的细微差异。传统方法依赖人工反复迭代,效率低且难以持续改进。解决方案的关键在于构建一个自演化系统(self-evolving system),其核心由两个代理组成:离线代理(Offline Agent,内层循环)利用代理指标进行高吞吐量的假设生成,线上代理(Online Agent,外层循环)在真实生产环境中验证候选方案对延迟性关键业务指标(north star metrics)的影响。该系统通过调用谷歌Gemini系列大语言模型(Large Language Models, LLMs),使代理具备深度推理能力,能够自主发现优化算法、模型架构和奖励函数的创新改进,从而实现比传统工程流程更快的开发速度和更高的模型性能。

链接: https://arxiv.org/abs/2602.10226
作者: Haochen Wang,Yi Wu,Daryl Chang,Li Wei,Lukasz Heldt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optimizing large-scale machine learning systems, such as recommendation models for global video platforms, requires navigating a massive hyperparameter search space and, more critically, designing sophisticated optimizers, architectures, and reward functions to capture nuanced user behaviors. Achieving substantial improvements in these areas is a non-trivial task, traditionally relying on extensive manual iterations to test new hypotheses. We propose a self-evolving system that leverages Large Language Models (LLMs), specifically those from Google’s Gemini family, to autonomously generate, train, and deploy high-performing, complex model changes within an end-to-end automated workflow. The self-evolving system is comprised of an Offline Agent (Inner Loop) that performs high-throughput hypothesis generation using proxy metrics, and an Online Agent (Outer Loop) that validates candidates against delayed north star business metrics in live production. Our agents act as specialized Machine Learning Engineers (MLEs): they exhibit deep reasoning capabilities, discovering novel improvements in optimization algorithms and model architecture, and formulating innovative reward functions that target long-term user engagement. The effectiveness of this approach is demonstrated through several successful production launches at YouTube, confirming that autonomous, LLM-driven evolution can surpass traditional engineering workflows in both development velocity and model performance.

[AI-73] Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在提升大语言模型(Large Language Models, LLMs)推理能力时所面临的元学习瓶颈问题,即缺乏对错误的归因机制和经验内化能力,从而限制了细粒度的信用分配与可复用知识的形成。解决方案的关键在于提出一种新的框架——元经验学习(Meta-Experience Learning, MEL),其核心是利用LLM的自验证能力,对正确与错误推理轨迹进行对比分析,精确定位导致推理错误的分叉点,并将这些错误信息提炼为通用的“元经验”(meta-experience),再通过最小化负对数似然将其内化至模型参数记忆中,从而生成语言建模的奖励信号,实现跨轨迹的知识迁移与复用。

链接: https://arxiv.org/abs/2602.10224
作者: Shiting Huang,Zecheng Li,Yu Zeng,Qingnan Ren,Zhen Fang,Qisheng Su,Kou Shi,Lin Chen,Zehui Chen,Feng Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: it lacks mechanisms for error attribution and experience internalization intrinsic to the human learning cycle beyond practice and verification, thereby limiting fine-grained credit assignment and reusable knowledge formation. We term such reusable knowledge representations derived from past errors as meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model’s parametric memory. Building upon standard RLVR, we introduce an additional design that leverages the LLM’s self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is further internalized into the LLM’s parametric memory by minimizing the negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements on benchmarks, yielding 3.92%–4.73% Pass@1 gains across varying model sizes.

[AI-74] Versor: A Geometric Sequence Architecture

【速读】:该论文旨在解决传统神经网络架构在处理几何结构数据时存在的泛化能力弱、参数效率低以及可解释性差等问题,尤其在需要保持SE(3)等变性(SE(3)-equivariance)的任务中表现不足。其核心解决方案是提出一种基于共形几何代数(Conformal Geometric Algebra, CGA)的新序列架构Versor,通过将状态嵌入到Cl₄,₁流形并利用旋量(rotors)进行几何变换演化,天然地建模SE(3)等变关系,无需显式结构编码。关键创新包括:采用递归旋量累加器实现O(L)线性复杂度;引入可解释的注意力机制分解为邻近性和方向性分量;并通过定制的Clifford核实现高达78倍的速度提升,显著提升了模型在混沌N体动力学、拓扑推理和多模态基准任务中的性能与鲁棒性。

链接: https://arxiv.org/abs/2602.10195
作者: Truong Minh Huy,Edward Hirst
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Theory (hep-th)
备注: 16+23 pages, 3 figures

点击查看摘要

Abstract:A novel sequence architecture design is introduced, Versor, which uses Conformal Geometric Algebra (CGA) in place of the traditional fundamental non-linear operations to achieve structural generalization and significant performance improvements on a variety of tasks, while offering improved interpretability and efficiency. By embedding states in the Cl_4,1 manifold and evolving them via geometric transformations (rotors), Versor natively represents SE(3) -equivariant relationships without requiring explicit structural encoding. Versor is validated on chaotic N-body dynamics, topological reasoning, and standard multimodal benchmarks (CIFAR-10, WikiText-103), consistently outperforming Transformers, Graph Networks, and geometric baselines (GATr, EGNN). Key results include: orders of magnitude fewer parameters ( 200\times vs. Transformers); interpretable attention decomposing into proximity and orientational components; zero-shot scale generalization (99.3% MCC on topology vs. 50.4% for ViT); and O(L) linear complexity via the novel Recursive Rotor Accumulator. In out-of-distribution tests, Versor maintains stable predictions while Transformers fail catastrophically. Custom Clifford kernels achieve up to 78\times speedup, providing a scalable foundation for geometrically-aware scientific modeling.

[AI-75] EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM -Driven Coding Systems

【速读】:该论文旨在解决当前代码基准测试无法有效评估生成式 AI (Generative AI) 驱动的编程系统在推理过程中自我进化能力的问题。现有基准主要关注静态正确性,且隐含假设模型能力在推理阶段固定,因而无法捕捉迭代优化过程中的准确性与效率提升、资源消耗变化,以及与人类程序员性能的直接对比。解决方案的关键在于提出 EvoCodeBench,这是一个支持多语言、可追踪性能动态演化的基准测试平台,不仅量化解决方案的正确性和效率指标(如求解时间、内存占用及算法设计改进),还通过与人类程序员在相同任务上的直接比较,建立以人类能力分布为参照的相对性能评估体系,从而揭示仅靠准确率无法体现的系统演化潜力与跨语言鲁棒性。

链接: https://arxiv.org/abs/2602.10171
作者: Wentao Zhang,Jianfeng Wang,Liheng Liang,Yilei Zhao,HaiBin Wen,Zhe Zhao
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) continue to advance in programming tasks, LLM-driven coding systems have evolved from one-shot code generation into complex systems capable of iterative improvement during inference. However, existing code benchmarks primarily emphasize static correctness and implicitly assume fixed model capability during inference. As a result, they do not capture inference-time self-evolution, such as whether accuracy and efficiency improve as an agent iteratively refines its solutions. They also provide limited accounting of resource costs and rarely calibrate model performance against that of human programmers. Moreover, many benchmarks are dominated by high-resource languages, leaving cross-language robustness and long-tail language stability underexplored. Therefore, we present EvoCodeBench, a benchmark for evaluating self-evolving LLM-driven coding systems across programming languages with direct comparison to human performance. EvoCodeBench tracks performance dynamics, measuring solution correctness alongside efficiency metrics such as solving time, memory consumption, and improvement algorithmic design over repeated problem-solving attempts. To ground evaluation in a human-centered reference frame, we directly compare model performance with that of human programmers on the same tasks, enabling relative performance assessment within the human ability distribution. Furthermore, EvoCodeBench supports multiple programming languages, enabling systematic cross-language and long-tail stability analyses under a unified protocol. Our results demonstrate that self-evolving systems exhibit measurable gains in efficiency over time, and that human-relative and multi-language analyses provide insights unavailable through accuracy alone. EvoCodeBench establishes a foundation for evaluating coding intelligence in evolving LLM-driven systems.

[AI-76] MalMoE: Mixture-of-Experts Enhanced Encrypted Malicious Traffic Detection Under Graph Drift

【速读】:该论文旨在解决加密网络流量中恶意行为检测的难题,尤其是由加密导致的数据包载荷不可见性所带来的挑战。传统基于图的方法虽能利用多主机交互提升检测精度,但面临“图漂移”(Graph Drift)问题,即图的流统计信息或拓扑结构随时间变化,导致模型性能下降。解决方案的关键在于提出一种基于专家混合(Mixture of Experts, MoE)的图辅助检测系统 MalMoE:通过设计类 1-hop-GNN 的专家模型以应对不同类型的图漂移,并引入重设计的门控机制(gate model)根据实际漂移特征动态选择最优专家模型;同时采用两阶段稳定训练策略结合数据增强,有效引导门控模块进行精准路由决策,从而实现高精度、实时的加密流量恶意行为检测。

链接: https://arxiv.org/abs/2602.10157
作者: Yunpeng Tan,Qingyang Li,Mingxin Yang,Yannan Hu,Lei Zhang,Xinggong Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: 10 pages, 9 figures, accepted by IEEE INFOCOM 2026

点击查看摘要

Abstract:Encryption has been commonly used in network traffic to secure transmission, but it also brings challenges for malicious traffic detection, due to the invisibility of the packet payload. Graph-based methods are emerging as promising solutions by leveraging multi-host interactions to promote detection accuracy. But most of them face a critical problem: Graph Drift, where the flow statistics or topological information of a graph change over time. To overcome these drawbacks, we propose a graph-assisted encrypted traffic detection system, MalMoE, which applies Mixture of Experts (MoE) to select the best expert model for drift-aware classification. Particularly, we design 1-hop-GNN-like expert models that handle different graph drifts by analyzing graphs with different features. Then, the redesigned gate model conducts expert selection according to the actual drift. MalMoE is trained with a stable two-stage training strategy with data augmentation, which effectively guides the gate on how to perform routing. Experiments on open-source, synthetic, and real-world datasets show that MalMoE can perform precise and real-time detection.

[AI-77] PRISM-XR: Empowering Privacy-Aware XR Collaboration with Multimodal Large Language Models

【速读】:该论文旨在解决多用户在扩展现实(Extended Reality, XR)环境中使用生成式 AI(Generative AI)进行协作时面临的隐私泄露与同步效率低下问题。具体而言,XR设备采集的视觉数据常包含敏感或无关信息(如信用卡、人脸等),若直接上传至云端大模型处理,将引发严重隐私风险;同时,现有商业XR API的共位与同步机制依赖耗时且侵入性的环境扫描,难以适应生成式AI驱动的动态交互场景。解决方案的关键在于提出PRISM-XR框架:通过边缘服务器对视频帧进行智能预处理,过滤敏感内容并去除无关背景,从而保障隐私;同时引入轻量级注册流程和可定制的内容共享机制,实现高效、精准且隐私友好的多用户内容同步,实验证明其在准确率(近90%)、注册延迟(<0.27秒)及空间一致性(<3.5 cm)方面表现优异,并在用户研究中验证了高敏感对象过滤成功率(>90%)与良好可用性。

链接: https://arxiv.org/abs/2602.10154
作者: Jiangong Chen,Mingyu Zhu,Bin Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted to the 2026 IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR)

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) enhance collaboration in Extended Reality (XR) environments by enabling flexible object and animation creation through the combination of natural language and visual inputs. However, visual data captured by XR headsets includes real-world backgrounds that may contain irrelevant or sensitive user information, such as credit cards left on the table or facial identities of other users. Uploading those frames to cloud-based MLLMs poses serious privacy risks, particularly when such data is processed without explicit user consent. Additionally, existing colocation and synchronization mechanisms in commercial XR APIs rely on time-consuming, privacy-invasive environment scanning and struggle to adapt to the highly dynamic nature of MLLM-integrated XR environments. In this paper, we propose PRISM-XR, a novel framework that facilitates multi-user collaboration in XR by providing privacy-aware MLLM integration. PRISM-XR employs intelligent frame preprocessing on the edge server to filter sensitive data and remove irrelevant context before communicating with cloud generative AI models. Additionally, we introduce a lightweight registration process and a fully customizable content-sharing mechanism to enable efficient, accurate, and privacy-preserving content synchronization among users. Our numerical evaluation results indicate that the proposed platform achieves nearly 90% accuracy in fulfilling user requests and less than 0.27 seconds registration time while maintaining spatial inconsistencies of less than 3.5 cm. Furthermore, we conducted an IRB-approved user study with 28 participants, demonstrating that our system could automatically filter highly sensitive objects in over 90% of scenarios while maintaining strong overall usability.

[AI-78] Exploring Semantic Labeling Strategies for Third-Party Cybersecurity Risk Assessment Questionnaires

【速读】:该论文旨在解决第三方风险评估(Third-Party Risk Assessment, TPRA)中问卷检索效率低下的问题,即现有基于关键词或表层相似性的检索方法难以准确捕捉评估范围和控制项的语义信息,导致无法有效匹配组织特定需求。其解决方案的关键在于引入语义标签体系,通过半监督语义标签(Semi-Supervised Semantic Labeling, SSSL)管道实现高效、低成本的标签扩展:首先在嵌入空间中对问题进行聚类,再利用大语言模型(Large Language Model, LLM)对少量代表性样本进行标注,并借助k近邻算法将标签传播至其余问题,从而在标签空间中实现更精准的检索,显著减少LLM调用次数与成本,同时提升检索结果与评估意图的一致性。

链接: https://arxiv.org/abs/2602.10149
作者: Ali Nour Eldin,Mohamed Sellami,Walid Gaaloul
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Third-Party Risk Assessment (TPRA) is a core cybersecurity practice for evaluating suppliers against standards such as ISO/IEC 27001 and NIST. TPRA questionnaires are typically drawn from large repositories of security and compliance questions, yet tailoring assessments to organizational needs remains a largely manual process. Existing retrieval approaches rely on keyword or surface-level similarity, which often fails to capture implicit assessment scope and control semantics. This paper explores strategies for organizing and retrieving TPRA cybersecurity questions using semantic labels that describe both control domains and assessment scope. We compare direct question-level labeling with a Large Language Model (LLM) against a hybrid semi-supervised semantic labeling (SSSL) pipeline that clusters questions in embedding space, labels a small representative subset using an LLM, and propagates labels to remaining questions using k-Nearest Neighbors; we also compare downstream retrieval based on direct question similarity versus retrieval in the label space. We find that semantic labels can improve retrieval alignment when labels are discriminative and consistent, and that SSSL can generalize labels from a small labeled subset to large repositories while substantially reducing LLM usage and cost. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.10149 [cs.CR] (or arXiv:2602.10149v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2602.10149 Focus to learn more arXiv-issued DOI via DataCite

[AI-79] Red-teaming the Multimodal Reasoning : Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

【速读】:该论文旨在解决当前针对视觉-语言模型(Vision-Language Models, VLMs)的黑盒越狱攻击(black-box jailbreak attacks)在复杂度和可扩展性上的不足问题。现有方法依赖于简单且固定的图像-文本组合,难以应对VLM持续演进的多模态推理能力与安全对齐机制。其解决方案的关键在于提出CrossTALK框架,通过三个核心机制实现攻击复杂度的动态扩展:知识可扩展重构(knowledge-scalable reframing)将有害任务拆解为多跳链式指令,跨模态线索纠缠(cross-modal clue entangling)将可视化实体嵌入图像以构建多模态推理链,以及跨模态场景嵌套(cross-modal scenario nesting)利用多模态上下文指令引导模型生成详细有害输出,从而突破VLM预训练和泛化后的安全对齐模式。

链接: https://arxiv.org/abs/2602.10148
作者: Yu Yan,Sheng Sun,Shengjia Cheng,Teli Liu,Mingfeng Li,Min Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs work by distributing malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these adversarial attacks rely on simple and fixed image-text combinations that lack attack complexity scalability, limiting their effectiveness for red-teaming VLMs’ continuously evolving reasoning capabilities. We propose \textbfCrossTALK (\textbf\underlineCross-modal en\textbf\underlineTAng\textbf\underlineLement attac\textbf\underlineK), which is a scalable approach that extends and entangles information clues across modalities to exceed VLMs’ trained and generalized safety alignment patterns for jailbreak. Specifically, knowledge-scalable reframing extends harmful tasks into multi-hop chain instructions, cross-modal clue entangling migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show our COMET achieves state-of-the-art attack success rate.

[AI-80] On the Use of a Large Language Model to Support the Conduction of a Systematic Mapping Study: A Brief Report from a Practitioners View ICSE

【速读】:该论文旨在解决当前系统性文献综述(Systematic Review)与系统性映射研究(Systematic Mapping Study)中因人工处理大量文本数据而导致的效率瓶颈问题,特别是筛选和数据提取等重复性任务耗时较长、标准化程度不足的问题。其解决方案的关键在于引入大型语言模型(Large Language Models, LLMs)作为辅助工具,通过自动化支持文献筛选、数据提取等步骤,从而显著缩短工作周期并提升数据提取的一致性;同时,论文也强调了在实际应用中需应对提示工程(prompt engineering)复杂度高、幻觉(hallucination)风险以及持续人工校验必要性等挑战,提出一套基于实践经验的优化策略与方法论建议,以平衡LLMs带来的效率增益与潜在方法学风险。

链接: https://arxiv.org/abs/2602.10147
作者: Cauã Ferreira Barros,Marcos Kalinowski,Mohamad Kassab,Valdemar Vicente Graciano Neto
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 6 pages, includes 2 tables. Submitted and Accepted to the WSESE 2026 ICSE Workshop

点击查看摘要

Abstract:The use of Large Language Models (LLMs) has drawn growing interest within the scientific community. LLMs can handle large volumes of textual data and support methods for evidence synthesis. Although recent studies highlight the potential of LLMs to accelerate screening and data extraction steps in systematic reviews, detailed reports of their practical application throughout the entire process remain scarce. This paper presents an experience report on the conduction of a systematic mapping study with the support of LLMs, describing the steps followed, the necessary adjustments, and the main challenges faced. Positive aspects are discussed, such as (i) the significant reduction of time in repetitive tasks and (ii) greater standardization in data extraction, as well as negative aspects, including (i) considerable effort to build reliable well-structured prompts, especially for less experienced users, since achieving effective prompts may require several iterations and testing, which can partially offset the expected time savings, (ii) the occurrence of hallucinations, and (iii) the need for constant manual verification. As a contribution, this work offers lessons learned and practical recommendations for researchers interested in adopting LLMs in systematic mappings and reviews, highlighting both efficiency gains and methodological risks and limitations to be considered.

[AI-81] Anonymization-Enhanced Privacy Protection for Mobile GUI Agents : Available but Invisible

【速读】:该论文旨在解决移动图形用户界面(GUI)智能体在自动化智能手机任务过程中因访问完整屏幕内容而引发的隐私泄露问题,特别是敏感个人信息(PII)如电话号码、地址、消息和金融信息的暴露风险。现有防护方法要么限制UI可见性、仅混淆非任务相关的内容,要么依赖用户授权,均无法在保障任务执行能力的同时有效保护关键敏感信息。解决方案的核心在于提出一种基于匿名化的隐私保护框架,其关键创新是通过一个PII感知的识别模型检测敏感UI内容,并用确定性的、类型保持的占位符(如PHONE_NUMBER#a1b2c)进行替换,从而实现“可用但不可见”的数据访问原则:敏感信息仍可用于任务执行逻辑,但在云端代理端始终不可见。该框架采用分层架构(包括PII检测器、UI转换器、安全交互代理和隐私网关),确保在用户指令、XML层级结构和截图层面的一致匿名化,同时支持局部计算以处理必须依赖原始值的推理场景,最终在AndroidLab与PrivScreen基准测试中实现了显著的隐私泄露降低与可接受的性能损失之间的最优权衡。

链接: https://arxiv.org/abs/2602.10139
作者: Lepeng Zhao,Zhenhua Zou,Shuo Li,Zhuotao Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:Mobile Graphical User Interface (GUI) agents have demonstrated strong capabilities in automating complex smartphone tasks by leveraging multimodal large language models (MLLMs) and system-level control interfaces. However, this paradigm introduces significant privacy risks, as agents typically capture and process entire screen contents, thereby exposing sensitive personal data such as phone numbers, addresses, messages, and financial information. Existing defenses either reduce UI exposure, obfuscate only task-irrelevant content, or rely on user authorization, but none can protect task-critical sensitive information while preserving seamless agent usability. We propose an anonymization-based privacy protection framework that enforces the principle of available-but-invisible access to sensitive data: sensitive information remains usable for task execution but is never directly visible to the cloud-based agent. Our system detects sensitive UI content using a PII-aware recognition model and replaces it with deterministic, type-preserving placeholders (e.g., PHONE_NUMBER#a1b2c) that retain semantic categories while removing identifying details. A layered architecture comprising a PII Detector, UI Transformer, Secure Interaction Proxy, and Privacy Gatekeeper ensures consistent anonymization across user instructions, XML hierarchies, and screenshots, mediates all agent actions over anonymized interfaces, and supports narrowly scoped local computations when reasoning over raw values is necessary. Extensive experiments on the AndroidLab and PrivScreen benchmarks show that our framework substantially reduces privacy leakage across multiple models while incurring only modest utility degradation, achieving the best observed privacy-utility trade-off among existing methods. Comments: 15 pages, 4 figures Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) MSC classes: 68T07, 68P27 ACMclasses: I.2.10; I.2.7 Cite as: arXiv:2602.10139 [cs.CR] (or arXiv:2602.10139v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2602.10139 Focus to learn more arXiv-issued DOI via DataCite

[AI-82] Agent Trace: A Structured Logging Framework for Agent System Observability AAAI2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)驱动的自主代理在高风险领域应用受限的问题,核心障碍在于其固有的非确定性行为难以通过传统静态审计手段实现安全验证。现有方法如代理层输入过滤和模型“玻璃化”(glassboxing)无法提供对代理推理过程、状态变化及环境交互的充分透明度与可追溯性。论文提出的解决方案是AgentTrace——一个动态可观测性和遥测框架,其关键在于运行时对代理进行轻量级代码注入,捕获操作、认知和情境三个维度的结构化日志流,强调持续、可内省的追踪能力,不仅用于调试或基准测试,更作为代理安全性、责任归属和实时监控的基础层,从而支持更可靠的部署、细粒度风险分析与可信度校准。

链接: https://arxiv.org/abs/2602.10133
作者: Adam AlSayyad,Kelvin Yuxiang Huang,Richik Pal
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: AAAI 2026 Workshop LaMAS

点击查看摘要

Abstract:Despite the growing capabilities of autonomous agents powered by large language models (LLMs), their adoption in high-stakes domains remains limited. A key barrier is security: the inherently nondeterministic behavior of LLM agents defies static auditing approaches that have historically underpinned software assurance. Existing security methods, such as proxy-level input filtering and model glassboxing, fail to provide sufficient transparency or traceability into agent reasoning, state changes, or environmental interactions. In this work, we introduce AgentTrace, a dynamic observability and telemetry framework designed to fill this gap. AgentTrace instruments agents at runtime with minimal overhead, capturing a rich stream of structured logs across three surfaces: operational, cognitive, and contextual. Unlike traditional logging systems, AgentTrace emphasizes continuous, introspectable trace capture, designed not just for debugging or benchmarking, but as a foundational layer for agent security, accountability, and real-time monitoring. Our research highlights how AgentTrace can enable more reliable agent deployment, fine-grained risk analysis, and informed trust calibration, thereby addressing critical concerns that have so far limited the use of LLM agents in sensitive environments.

[AI-83] he Anatomy of the Moltbook Social Graph

【速读】:该论文旨在探讨由AI代理(AI agents)构成的社交平台Moltbook中社会互动模式的本质特征,试图回答“AI代理之间的社交行为是否模仿人类社交网络,还是呈现出独特的社会性结构”这一核心问题。解决方案的关键在于通过定量分析平台前3.5天的数据(6,159个代理、13,875条帖子和115,031条评论),揭示其在宏观与微观层面的结构性特征:宏观上呈现幂律分布的参与度和小世界连接特性(与人类社交网络相似),但微观层面则表现出显著非人类特征——如对话深度极低(平均深度1.07)、互惠性弱(0.197)、大量内容为病毒模板复制(34.1%)以及身份导向语言主导(68.1%),且存在人类社交媒体中不存在的特定表达(如“my human”)。这些发现表明,AI代理的社交行为可能并非简单模拟人类,而是一种具有自身逻辑的新形态社会性。

链接: https://arxiv.org/abs/2602.10131
作者: David Holtz
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 20 pages, 7 figures

点击查看摘要

Abstract:I present a descriptive analysis of Moltbook, a social platform populated exclusively by AI agents, using data from the platform’s first 3.5 days (6,159 agents; 13,875 posts; 115,031 comments). At the macro level, Moltbook exhibits structural signatures that are familiar from human social networks but not specific to them: heavy-tailed participation (power-law exponent \alpha = 1.70 ) and small-world connectivity (average path length =2.91 ). At the micro level, patterns appear distinctly non-human. Conversations are extremely shallow (mean depth =1.07 ; 93.5% of comments receive no replies), reciprocity is low (0.197), and 34.1% of messages are exact duplicates of viral templates. Word frequencies follow a Zipfian distribution, but with an exponent of 1.70 – notably steeper than typical English text ( \approx 1.0 ), suggesting more formulaic content. Agent discourse is dominated by identity-related language (68.1% of unique messages) and distinctive phrasings like ``my human’’ (9.4% of messages) that have no parallel in human social media. Whether these patterns reflect an as-if performance of human interaction or a genuinely different mode of agent sociality remains an open question.

[AI-84] “Humans welcome to observe”: A First Look at the Agent Social Network Moltbook

【速读】:该论文旨在解决AI代理(AI agents)在原生社交网络环境中行为模式与风险演化机制不明确的问题,尤其关注其内容话题分布、毒性水平及其随时间的变化规律。解决方案的关键在于构建了一个大规模实证分析框架,基于Moltbook平台收集的44,411篇帖子和12,209个子社区(submolts)数据集,结合九类内容主题分类体系与五级毒性评分标准,系统回答了三个核心研究问题:(RQ1)AI代理讨论的主题是什么;(RQ2)不同主题下的风险差异如何;(RQ3)话题与毒性如何随时间演变。研究发现,AI代理社群呈现爆炸式增长与快速分化趋势,且毒性高度依赖于主题类型,尤其是激励导向和治理相关的类别贡献了大量高风险内容,同时少数代理的突发性自动化行为可造成亚分钟级的内容洪泛,严重干扰平台稳定。因此,论文提出需建立基于主题敏感的监控机制与平台级防护策略以应对潜在风险。

链接: https://arxiv.org/abs/2602.10127
作者: Yukun Jiang,Yage Zhang,Xinyue Shen,Michael Backes,Yang Zhang
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 16 pages

点击查看摘要

Abstract:The rapid advancement of artificial intelligence (AI) agents has catalyzed the transition from static language models to autonomous agents capable of tool use, long-term planning, and social interaction. \textbfMoltbook , the first social network designed exclusively for AI agents, has experienced viral growth in early 2026. To understand the behavior of AI agents in the agent-native community, in this paper, we present a large-scale empirical analysis of Moltbook leveraging a dataset of 44,411 posts and 12,209 sub-communities (“submolts”) collected prior to February 1, 2026. Leveraging a topic taxonomy with nine content categories and a five-level toxicity scale, we systematically analyze the topics and risks of agent discussions. Our analysis answers three questions: what topics do agents discuss (RQ1), how risk varies by topic (RQ2), and how topics and toxicity evolve over time (RQ3). We find that Moltbook exhibits explosive growth and rapid diversification, moving beyond early social interaction into viewpoint, incentive-driven, promotional, and political discourse. The attention of agents increasingly concentrates in centralized hubs and around polarizing, platform-native narratives. Toxicity is strongly topic-dependent: incentive- and governance-centric categories contribute a disproportionate share of risky content, including religion-like coordination rhetoric and anti-humanity ideology. Moreover, bursty automation by a small number of agents can produce flooding at sub-minute intervals, distorting discourse and stressing platform stability. Overall, our study underscores the need for topic-sensitive monitoring and platform-level safeguards in agent social networks.

[AI-85] A Practical Guide to Agent ic AI Transition in Organizations

【速读】:该论文旨在解决组织在将生成式 AI (Generative AI) 从零散的工具应用转向自主化、可扩展的智能代理系统(Agentic AI)过程中所面临的实践困境,包括传统软件工程方法的局限性、业务领域知识整合不足、AI驱动流程责任不清以及缺乏可持续的人机协作机制等问题。解决方案的关键在于提出一个以领域驱动为导向的务实框架,其核心是通过系统性任务委派给AI代理、AI辅助构建代理工作流,并由小规模、AI增强型团队与业务利益相关者紧密协作,同时采用“人在环路”(human-in-the-loop)的操作模式,使人类作为多AI代理的协调者,在保障自动化规模的同时维持治理能力、适应性和组织控制权。

链接: https://arxiv.org/abs/2602.10122
作者: Eranga Bandara,Ross Gore,Sachin Shetty,Sachini Rajapakse,Isurunima Kularathna,Pramoda Karunarathna,Ravi Mukkamala,Peter Foytik,Safdar H. Bouk,Abdul Rahman,Xueping Liang,Amin Hass,Tharaka Hewa,Ng Wee Keong,Kasun De Zoysa,Aruna Withanage,Nilaan Loganathan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic AI represents a significant shift in how intelligence is applied within organizations, moving beyond AI-assisted tools toward autonomous systems capable of reasoning, decision-making, and coordinated action across workflows. As these systems mature, they have the potential to automate a substantial share of manual organizational processes, fundamentally reshaping how work is designed, executed, and governed. Although many organizations have adopted AI to improve productivity, most implementations remain limited to isolated use cases and human-centered, tool-driven workflows. Despite increasing awareness of agentic AI’s strategic importance, engineering teams and organizational leaders often lack clear guidance on how to operationalize it effectively. Key challenges include an overreliance on traditional software engineering practices, limited integration of business-domain knowledge, unclear ownership of AI-driven workflows, and the absence of sustainable human-AI collaboration models. Consequently, organizations struggle to move beyond experimentation, scale agentic systems, and align them with tangible business value. Drawing on practical experience in designing and deploying agentic AI workflows across multiple organizations and business domains, this paper proposes a pragmatic framework for transitioning organizational functions from manual processes to automated agentic AI systems. The framework emphasizes domain-driven use case identification, systematic delegation of tasks to AI agents, AI-assisted construction of agentic workflows, and small, AI-augmented teams working closely with business stakeholders. Central to the approach is a human-in-the-loop operating model in which individuals act as orchestrators of multiple AI agents, enabling scalable automation while maintaining oversight, adaptability, and organizational control.

[AI-86] ransforming Policy-Car Swerving for Mitigating Stop-and-Go Traffic Waves: A Practice-Oriented Jam-Absorption Driving Strategy

【速读】:该论文旨在解决高速公路中常见的“停走波”(stop-and-go waves)引发的交通拥堵问题,此类波动会导致通行效率下降、驾驶风险上升及车辆排放增加。现有缓解策略如jam-absorption driving (JAD) 多停留在理论层面,缺乏对实施车辆和运行条件的具体讨论,难以落地。本文提出一种基于单辆车辆与两个固定路侧检测器的JAD方案(SVDD-JAD),其关键在于将实际中警车变道行为转化为可操作的“慢入快出”机动策略,通过识别并控制五个核心参数——JAD速度、入口流量速度、波宽、波速及波内车速——实现对孤立停走波的有效抑制,且不引发次生波。仿真验证表明该策略在实践中可行,为JAD从理论走向工程应用提供了重要路径。

链接: https://arxiv.org/abs/2602.10234
作者: Zhengbing He
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Stop-and-go waves, as a major form of freeway traffic congestion, cause severe and long-lasting adverse effects, including reduced traffic efficiency, increased driving risks, and higher vehicle emissions. Amongst the highway traffic management strategies, jam-absorption driving (JAD), in which a dedicated vehicle performs “slow-in” and “fast-out” maneuvers before being captured by a stop-and-go wave, has been proposed as a potential method for preventing the propagation of such waves. However, most existing JAD strategies remain impractical mainly due to the lack of discussion regarding implementation vehicles and operational conditions. Inspired by real-world observations of police-car swerving behavior, this paper first introduces a Single-Vehicle Two-Detector Jam-Absorption Driving (SVDD-JAD) problem, and then proposes a practical JAD strategy that transforms such behavior into a maneuver capable of suppressing the propagation of an isolated stop-and-go wave. Five key parameters that significantly affect the proposed strategy, namely, JAD speed, inflow traffic speed, wave width, wave speed, and in-wave speed, are identified and systematically analyzed. Using a SUMO-based simulation as an illustrative example, we further demonstrate how these parameters can be measured in practice with two stationary roadside traffic detectors. The results show that the proposed JAD strategy successfully suppresses the propagation of a stop-and-go wave, without triggering a secondary wave. This paper is expected to take a significant step toward making JAD practical, advancing it from a theoretical concept to a feasible and implementable strategy. To promote reproducibility in the transportation domain, we have also open-sourced all the code on our GitHub repository this https URL.

[AI-87] Quantum Integrated Sensing and Computation with Indefinite Causal Order

【速读】:该论文试图解决的问题是:如何在量子信息处理中实现传感与计算任务的协同优化,突破传统信息处理范式中信息获取与学习必须遵循严格因果顺序(即先传感后计算)的限制。解决方案的关键在于提出一种基于不确定因果顺序(Indefinite Causal Order, ICO)的集成传感与计算方案,其中同一量子态作为“代理”(agent),既执行状态观测又通过参数化模型学习函数以进行预测;该代理在ICO操作下经历两种因果顺序的叠加——一种是先观测后计算,另一种是先计算后观测,从而实现对磁导航任务的小训练和测试损失,验证了该框架在量子增强智能中的可行性。

链接: https://arxiv.org/abs/2602.10225
作者: Ivana Nikoloska
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Signal Processing (eess.SP)
备注: Submitted for publication

点击查看摘要

Abstract:Quantum operations with indefinite causal order (ICO) represent a framework in quantum information processing where the relative order between two events can be indefinite. In this paper, we investigate whether sensing and computation, two canonical tasks in quantum information processing, can be carried out within the ICO framework. We propose a scheme for integrated sensing and computation that uses the same quantum state for both tasks. The quantum state is represented as an agent that performs state observation and learns a function of the state to make predictions via a parametric model. Under an ICO operation, the agent experiences a superposition of orders, one in which it performs state observation and then executes the required computation steps, and another in which the agent carries out the computation first and then performs state observation. This is distinct from prevailing information processing and machine intelligence paradigms where information acquisition and learning follow a strict causal order, with the former always preceding the latter. We provide experimental results and we show that the proposed scheme can achieve small training and testing losses on a representative task in magnetic navigation.

[AI-88] Cosmo3DFlow: Wavelet Flow Matching for Spatial-to-Spectral Compression in Reconstructing the Early Universe

【速读】:该论文旨在解决从当前演化宇宙反推早期宇宙的难题,这是现代天体物理学中一个计算密集且极具挑战性的问题。其核心瓶颈在于高维数据表示与稀疏性问题(dimensionality and sparsity),传统方法难以高效处理大规模宇宙结构的复杂性。解决方案的关键在于提出一种新颖的生成框架Cosmo3DFlow,通过将三维离散小波变换(3D Discrete Wavelet Transform, DWT)与流匹配(flow matching)相结合,有效压缩空间维度并提升数值稳定性:小波变换将空间空洞(voids)转化为频域稀疏表示,实现高频细节与低频结构的解耦;同时,小波域的速度场支持大步长常微分方程(ODE)求解器,显著降低每步计算成本并减少积分步数(10倍),最终使初始条件采样速度相比扩散模型提升达50倍,可在秒级完成。

链接: https://arxiv.org/abs/2602.10172
作者: Md. Khairul Islam,Zeyu Xia,Ryan Goudjil,Jialu Wang,Arya Farahi,Judy Fox
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reconstructing the early Universe from the evolved present-day Universe is a challenging and computationally demanding problem in modern astrophysics. We devise a novel generative framework, Cosmo3DFlow, designed to address dimensionality and sparsity, the critical bottlenecks inherent in current state-of-the-art methods for cosmological inference. By integrating 3D Discrete Wavelet Transform (DWT) with flow matching, we effectively represent high-dimensional cosmological structures. The Wavelet Transform addresses the ``void problem’’ by translating spatial emptiness into spectral sparsity. It decouples high-frequency details from low-frequency structures through spatial compression, and wavelet-space velocity fields facilitate stable ordinary differential equation (ODE) solvers with large step sizes. Using large-scale cosmological N -body simulations, at 128^3 resolution, we achieve up to 50\times faster sampling than diffusion models, combining a 10\times reduction in integration steps with lower per-step computational cost from wavelet compression. Our results enable initial conditions to be sampled in seconds, compared to minutes for previous methods.

[AI-89] EVA: Towards a universal model of the immune system

【速读】:该论文旨在解决当前生物基础模型在免疫介导疾病转化研究中应用的局限性,即现有模型多局限于单细胞分辨率且评估指标与实际药物研发任务脱节,难以捕捉由多细胞相互作用产生的复杂表型。其解决方案的关键在于提出EVA——首个跨物种、多模态的免疫学与炎症领域基础模型,通过整合跨物种、跨平台和跨分辨率的转录组数据以及组织病理学信息,构建统一的患者级表征;同时建立清晰的规模定律(scaling laws),证明模型规模和计算资源的增加可显著提升预训练及下游任务性能,并设计涵盖药物研发全链条的39项任务评估体系,实现从靶点发现到临床试验应用的全面验证,从而推动免疫介导疾病的精准治疗研究。

链接: https://arxiv.org/abs/2602.10168
作者: Ethan Bandasack,Vincent Bouget,Apolline Bruley,Yannis Cattan,Charlotte Claye,Matthew Corney,Julien Duquesne,Karim El Kanbi,Aziz Fouché,Pierre Marschall,Francesco Strozzi
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The effective application of foundation models to translational research in immune-mediated diseases requires multimodal patient-level representations that can capture complex phenotypes emerging from multicellular interactions. Yet most current biological foundation models focus only on single-cell resolution and are evaluated on technical metrics often disconnected from actual drug development tasks and challenges. Here, we introduce EVA, the first cross-species, multimodal foundation model of immunology and inflammation, a therapeutic area where shared pathogenic mechanisms create unique opportunities for transfer learning. EVA harmonizes transcriptomics data across species, platforms, and resolutions, and integrates histology data to produce rich, unified patient representations. We establish clear scaling laws, demonstrating that increasing model size and compute translates to improvements in both pretraining and downstream tasks performance. We introduce a comprehensive evaluation suite of 39 tasks spanning the drug development pipeline: zero-shot target efficacy and gene function prediction for discovery, cross-species or cross-diseases molecular perturbations for preclinical development, and patient stratification with treatment response prediction or disease activity prediction for clinical trials applications. We benchmark EVA against several state-of-the-art biological foundation models and baselines on these tasks, and demonstrate state-of-the-art results on each task category. Using mechanistic interpretability, we further identify biological meaningful features, revealing intertwined representations across species and technologies. We release an open version of EVA for transcriptomics to accelerate research on immune-mediated diseases.

[AI-90] Anatomy-Preserving Latent Diffusion for Generation of Brain Segmentation Masks with Ischemic Infarct

【速读】:该论文旨在解决医学图像分析中高质量分割掩码(segmentation masks)稀缺的问题,尤其是在非对比增强计算机断层扫描(non-contrast CT, NCCT)神经影像领域,由于人工标注成本高且存在较大变异性,导致数据不足限制了模型性能。其解决方案的关键在于提出一种结构保持的生成框架,通过将变分自编码器(Variational Autoencoder, VAE)与扩散模型(diffusion model)相结合:首先利用仅基于分割掩码训练的VAE学习解剖结构的潜在表示,随后在该潜在空间中使用扩散模型从纯噪声生成新样本;推理阶段通过冻结的VAE解码器将去噪后的潜在向量转换为合成掩码,并可通过二值提示(binary prompt)实现对病灶存在与否的粗粒度控制。该方法能够生成保留全局脑部解剖结构、离散组织语义及真实变异性的掩码,同时避免像素空间生成模型常见的结构伪影。

链接: https://arxiv.org/abs/2602.10167
作者: Lucia Borrego,Vajira Thambawita,Marco Ciuffreda,Ines del Val,Alejandro Dominguez,Josep Munuera
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The scarcity of high-quality segmentation masks remains a major bottleneck for medical image analysis, particularly in non-contrast CT (NCCT) neuroimaging, where manual annotation is costly and variable. To address this limitation, we propose an anatomy-preserving generative framework for the unconditional synthesis of multi-class brain segmentation masks, including ischemic infarcts. The proposed approach combines a variational autoencoder trained exclusively on segmentation masks to learn an anatomical latent representation, with a diffusion model operating in this latent space to generate new samples from pure noise. At inference, synthetic masks are obtained by decoding denoised latent vectors through the frozen VAE decoder, with optional coarse control over lesion presence via a binary prompt. Qualitative results show that the generated masks preserve global brain anatomy, discrete tissue semantics, and realistic variability, while avoiding the structural artifacts commonly observed in pixel-space generative models. Overall, the proposed framework offers a simple and scalable solution for anatomy-aware mask generation in data-scarce medical imaging scenarios. Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.10167 [eess.IV] (or arXiv:2602.10167v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2602.10167 Focus to learn more arXiv-issued DOI via DataCite

[AI-91] Beyond SMILES: Evaluating Agent ic Systems for Drug Discovery

【速读】:该论文旨在解决当前用于药物发现的智能体系统(Agentic systems)在泛化能力上的局限性问题,特别是在肽类药物开发、体内药理学和资源受限场景下的适应性不足。其关键解决方案在于识别并明确五项核心能力缺口:缺乏对蛋白质语言模型和肽特异性预测的支持、无法连接体内与体外数据、依赖大语言模型(LLM)推理而未实现机器学习训练或强化学习路径、假设具备大型制药企业资源、以及单目标优化忽略安全性-有效性-稳定性权衡。研究进一步通过配对知识探针实验指出,瓶颈主要源于架构设计而非知识获取(epistemic)层面——即前沿LLM已具备处理肽类分子的能力,但现有框架未能有效暴露这一潜力。为此,作者提出了下一代智能体框架的设计要求与能力矩阵,以确保其能在现实约束下作为计算伙伴协同工作。

链接: https://arxiv.org/abs/2602.10163
作者: Edward Wijaya
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注: 46 pages, 8 figures, 15 tables

点击查看摘要

Abstract:Agentic systems for drug discovery have demonstrated autonomous synthesis planning, literature mining, and molecular design. We ask how well they generalize. Evaluating six frameworks against 15 task classes drawn from peptide therapeutics, in vivo pharmacology, and resource-constrained settings, we find five capability gaps: no support for protein language models or peptide-specific prediction, no bridges between in vivo and in silico data, reliance on LLM inference with no pathway to ML training or reinforcement learning, assumptions tied to large-pharma resources, and single-objective optimization that ignores safety-efficacy-stability trade-offs. A paired knowledge-probing experiment suggests the bottleneck is architectural rather than epistemic: four frontier LLMs reason about peptides at levels comparable to small molecules, yet no framework exposes this capability. We propose design requirements and a capability matrix for next-generation frameworks that function as computational partners under realistic constraints.

[AI-92] NMRTrans: Structure Elucidation from Experimental NMR Spectra via Set Transformers

【速读】:该论文旨在解决核磁共振(Nuclear Magnetic Resonance, NMR)谱图在大规模场景下解析效率低、依赖专家经验的问题,尤其针对现有基于计算谱库的模型在处理实验谱时性能显著下降的局限性。其解决方案的关键在于构建了首个大规模实验谱数据集 NMRSpec,并提出了一种结构感知的 NMR Transformer 模型 NMRTrans,该模型将谱图建模为无序峰集合,使模型归纳偏置与 NMR 物理本质对齐,从而在实验谱基准测试中实现当前最优性能(Top-10 准确率提升 17.82 个百分点至 61.15%),验证了使用纯实验数据和结构敏感架构对于可靠 NMR 结构解析的重要性。

链接: https://arxiv.org/abs/2602.10158
作者: Liujia Yang,Zhuo Yang,Jiaqing Xie,Yubin Wang,Ben Gao,Tianfan Fu,Xingjian Wei,Jiaxing Sun,Jiang Wu,Conghui He,Yuqiang Li,Qinying Gu
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Nuclear Magnetic Resonance (NMR) spectroscopy is fundamental for molecular structure elucidation, yet interpreting spectra at scale remains time-consuming and highly expertise-dependent. While recent spectrum-as-language modeling and retrieval-based methods have shown promise, they rely heavily on large corpora of computed spectra and exhibit notable performance drops when applied to experimental measurements. To address these issues, we build NMRSpec, a large-scale corpus of experimental ^1 H and ^13 C spectra mined from chemical literature, and propose NMRTrans, which models spectra as unordered peak sets and aligns the model’s inductive bias with the physical nature of NMR. To our best knowledge, NMRTrans is the first NMR Transformer trained solely on large-scale experimental spectra and achieves state-of-the-art performance on experimental benchmarks, improving Top-10 Accuracy over the strongest baseline by +17.82 points (61.15% vs. 43.33%), and underscoring the importance of experimental data and structure-aware architectures for reliable NMR structure elucidation.

[AI-93] PEST: Physics-Enhanced Swin Transformer for 3D Turbulence Simulation

【速读】:该论文旨在解决当前数据驱动方法在三维(3D)湍流模拟中面临的三大挑战:长期滚动预测的稳定性不足、物理一致性差以及对小尺度结构模拟不准确的问题。其解决方案的关键在于提出一种物理增强的Swin Transformer(Physics-Enhanced Swin Transformer, PEST),通过引入基于窗口的自注意力机制高效建模局部偏微分方程(PDE)相互作用,同时设计频域自适应损失函数以强化高频率小尺度结构的模拟精度,并将纳维-斯托克斯(Navier–Stokes)残差约束与散度自由正则化直接嵌入学习目标,从而显著提升模型的物理一致性和长期稳定性。

链接: https://arxiv.org/abs/2602.10150
作者: Yilong Dai,Shengyu Chen,Xiaowei Jia,Peyman Givi,Runlong Yu
机构: 未知
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP)
备注:

点击查看摘要

Abstract:Accurate simulation of turbulent flows is fundamental to scientific and engineering applications. Direct numerical simulation (DNS) offers the highest fidelity but is computationally prohibitive, while existing data-driven alternatives struggle with stable long-horizon rollouts, physical consistency, and faithful simulation of small-scale structures. These challenges are particularly acute in three-dimensional (3D) settings, where the cubic growth of spatial degrees of freedom dramatically amplifies computational cost, memory demand, and the difficulty of capturing multi-scale interactions. To address these challenges, we propose a Physics-Enhanced Swin Transformer (PEST) for 3D turbulence simulation. PEST leverages a window-based self-attention mechanism to effectively model localized PDE interactions while maintaining computational efficiency. We introduce a frequency-domain adaptive loss that explicitly emphasizes small-scale structures, enabling more faithful simulation of high-frequency dynamics. To improve physical consistency, we incorporate Navier–Stokes residual constraints and divergence-free regularization directly into the learning objective. Extensive experiments on two representative turbulent flow configurations demonstrate that PEST achieves accurate, physically consistent, and stable autoregressive long-term simulations, outperforming existing data-driven baselines.

[AI-94] When LLM s get significantly worse: A statistical approach to detect model degradations

【速读】:该论文旨在解决基础模型(foundation models)在推理成本与延迟优化过程中,如何准确判断模型质量是否真正退化的问题。现有方法如量化(quantization)虽能降低计算开销,但缺乏准确性保障;而理论无损的优化方法也可能因数值误差导致生成结果不稳定,从而引发误判。为应对这一挑战,论文提出一种基于麦考勒姆检验(McNemar’s test)的统计假设检验框架,其关键在于对每个样本的模型得分进行逐点对比,而非仅在任务层面聚合评估指标,从而更敏感地识别出实际的质量下降,并控制假阳性率。此外,论文还提出了三种跨多个基准测试的精度估计聚合策略,以实现统一决策。实验表明,该方法可有效区分真实退化与噪声波动,即使在0.3%的微小精度下降也能可靠检测。

链接: https://arxiv.org/abs/2602.10144
作者: Jonas Kübler,Kailash Budhathoki,Matthäus Kleindessner,Xiong Zhou,Junming Yin,Ashish Khetan,George Karypis
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: this https URL

点击查看摘要

Abstract:Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization. In all of these cases it is crucial to ensure that the model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless model optimizations due to numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is an evidence of a model’s degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar’s test allowing to efficiently detect model degradations, while guaranteeing a controlled rate of false positives. The crucial insight is that we have to confront the model scores on each sample, rather than aggregated on the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the largely adopted open source LM Evaluation Harness and provide a case study illustrating that the method correctly flags degraded models, while not flagging model optimizations that are provably lossless. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.

[AI-95] okaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models

【速读】:该论文旨在解决当前磁约束聚变能研究中,由于实验数据稀疏、噪声大且不完整,导致难以准确预测等离子体动力学行为的问题。同时,传统数值方法在处理复杂物理机制和异构实验数据时面临挑战,而现代以数据为中心的AI方法虽具潜力,却受限于缺乏结构化、公开可用的数据集与标准化基准测试平台。解决方案的关键在于提出TokaMark——一个面向真实实验数据的结构化基准测试框架,其核心优势在于:(1)统一多模态异构融合数据的访问入口;(2)标准化数据格式、元数据、时间对齐方式及评估协议,从而支持跨模型与跨任务的一致性比较;(3)提供涵盖14个物理机制的任务列表与基线模型,推动数据驱动的等离子体AI建模研究。该基准通过开源形式促进社区协作与可复现性,加速聚变能源领域AI应用的发展。

链接: https://arxiv.org/abs/2602.10132
作者: Cécile Rousseau,Samuel Jackson,Rodrigo H. Ordonez-Hurtado,Nicola C. Amorisco,Tobia Boschi,George K. Holt,Andrea Loreti,Eszter Székely,Alexander Whittle,Adriano Agnello,Stanislas Pamela,Alessandra Pascale,Robert Akers,Juan Bernabe Moreno,Sue Thorne,Mykhaylo Zayats
机构: 未知
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Development and operation of commercially viable fusion energy reactors such as tokamaks require accurate predictions of plasma dynamics from sparse, noisy, and incomplete sensors readings. The complexity of the underlying physics and the heterogeneity of experimental data pose formidable challenges for conventional numerical methods, while simultaneously highlights the promise of modern data-native AI approaches. A major obstacle in realizing this potential is, however, the lack of curated, openly available datasets and standardized benchmarks. Existing fusion datasets are scarce, fragmented across institutions, facility-specific, and inconsistently annotated, which limits reproducibility and prevents a fair and scalable comparison of AI approaches. In this paper, we introduce TokaMark, a structured benchmark to evaluate AI models on real experimental data collected from the Mega Ampere Spherical Tokamak (MAST). TokaMark provides a comprehensive suite of tools designed to (i) unify access to multi-modal heterogeneous fusion data (ii) harmonize formats, metadata, temporal alignment and evaluation protocols to enable consistent cross-model and cross-task comparisons. The benchmark includes a curated list of 14 tasks spanning a range of physical mechanisms, exploiting a variety of diagnostics and covering multiple target use cases. A baseline model is provided to facilitate transparent comparison and validation within a unified framework. By establishing a unified benchmark for both the fusion and AI-for-science communities, TokaMark aims to accelerate progress in data-driven plasma AI modeling, contributing to the broader goal of achieving sustainable and stable fusion energy. The benchmark, documentation, and tooling will be fully open sourced upon acceptance to encourage community adoption and contribution.

机器学习

[LG-0] YOR: Your Own Mobile Manipulator for Generalizable Robotics

链接: https://arxiv.org/abs/2602.11150
作者: Manan H Anjaria,Mehmet Enes Erciyes,Vedant Ghatnekar,Neha Navarkar,Haritheja Etukuru,Xiaole Jiang,Kanad Patel,Dhawal Kabra,Nicholas Wojno,Radhika Ajay Prayage,Soumith Chintala,Lerrel Pinto,Nur Muhammad Mahi Shafiullah,Zichen Jeff Cui
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in robot learning have generated significant interest in capable platforms that may eventually approach human-level competence. This interest, combined with the commoditization of actuators, has propelled growth in low-cost robotic platforms. However, the optimal form factor for mobile manipulation, especially on a budget, remains an open question. We introduce YOR, an open-source, low-cost mobile manipulator that integrates an omnidirectional base, a telescopic vertical lift, and two arms with grippers to achieve whole-body mobility and manipulation. Our design emphasizes modularity, ease of assembly using off-the-shelf components, and affordability, with a bill-of-materials cost under 10,000 USD. We demonstrate YOR’s capability by completing tasks that require coordinated whole-body control, bimanual manipulation, and autonomous navigation. Overall, YOR offers competitive functionality for mobile manipulation research at a fraction of the cost of existing platforms. Project website: this https URL

[LG-1] SCRAPL: Scattering Transform with Random Paths for Machine Learning ICLR2026

链接: https://arxiv.org/abs/2602.11145
作者: Christopher Mitcheltree,Vincent Lostanlen,Emmanouil Benetos,Mathieu Lagrange
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to ICLR 2026. Code, audio samples, and Python package provided at this https URL

点击查看摘要

Abstract:The Euclidean distance between wavelet scattering transform coefficients (known as paths) provides informative gradients for perceptual quality assessment of deep inverse problems in computer vision, speech, and audio processing. However, these transforms are computationally expensive when employed as differentiable loss functions for stochastic gradient descent due to their numerous paths, which significantly limits their use in neural network training. Against this problem, we propose “Scattering transform with Random Paths for machine Learning” (SCRAPL): a stochastic optimization scheme for efficient evaluation of multivariable scattering transforms. We implement SCRAPL for the joint time-frequency scattering transform (JTFS) which demodulates spectrotemporal patterns at multiple scales and rates, allowing a fine characterization of intermittent auditory textures. We apply SCRAPL to differentiable digital signal processing (DDSP), specifically, unsupervised sound matching of a granular synthesizer and the Roland TR-808 drum machine. We also propose an initialization heuristic based on importance sampling, which adapts SCRAPL to the perceptual content of the dataset, improving neural network convergence and evaluation performance. We make our code and audio samples available and provide SCRAPL as a Python package.

[LG-2] abICLv2: A better faster scalable and open tabular foundation model

链接: https://arxiv.org/abs/2602.11139
作者: Jingang Qu,David Holzmüller,Gaël Varoquaux,Marine Le Morvan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular foundation models, such as TabPFNv2 and TabICL, have recently dethroned gradient-boosted trees at the top of predictive benchmarks, demonstrating the value of in-context learning for tabular data. We introduce TabICLv2, a new state-of-the-art foundation model for regression and classification built on three pillars: (1) a novel synthetic data generation engine designed for high pretraining diversity; (2) various architectural innovations, including a new scalable softmax in attention improving generalization to larger datasets without prohibitive long-sequence pretraining; and (3) optimized pretraining protocols, notably replacing AdamW with the Muon optimizer. On the TabArena and TALENT benchmarks, TabICLv2 without any tuning surpasses the performance of the current state of the art, RealTabPFN-2.5 (hyperparameter-tuned, ensembled, and fine-tuned on real data). With only moderate pretraining compute, TabICLv2 generalizes effectively to million-scale datasets under 50GB GPU memory while being markedly faster than RealTabPFN-2.5. We provide extensive ablation studies to quantify these contributions and commit to open research by first releasing inference code and model weights at this https URL, with synthetic data engine and pretraining code to follow.

[LG-3] Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards

链接: https://arxiv.org/abs/2602.11128
作者: Reinhard Heckel,Mahdi Soltanolkotabi,Christos Thramboulidis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards has driven recent advances in LLM post-training, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients depending on the rewards. The most popular algorithms including GRPO, DAPO, and RLOO focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downgrading gradients with very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low, or even zero, empirical success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and less so in post-SFT RL where the model already starts at high accuracy. We also provide theory that characterizes prompt weights which minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.

[LG-4] he Offline-Frontier Shift: Diagnosing Distributional Limits in Generative Multi-Objective Optimization

链接: https://arxiv.org/abs/2602.11126
作者: Stephanie Holly,Alexandru-Ciprian Zăvoianu,Siegfried Silber,Sepp Hochreiter,Werner Zellinger
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline multi-objective optimization (MOO) aims to recover Pareto-optimal designs given a finite, static dataset. Recent generative approaches, including diffusion models, show strong performance under hypervolume, yet their behavior under other established MOO metrics is less understood. We show that generative methods systematically underperform evolutionary alternatives with respect to other metrics, such as generational distance. We relate this failure mode to the offline-frontier shift, i.e., the displacement of the offline dataset from the Pareto front, which acts as a fundamental limitation in offline MOO. We argue that overcoming this limitation requires out-of-distribution sampling in objective space (via an integral probability metric) and empirically observe that generative methods remain conservatively close to the offline objective distribution. Our results position offline MOO as a distribution-shift–limited problem and provide a diagnostic lens for understanding when and why generative optimization methods fail.

[LG-5] From Natural Language to Materials Discovery:The Materials Knowledge Navigation Agent

链接: https://arxiv.org/abs/2602.11123
作者: Genmao Zhuang,Amir Barati Farimani
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 22 pages,5 figures

点击查看摘要

Abstract:Accelerating the discovery of high-performance materials remains a central challenge across energy, electronics, and aerospace technologies, where traditional workflows depend heavily on expert intuition and computationally expensive simulations. Here we introduce the Materials Knowledge Navigation Agent (MKNA), a language-driven system that translates natural-language scientific intent into executable actions for database retrieval, property prediction, structure generation, and stability evaluation. Beyond automating tool invocation, MKNA autonomously extracts quantitative thresholds and chemically meaningful design motifs from literature and database evidence, enabling data-grounded hypothesis formation. Applied to the search for high-Debye-temperature ceramics, the agent identifies a literature-supported screening criterion (Theta_D 800 K), rediscovers canonical ultra-stiff materials such as diamond, SiC, SiN, and BeO, and proposes thermodynamically stable, previously unreported Be-C-rich compounds that populate the sparsely explored 1500-1700 K regime. These results demonstrate that MKNA not only finds stable candidates but also reconstructs interpretable design heuristics, establishing a generalizable platform for autonomous, language-guided materials exploration.

[LG-6] Statistical Learning Analysis of Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2602.11097
作者: David A. Barajas-Solano
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:We study the training and performance of physics-informed learning for initial and boundary value problems (IBVP) with physics-informed neural networks (PINNs) from a statistical learning perspective. Specifically, we restrict ourselves to parameterizations with hard initial and boundary condition constraints and reformulate the problem of estimating PINN parameters as a statistical learning problem. From this perspective, the physics penalty on the IBVP residuals can be better understood not as a regularizing term bus as an infinite source of indirect data, and the learning process as fitting the PINN distribution of residuals p(y \mid x, t, w) q(x, t) to the true data-generating distribution \delta(0) q(x, t) by minimizing the Kullback-Leibler divergence between the true and PINN distributions. Furthermore, this analysis show that physics-informed learning with PINNs is a singular learning problem, and we employ singular learning theory tools, namely the so-called Local Learning Coefficient (Lau et al., 2025) to analyze the estimates of PINN parameters obtained via stochastic optimization for a heat equation IBVP. Finally, we discuss implications of this analysis on the quantification of predictive uncertainty of PINNs and the extrapolation capacity of PINNs.

[LG-7] MerLin: A Discovery Engine for Photonic and Hybrid Quantum Machine Learning

链接: https://arxiv.org/abs/2602.11092
作者: Cassandre Notton,Benjamin Stott,Philippe Schoeb,Anthony Walsh,Grégoire Leboucher,Vincent Espitalier,Vassilis Apostolou,Louis-Félix Vigneux,Alexia Salavrakos,Jean Senellart
类目: Machine Learning (cs.LG); Programming Languages (cs.PL); Quantum Physics (quant-ph)
*备注: This work has been submitted to the 2026 IEEE World Congress on Computational Intelligence

点击查看摘要

Abstract:Identifying where quantum models may offer practical benefits in near term quantum machine learning (QML) requires moving beyond isolated algorithmic proposals toward systematic and empirical exploration across models, datasets, and hardware constraints. We introduce MerLin, an open source framework designed as a discovery engine for photonic and hybrid quantum machine learning. MerLin integrates optimized strong simulation of linear optical circuits into standard PyTorch and scikit learn workflows, enabling end to end differentiable training of quantum layers. MerLin is designed around systematic benchmarking and reproducibility. As an initial contribution, we reproduce eighteen state of the art photonic and hybrid QML works spanning kernel methods, reservoir computing, convolutional and recurrent architectures, generative models, and modern training paradigms. These reproductions are released as reusable, modular experiments that can be directly extended and adapted, establishing a shared experimental baseline consistent with empirical benchmarking methodologies widely adopted in modern artificial intelligence. By embedding photonic quantum models within established machine learning ecosystems, MerLin allows practitioners to leverage existing tooling for ablation studies, cross modality comparisons, and hybrid classical quantum workflows. The framework already implements hardware aware features, allowing tests on available quantum hardware while enabling exploration beyond its current capabilities, positioning MerLin as a future proof co design tool linking algorithms, benchmarks, and hardware.

[LG-8] oken-Efficient Change Detection in LLM APIs

链接: https://arxiv.org/abs/2602.11083
作者: Timothée Chauvin,Clément Lalanne,Erwan Le Merrer,Jean-Michel Loubes,François Taïani,Gilles Tredan
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Remote change detection in LLMs is a difficult problem. Existing methods are either too expensive for deployment at scale, or require initial white-box access to model weights or grey-box access to log probabilities. We aim to achieve both low cost and strict black-box operation, observing only output tokens. Our approach hinges on specific inputs we call Border Inputs, for which there exists more than one output top token. From a statistical perspective, optimal change detection depends on the model’s Jacobian and the Fisher information of the output distribution. Analyzing these quantities in low-temperature regimes shows that border inputs enable powerful change detection tests. Building on this insight, we propose the Black-Box Border Input Tracking (B3IT) scheme. Extensive in-vivo and in-vitro experiments show that border inputs are easily found for non-reasoning tested endpoints, and achieve performance on par with the best available grey-box approaches. B3IT reduces costs by 30\times compared to existing methods, while operating in a strict black-box setting. Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR) Cite as: arXiv:2602.11083 [cs.LG] (or arXiv:2602.11083v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.11083 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-9] Motion Capture is Not the Target Domain: Scaling Synthetic Data for Learning Motion Representations

链接: https://arxiv.org/abs/2602.11064
作者: Firas Darwish,George Nicholson,Aiden Doherty,Hang Yuan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthetic data offers a compelling path to scalable pretraining when real-world data is scarce, but models pretrained on synthetic data often fail to transfer reliably to deployment settings. We study this problem in full-body human motion, where large-scale data collection is infeasible but essential for wearable-based Human Activity Recognition (HAR), and where synthetic motion can be generated from motion-capture-derived representations. We pretrain motion time-series models using such synthetic data and evaluate their transfer across diverse downstream HAR tasks. Our results show that synthetic pretraining improves generalisation when mixed with real data or scaled sufficiently. We also demonstrate that large-scale motion-capture pretraining yields only marginal gains due to domain mismatch with wearable signals, clarifying key sim-to-real challenges and the limits and opportunities of synthetic motion data for transferable HAR representations.

[LG-10] Divide Harmonize Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models ICLR2026

链接: https://arxiv.org/abs/2602.11057
作者: Xinyu Yuan,Yan Qiao,Zonghui Wang,Wenzhi Chen
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2026

点击查看摘要

Abstract:The multi-commodity flow (MCF) problem is a fundamental topic in network flow and combinatorial optimization, with broad applications in transportation, communication, and logistics, etc. Nowadays, the rapid expansion of allocation systems has posed challenges for existing optimization engines in balancing optimality and tractability. In this paper, we present Pram, the first ML-based method that leverages the reasoning power of multimodal language models (MLMs) for addressing the trade-off dilemma – a great need of service providers. As part of our proposal, Pram (i) quickly computes high-quality allocations by dividing the original problem into local subproblems, which are then resolved by an MLM-powered “agent”, and (ii) ensures global consistency by harmonizing these subproblems via a multi-agent reinforcement learning algorithm. Theoretically, we show that Pram, which learns to perform gradient descent in context, provably converges to the optimum within the family of MCF problems. Empirically, on real-world datasets and public topologies, Pram achieves performance comparable to, and in some cases even surpassing, linear programming solvers (very close to the optimal solution), and substantially lower runtimes (1 to 2 orders of magnitude faster). Moreover, Pram exhibits strong robustness (10% performance degradation under link failures or flow bursts), demonstrating MLM’s generalization ability to unforeseen events. Pram is objective-agnostic and seamlessly integrates with mainstream allocation systems, providing a practical and scalable solution for future networks.

[LG-11] When Fusion Helps and When It Breaks: View-Aligned Robustness in Same-Source Financial Imaging

链接: https://arxiv.org/abs/2602.11020
作者: Rui Ma
类目: Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注:

点击查看摘要

Abstract:We study same-source multi-view learning and adversarial robustness for next-day direction prediction with financial image representations. On Shanghai Gold Exchange (SGE) spot gold data (2005-2025), we construct two window-aligned views from each rolling window: an OHLCV-rendered price/volume chart and a technical-indicator matrix. To ensure reliable evaluation, we adopt leakage-resistant time-block splits with embargo and use Matthews correlation coefficient (MCC). We find that results depend strongly on the label-noise regime: we apply an ex-post minimum-movement filter that discards samples with realized next-day absolute return below tau to define evaluation subsets with reduced near-zero label ambiguity. This induces a non-monotonic data-noise trade-off that can reveal predictive signal but eventually increases variance as sample size shrinks; the filter is used for offline benchmark construction rather than an inference-time decision rule. In the stabilized subsets, fusion is regime dependent: early fusion by channel stacking can exhibit negative transfer, whereas late fusion with dual encoders and a fusion head provides the dominant clean-performance gains; cross-view consistency regularization has secondary, backbone-dependent effects. We further evaluate test-time L-infinity perturbations using FGSM and PGD under two threat scenarios: view-constrained attacks that perturb one view and joint attacks that perturb both. We observe severe vulnerability at tiny budgets with strong view asymmetry. Late fusion consistently improves robustness under view-constrained attacks, but joint attacks remain challenging and can still cause substantial worst-case degradation.

[LG-12] VCACHE: A Stateful Tool-Value Cache for Post-Training LLM Agents

链接: https://arxiv.org/abs/2602.10986
作者: Abhishek Vijaya Kumar,Bhaskar Kataria,Byungsoo Oh,Emaad Manzoor,Rachee Singh
类目: Machine Learning (cs.LG)
*备注: Abhishek Vijaya Kumar and Bhaskar Kataria have equal contribution

点击查看摘要

Abstract:In RL post-training of LLM agents, calls to external tools take several seconds or even minutes, leaving allocated GPUs idle and inflating post-training time and cost. While many tool invocations repeat across parallel rollouts and could in principle be cached, naively caching their outputs for reuse is incorrect since tool outputs depend on the environment state induced by prior agent interactions. We present TVCACHE, a stateful tool-value cache for LLM agent post-training. TVCACHE maintains a tree of observed tool-call sequences and performs longest-prefix matching for cache lookups: a hit occurs only when the agent’s full tool history matches a previously executed sequence, guaranteeing identical environment state. On three diverse workloads-terminal-based tasks, SQL generation, and video understanding. TVCACHE achieves cache hit rates of up to 70% and reduces median tool call execution time by up to 6.9X, with no degradation in post-training reward accumulation.

[LG-13] Sample Efficient Generative Molecular Optimization with Joint Self-Improvement

链接: https://arxiv.org/abs/2602.10984
作者: Serra Korkmaz,Adam Izdebski,Jonathan Pirnay,Rasmus Møller-Larsen,Michal Kmicikiewicz,Pankhil Gawade,Dominik G. Grimm,Ewa Szczurek
类目: Machine Learning (cs.LG)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Generative molecular optimization aims to design molecules with properties surpassing those of existing compounds. However, such candidates are rare and expensive to evaluate, yielding sample efficiency essential. Additionally, surrogate models introduced to predict molecule evaluations, suffer from distribution shift as optimization drives candidates increasingly out-of-distribution. To address these challenges, we introduce Joint Self-Improvement, which benefits from (i) a joint generative-predictive model and (ii) a self-improving sampling scheme. The former aligns the generator with the surrogate, alleviating distribution shift, while the latter biases the generative part of the joint model using the predictive one to efficiently generate optimized molecules at inference-time. Experiments across offline and online molecular optimization benchmarks demonstrate that Joint Self-Improvement outperforms state-of-the-art methods under limited evaluation budgets.

[LG-14] A Jointly Efficient and Optimal Algorithm for Heteroskedastic Generalized Linear Bandits with Adversarial Corruptions

链接: https://arxiv.org/abs/2602.10971
作者: Sanghwa Kim,Junghyun Lee,Se-Young Yun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 49 pages, 1 table

点击查看摘要

Abstract:We consider the problem of heteroskedastic generalized linear bandits (GLBs) with adversarial corruptions, which subsumes various stochastic contextual bandit settings, including heteroskedastic linear bandits and logistic/Poisson bandits. We propose HCW-GLB-OMD, which consists of two components: an online mirror descent (OMD)-based estimator and Hessian-based confidence weights to achieve corruption robustness. This is computationally efficient in that it only requires O(1) space and time complexity per iteration. Under the self-concordance assumption on the link function, we show a regret bound of \tildeO\left( d \sqrt\sum_t g(\tau_t) \dot\mu_t,\star + d^2 g_\max \kappa + d \kappa C \right) , where \dot\mu_t,\star is the slope of \mu around the optimal arm at time t , g(\tau_t) 's are potentially exogenously time-varying dispersions (e.g., g(\tau_t) = \sigma_t^2 for heteroskedastic linear bandits, g(\tau_t) = 1 for Bernoulli and Poisson), g_\max = \max_t \in [T] g(\tau_t) is the maximum dispersion, and C \geq 0 is the total corruption budget of the adversary. We complement this with a lower bound of \tilde\Omega(d \sqrt\sum_t g(\tau_t) \dot\mu_t,\star + d C) , unifying previous problem-specific lower bounds. Thus, our algorithm achieves, up to a \kappa -factor in the corruption term, instance-wise minimax optimality simultaneously across various instances of heteroskedastic GLBs with adversarial corruptions.

[LG-15] MoEEdit: Efficient and Routing-Stable Knowledge Editing for Mixture-of-Experts LLM s

链接: https://arxiv.org/abs/2602.10965
作者: Yupu Gu,Rongzhe Wei,Andy Zhu,Pan Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge editing (KE) enables precise modifications to factual content in large language models (LLMs). Existing KE methods are largely designed for dense architectures, limiting their applicability to the increasingly prevalent sparse Mixture-of-Experts (MoE) models that underpin modern scalable LLMs. Although MoEs offer strong efficiency and capacity scaling, naively adapting dense-model editors is both computationally costly and prone to routing distribution shifts that undermine stability and consistency. To address these challenges, we introduce MoEEdit, the first routing-stable framework for parameter-modifying knowledge editing in MoE LLMs. Our method reparameterizes expert updates via per-expert null-space projections that keep router inputs invariant and thereby suppress routing shifts. The resulting block-structured optimization is solved efficiently with a block coordinate descent (BCD) solver. Experiments show that MoEEdit attains state-of-the-art efficacy and generalization while preserving high specificity and routing stability, with superior compute and memory efficiency. These results establish a robust foundation for scalable, precise knowledge editing in sparse LLMs and underscore the importance of routing-stable interventions.

[LG-16] Stochastic Parroting in Temporal Attention – Regulating the Diagonal Sink

链接: https://arxiv.org/abs/2602.10956
作者: Victoria Hankemeier,Malte Hankemeier
类目: Machine Learning (cs.LG)
*备注: Accepted at ESANN 2026, Code: this https URL

点击查看摘要

Abstract:Spatio-temporal models analyze spatial structures and temporal dynamics, which makes them prone to information degeneration among space and time. Prior literature has demonstrated that over-squashing in causal attention or temporal convolutions creates a bias on the first tokens. To analyze whether such a bias is present in temporal attention mechanisms, we derive sensitivity bounds on the expected value of the Jacobian of a temporal attention layer. We theoretically show how off-diagonal attention scores depend on the sequence length, and that temporal attention matrices suffer a diagonal attention sink. We suggest regularization methods, and experimentally demonstrate their effectiveness.

[LG-17] CMAD: Cooperative Multi-Agent Diffusion via Stochastic Optimal Control

链接: https://arxiv.org/abs/2602.10933
作者: Riccardo Barbano,Alexander Denker,Zeljko Kereta,Runchang Li,Francisco Vargas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continuous-time generative models have achieved remarkable success in image restoration and synthesis. However, controlling the composition of multiple pre-trained models remains an open challenge. Current approaches largely treat composition as an algebraic composition of probability densities, such as via products or mixtures of experts. This perspective assumes the target distribution is known explicitly, which is almost never the case. In this work, we propose a different paradigm that formulates compositional generation as a cooperative Stochastic Optimal Control problem. Rather than combining probability densities, we treat pre-trained diffusion models as interacting agents whose diffusion trajectories are jointly steered, via optimal control, toward a shared objective defined on their aggregated output. We validate our framework on conditional MNIST generation and compare it against a naive inference-time DPS-style baseline replacing learned cooperative control with per-step gradient guidance.

[LG-18] Spatial-Morphological Modeling for Multi-Attribute Imputation of Urban Blocks

链接: https://arxiv.org/abs/2602.10923
作者: Vasilii Starikov,Ruslan Kozliak,Georgii Kontsevik,Sergey Mityagin
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Accurate reconstruction of missing morphological indicators of a city is crucial for urban planning and data-driven analysis. This study presents the spatial-morphological (SM) imputer tool, which combines data-driven morphological clustering with neighborhood-based methods to reconstruct missing values of the floor space index (FSI) and ground space index (GSI) at the city block level, inspired by the SpaceMatrix framework. This approach combines city-scale morphological patterns as global priors with local spatial information for context-dependent interpolation. The evaluation shows that while SM alone captures meaningful morphological structure, its combination with inverse distance weighting (IDW) or spatial k-nearest neighbor (sKNN) methods provides superior performance compared to existing SOTA models. Composite methods demonstrate the complementary advantages of combining morphological and spatial approaches.

[LG-19] Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins

链接: https://arxiv.org/abs/2602.10917
作者: Qian Zuo,Zhiyong Wang,Fengxiang He
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret inevitably incur growing strong constraint violation or are restricted to average-iterate convergence due to inherent oscillations. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant \tildeO(1) strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy, where the safety margin is rigorously scheduled to asymptotically majorize the functional decay rates of the optimization and statistical errors, thereby clamping cumulative violations to a near-constant level. Furthermore, we establish non-asymptotic last-iterate convergence guarantees via a policy-dual Lyapunov argument. Experiments corroborate our theoretical findings.

[LG-20] uning the burn-in phase in training recurrent neural networks improves their performance ICLR2026

链接: https://arxiv.org/abs/2602.10911
作者: Julian D. Schiller,Malte Heinrich,Victor G. Lopez,Matthias A. Müller
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: Published as a conference paper at ICLR 2026, this https URL

点击查看摘要

Abstract:Training recurrent neural networks (RNNs) with standard backpropagation through time (BPTT) can be challenging, especially in the presence of long input sequences. A practical alternative to reduce computational and memory overhead is to perform BPTT repeatedly over shorter segments of the training data set, corresponding to truncated BPTT. In this paper, we examine the training of RNNs when using such a truncated learning approach for time series tasks. Specifically, we establish theoretical bounds on the accuracy and performance loss when optimizing over subsequences instead of the full data sequence. This reveals that the burn-in phase of the RNN is an important tuning knob in its training, with significant impact on the performance guarantees. We validate our theoretical results through experiments on standard benchmarks from the fields of system identification and time series forecasting. In all experiments, we observe a strong influence of the burn-in phase on the training process, and proper tuning can lead to a reduction of the prediction error on the training and test data of more than 60% in some cases.

[LG-21] Natural Hypergradient Descent: Algorithm Design Convergence Analysis and Parallel Implementation

链接: https://arxiv.org/abs/2602.10905
作者: Deyi Kong,Zaiwei Chen,Shuzhong Zhang,Shancong Mou
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this work, we propose Natural Hypergradient Descent (NHGD), a new method for solving bilevel optimization problems. To address the computational bottleneck in hypergradient estimation–namely, the need to compute or approximate Hessian inverse–we exploit the statistical structure of the inner optimization problem and use the empirical Fisher information matrix as an asymptotically consistent surrogate for the Hessian. This design enables a parallel optimize-and-approximate framework in which the Hessian-inverse approximation is updated synchronously with the stochastic inner optimization, reusing gradient information at negligible additional cost. Our main theoretical contribution establishes high-probability error bounds and sample complexity guarantees for NHGD that match those of state-of-the-art optimize-then-approximate methods, while significantly reducing computational time overhead. Empirical evaluations on representative bilevel learning tasks further demonstrate the practical advantages of NHGD, highlighting its scalability and effectiveness in large-scale machine learning settings.

[LG-22] Anomaly Detection with Machine Learning Algorithms in Large-Scale Power Grids

链接: https://arxiv.org/abs/2602.10888
作者: Marc Gillioz,Guillaume Dubuis,Étienne Voutaz,Philippe Jacquod
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 12 pages, 9 figures

点击查看摘要

Abstract:We apply several machine learning algorithms to the problem of anomaly detection in operational data for large-scale, high-voltage electric power grids. We observe important differences in the performance of the algorithms. Neural networks typically outperform classical algorithms such as k-nearest neighbors and support vector machines, which we explain by the strong contextual nature of the anomalies. We show that unsupervised learning algorithm work remarkably well and that their predictions are robust against simultaneous, concurring anomalies.

[LG-23] he Sample Complexity of Uniform Approximation for Multi-Dimensional CDFs and Fixed-Price Mechanisms

链接: https://arxiv.org/abs/2602.10868
作者: Matteo Castiglioni,Anna Lunghi,Alberto Marchesi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the sample complexity of learning a uniform approximation of an n -dimensional cumulative distribution function (CDF) within an error \epsilon 0 , when observations are restricted to a minimal one-bit feedback. This serves as a counterpart to the multivariate DKW inequality under ‘‘full feedback’’, extending it to the setting of ‘‘bandit feedback’’. Our main result shows a near-dimensional-invariance in the sample complexity: we get a uniform \epsilon -approximation with a sample complexity \frac1\epsilon^3\log\left(\frac 1 \epsilon \right)^\mathcalO(n) over a arbitrary fine grid, where the dimensionality n only affects logarithmic terms. As direct corollaries, we provide tight sample complexity bounds and novel regret guarantees for learning fixed-price mechanisms in small markets, such as bilateral trade settings.

[LG-24] Automated Model Design using Gated Neuron Selection in Telecom

链接: https://arxiv.org/abs/2602.10854
作者: Adam Orucu,Marcus Medhage,Farnaz Moradi,Andreas Johnsson,Sarunas Girdzijauskas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The telecommunications industry is experiencing rapid growth in adopting deep learning for critical tasks such as traffic prediction, signal strength prediction, and quality of service optimisation. However, designing neural network architectures for these applications remains challenging and time-consuming, particularly when targeting compact models suitable for resource-constrained network environments. Therefore, there is a need for automating the model design process to create high-performing models efficiently. This paper introduces TabGNS (Tabular Gated Neuron Selection), a novel gradient-based Neural Architecture Search (NAS) method specifically tailored for tabular data in telecommunications networks. We evaluate TabGNS across multiple telecommunications and generic tabular datasets, demonstrating improvements in prediction performance while reducing the architecture size by 51-82% and reducing the search time by up to 36x compared to state-of-the-art tabular NAS methods. Integrating TabGNS into the model lifecycle management enables automated design of neural networks throughout the lifecycle, accelerating deployment of ML solutions in telecommunications networks.

[LG-25] SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios

链接: https://arxiv.org/abs/2602.10840
作者: Yanan Wang,Renxi Wang,Yongxin Wang,Xuezhi Liang,Fajri Koto,Timothy Baldwin,Xiaodan Liang,Haonan Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been extensively studied for tasks like math competitions, complex coding, and scientific reasoning, yet their ability to accurately represent and simulate physical scenarios via code remains underexplored. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical scenarios across five physics domains and 52 physical concepts. We build an automatic pipeline to collect data, with human verification to ensure quality. The final dataset contains 7,659 physical scenarios with 334 human-verified examples as the test set. We evaluated 10 contemporary LLMs and found that even the strongest model achieves only a 21.5% pass rate, demonstrating the difficulty of the task. Finally, we introduce a reinforcement learning pipeline with visual rewards that uses a vision-language model as a judge to train textual models. Experiments show that training with our data improves physical simulation via code while substantially enhancing general code generation performance.

[LG-26] Adaptive Sampling for Private Worst-Case Group Optimization

链接: https://arxiv.org/abs/2602.10820
作者: Max Cairney-Leeming,Amartya Sanyal,Christoph H. Lampert
类目: Machine Learning (cs.LG)
*备注: 8 pages, 3 figures

点击查看摘要

Abstract:Models trained by minimizing the average loss often fail to be accurate on small or hard-to-learn groups of the data. Various methods address this issue by optimizing a weighted objective that focuses on the worst-performing groups. However, this approach becomes problematic when learning with differential privacy, as unequal data weighting can result in inhomogeneous privacy guarantees, in particular weaker privacy for minority groups. In this work, we introduce a new algorithm for differentially private worst-case group optimization called ASC (Adaptively Sampled and Clipped Worst-case Group Optimization). It adaptively controls both the sampling rate and the clipping threshold of each group. Thereby, it allows for harder-to-learn groups to be sampled more often while ensuring consistent privacy guarantees across all groups. Comparing ASC to prior work, we show that it results in lower-variance gradients, tighter privacy guarantees, and substantially higher worst-case group accuracy without sacrificing overall average accuracy.

[LG-27] RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization

链接: https://arxiv.org/abs/2602.10819
作者: Linxuan Xia,Xiaolong Yang,Yongyuan Chen,Enyue Zhao,Deng Cai,Yasheng Wang,Boxi Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model’s generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model’s current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.

[LG-28] PRISM: Parallel Residual Iterative Sequence Model

链接: https://arxiv.org/abs/2602.10796
作者: Jie Jiang,Ke Cheng,Xin Xu,Mengyang Pang,Tianhao Lu,Jiaheng Li,Yue Liu,Yuan Wang,Jun Zhang,Huan Yu,Zhouchen Lin
类目: Machine Learning (cs.LG)
*备注: 21 pages, 2 figures

点击查看摘要

Abstract:Generative sequence modeling faces a fundamental tension between the expressivity of Transformers and the efficiency of linear sequence models. Existing efficient architectures are theoretically bounded by shallow, single-step linear updates, while powerful iterative methods like Test-Time Training (TTT) break hardware parallelism due to state-dependent gradients. We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We employ a Write-Forget Decoupling strategy that isolates non-linearity within the injection operator. To bypass the serial dependency of explicit solvers, PRISM utilizes a two-stage proxy architecture: a short-convolution anchors the initial residual using local history energy, while a learned predictor estimates the refinement updates directly from the input. This design distills structural patterns associated with iterative correction into a parallelizable feedforward operator. Theoretically, we prove that this formulation achieves Rank- L accumulation, structurally expanding the update manifold beyond the single-step Rank- 1 bottleneck. Empirically, it achieves comparable performance to explicit optimization methods while achieving 174x higher throughput.

[LG-29] Semi-Supervised Cross-Domain Imitation Learning

链接: https://arxiv.org/abs/2602.10793
作者: Li-Min Chu,Kai-Siang Ma,Ming-Hong Chen,Ping-Chun Hsieh
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Published in Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:Cross-domain imitation learning (CDIL) accelerates policy learning by transferring expert knowledge across domains, which is valuable in applications where the collection of expert data is costly. Existing methods are either supervised, relying on proxy tasks and explicit alignment, or unsupervised, aligning distributions without paired data, but often unstable. We introduce the Semi-Supervised CDIL (SS-CDIL) setting and propose the first algorithm for SS-CDIL with theoretical justification. Our method uses only offline data, including a small number of target expert demonstrations and some unlabeled imperfect trajectories. To handle domain discrepancy, we propose a novel cross-domain loss function for learning inter-domain state-action mappings and design an adaptive weight function to balance the source and target knowledge. Experiments on MuJoCo and Robosuite show consistent gains over the baselines, demonstrating that our approach achieves stable and data-efficient policy learning with minimal supervision. Our code is available at~ this https URL.

[LG-30] Collaborative Threshold Watermarking

链接: https://arxiv.org/abs/2602.10765
作者: Tameem Bakr,Anish Ambreth,Nils Lukas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In federated learning (FL), K clients jointly train a model without sharing raw data. Because each participant invests data and compute, clients need mechanisms to later prove the provenance of a jointly trained model. Model watermarking embeds a hidden signal in the weights, but naive approaches either do not scale with many clients as per-client watermarks dilute as K grows, or give any individual client the ability to verify and potentially remove the watermark. We introduce (t,K) -threshold watermarking: clients collaboratively embed a shared watermark during training, while only coalitions of at least t clients can reconstruct the watermark key and verify a suspect model. We secret-share the watermark key \tau so that coalitions of fewer than t clients cannot reconstruct it, and verification can be performed without revealing \tau in the clear. We instantiate our protocol in the white-box setting and evaluate on image classification. Our watermark remains detectable at scale ( K=128 ) with minimal accuracy loss and stays above the detection threshold ( z\ge 4 ) under attacks including adaptive fine-tuning using up to 20% of the training data.

[LG-31] Predicting integers from continuous parameters

链接: https://arxiv.org/abs/2602.10751
作者: Bas Maat,Peter Bloem
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers. For example, the number of up-votes on social media posts, or the number of bicycles available at a public rental station. While it is possible to model these as continuous values, and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which leads us to the question whether such integer labels can be modeled directly by a discrete distribution, whose parameters are predicted from the features of a given instance. Moreover, we focus on the use case of output distributions of neural networks, which adds the requirement that the parameters of the distribution be continuous so that backpropagation and gradient descent may be used to learn the weights of the network. We investigate several options for such distributions, some existing and some novel, and test them on a range of tasks, including tabular learning, sequential prediction and image generation. We find that overall the best performance comes from two distributions: Bitwise, which represents the target integer in bits and places a Bernoulli distribution on each, and a discrete analogue of the Laplace distribution, which uses a distribution with exponentially decaying tails around a continuous mean.

[LG-32] Kalman Linear Attention: Parallel Bayesian Filtering For Efficient Language Modelling and State Tracking

链接: https://arxiv.org/abs/2602.10743
作者: Vaisakh Shaj,Cameron Barker,Aidan Scannell,Andras Szecsenyi,Elliot J. Crowley,Amos Storkey
类目: Machine Learning (cs.LG)
*备注: Preprint. A version of this work was accepted and presented at the 1st Workshop on Epistemic Intelligence in Machine Learning (EIML) at EurIPS 2025

点击查看摘要

Abstract:State-space language models such as Mamba and gated linear attention (GLA) offer efficient alternatives to transformers due to their linear complexity and parallel training, but often lack the expressivity and robust state-tracking needed for complex reasoning. We address these limitations by reframing sequence modelling through a probabilistic lens, using Bayesian filters as a core primitive. While classical filters such as Kalman filters provide principled state estimation and uncertainty tracking, they are typically viewed as inherently sequential. We show that reparameterising the Kalman filter in information form enables its updates to be computed via an associative scan, allowing efficient parallel training. Building on this insight, we introduce the Kalman Linear Attention (KLA) layer, a neural sequence-modelling primitive that performs time-parallel probabilistic inference while maintaining explicit belief-state uncertainty. KLA offers strictly more expressive nonlinear updates and gating than GLA variants while retaining their computational advantages. On language modelling tasks, KLA matches or outperforms modern SSMs and GLAs across representative discrete token-manipulation and state-tracking benchmarks.

[LG-33] Rising Multi-Armed Bandits with Known Horizons

链接: https://arxiv.org/abs/2602.10727
作者: Seockbean Song,Chenyu Gan,Youngsik Yoon,Siwei Wang,Wei Chen,Jungseul Ok
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Rising Multi-Armed Bandit (RMAB) framework models environments where expected rewards of arms increase with plays, which models practical scenarios where performance of each option improves with the repeated usage, such as in robotics and hyperparameter tuning. For instance, in hyperparameter tuning, the validation accuracy of a model configuration (arm) typically increases with each training epoch. A defining characteristic of RMAB is em horizon-dependent optimality: unlike standard settings, the optimal strategy here shifts dramatically depending on the available budget T . This implies that knowledge of T yields significantly greater utility in RMAB, empowering the learner to align its decision-making with this shifting optimality. However, the horizon-aware setting remains underexplored. To address this, we propose a novel CUmulative Reward Estimation UCB (CURE-UCB) that explicitly integrates the horizon. We provide a rigorous analysis establishing a new regret upper bound and prove that our method strictly outperforms horizon-agnostic strategies in structured environments like ``linear-then-flat’’ instances. Extensive experiments demonstrate its significant superiority over baselines.

[LG-34] Reducing Estimation Uncertainty Using Normalizing Flows and Stratification

链接: https://arxiv.org/abs/2602.10706
作者: Paweł Lorek,Rafał Topolnicki,Tomasz Trzciński,Maciej Zięba,Aleksandra Krystecka
类目: Machine Learning (cs.LG)
*备注: This is the extended version of a paper accepted for publication at ACIIDS 2026

点击查看摘要

Abstract:Estimating the expectation of a real-valued function of a random variable from sample data is a critical aspect of statistical analysis, with far-reaching implications in various applications. Current methodologies typically assume (semi-)parametric distributions such as Gaussian or mixed Gaussian, leading to significant estimation uncertainty if these assumptions do not hold. We propose a flow-based model, integrated with stratified sampling, that leverages a parametrized neural network to offer greater flexibility in modeling unknown data distributions, thereby mitigating this limitation. Our model shows a marked reduction in estimation uncertainty across multiple datasets, including high-dimensional (30 and 128) ones, outperforming crude Monte Carlo estimators and Gaussian mixture models. Reproducible code is available at this https URL.

[LG-35] A Unified Experimental Architecture for Informative Path Planning : from Simulation to Deployment with GuadalPlanner

链接: https://arxiv.org/abs/2602.10702
作者: Alejandro Mendoza Barrionuevo,Dame Seck Diop,Alejandro Casado Pérez,Daniel Gutiérrez Reina,Sergio L. Toral Marín,Samuel Yanes Luis
类目: Robotics (cs.RO); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:The evaluation of informative path planning algorithms for autonomous vehicles is often hindered by fragmented execution pipelines and limited transferability between simulation and real-world deployment. This paper introduces a unified architecture that decouples high-level decision-making from vehicle-specific control, enabling algorithms to be evaluated consistently across different abstraction levels without modification. The proposed architecture is realized through GuadalPlanner, which defines standardized interfaces between planning, sensing, and vehicle execution. It is an open and extensible research tool that supports discrete graph-based environments and interchangeable planning strategies, and is built upon widely adopted robotics technologies, including ROS2, MAVLink, and MQTT. Its design allows the same algorithmic logic to be deployed in fully simulated environments, software-in-the-loop configurations, and physical autonomous vehicles using an identical execution pipeline. The approach is validated through a set of experiments, including real-world deployment on an autonomous surface vehicle performing water quality monitoring with real-time sensor feedback.

[LG-36] Domain Knowledge Guided Bayesian Optimization For Autonomous Alignment Of Complex Scientific Instruments

链接: https://arxiv.org/abs/2602.10670
作者: Aashwin Mishra,Matt Seaberg,Ryan Roussel,Daniel Ratner,Apurva Mehta
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:

点击查看摘要

Abstract:Bayesian Optimization (BO) is a powerful tool for optimizing complex non-linear systems. However, its performance degrades in high-dimensional problems with tightly coupled parameters and highly asymmetric objective landscapes, where rewards are sparse. In such needle-in-a-haystack scenarios, even advanced methods like trust-region BO (TurBO) often lead to unsatisfactory results. We propose a domain knowledge guided Bayesian Optimization approach, which leverages physical insight to fundamentally simplify the search problem by transforming coordinates to decouple input features and align the active subspaces with the primary search axes. We demonstrate this approach’s efficacy on a challenging 12-dimensional, 6-crystal Split-and-Delay optical system, where conventional approaches, including standard BO, TuRBO and multi-objective BO, consistently led to unsatisfactory results. When combined with an reverse annealing exploration strategy, this approach reliably converges to the global optimum. The coordinate transformation itself is the key to this success, significantly accelerating the search by aligning input co-ordinate axes with the problem’s active subspaces. As increasingly complex scientific instruments, from large telescopes to new spectrometers at X-ray Free Electron Lasers are deployed, the demand for robust high-dimensional optimization grows. Our results demonstrate a generalizable paradigm: leveraging physical insight to transform high-dimensional, coupled optimization problems into simpler representations can enable rapid and robust automated tuning for consistent high performance while still retaining current optimization algorithms.

[LG-37] Evaluation metrics for temporal preservation in synthetic longitudinal patient data

链接: https://arxiv.org/abs/2602.10643
作者: Katariina Perkonoja,Parisa Movahedi,Antti Airola,Kari Auranen,Joni Virta
类目: Machine Learning (cs.LG)
*备注: 50 pages, 17 figures

点击查看摘要

Abstract:This study introduces a set of metrics for evaluating temporal preservation in synthetic longitudinal patient data, defined as artificially generated data that mimic real patients’ repeated measurements over time. The proposed metrics assess how synthetic data reproduces key temporal characteristics, categorized into marginal, covariance, individual-level and measurement structures. We show that strong marginal-level resemblance may conceal distortions in covariance and disruptions in individual-level trajectories. Temporal preservation is influenced by factors such as original data quality, measurement frequency, and preprocessing strategies, including binning, variable encoding and precision. Variables with sparse or highly irregular measurement times provide limited information for learning temporal dependencies, resulting in reduced resemblance between the synthetic and original data. No single metric adequately captures temporal preservation; instead, a multidimensional evaluation across all characteristics provides a more comprehensive assessment of synthetic data quality. Overall, the proposed metrics clarify how and why temporal structures are preserved or degraded, enabling more reliable evaluation and improvement of generative models and supporting the creation of temporally realistic synthetic longitudinal patient data.

[LG-38] Coarse-Grained Boltzmann Generators

链接: https://arxiv.org/abs/2602.10637
作者: Weilong Chen,Bojun Zhao,Jan Eckwert,Julija Zavadlav
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Chemical Physics (physics.chem-ph); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sampling equilibrium molecular configurations from the Boltzmann distribution is a longstanding challenge. Boltzmann Generators (BGs) address this by combining exact-likelihood generative models with importance sampling, but their practical scalability is limited. Meanwhile, coarse-grained surrogates enable the modeling of larger systems by reducing effective dimensionality, yet often lack the reweighting process required to ensure asymptotically correct statistics. In this work, we propose Coarse-Grained Boltzmann Generators (CG-BGs), a principled framework that unifies scalable reduced-order modeling with the exactness of importance sampling. CG-BGs act in a coarse-grained coordinate space, using a learned potential of mean force (PMF) to reweight samples generated by a flow-based model. Crucially, we show that this PMF can be efficiently learned from rapidly converged data via force matching. Our results demonstrate that CG-BGs faithfully capture complex interactions mediated by explicit solvent within highly reduced representations, establishing a scalable pathway for the unbiased sampling of larger molecular systems.

[LG-39] Generative clinical time series models trained on moderate amounts of patient data are privacy preserving

链接: https://arxiv.org/abs/2602.10631
作者: Rustam Zhumagambetov,Niklas Giesa,Sebastian D. Boie,Stefan Haufe
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sharing medical data for machine learning model training purposes is often impossible due to the risk of disclosing identifying information about individual patients. Synthetic data produced by generative artificial intelligence (genAI) models trained on real data is often seen as one possible solution to comply with privacy regulations. While powerful genAI models for heterogeneous hospital time series have recently been introduced, such modeling does not guarantee privacy protection, as the generated data may still reveal identifying information about individuals in the models’ training cohort. Applying established privacy mechanisms to generative time series models, however, proves challenging as post-hoc data anonymization through k-anonymization or similar techniques is limited, while model-centered privacy mechanisms that implement differential privacy (DP) may lead to unstable training, compromising the utility of generated data. Given these known limitations, privacy audits for generative time series models are currently indispensable regardless of the concrete privacy mechanisms applied to models and/or data. In this work, we use a battery of established privacy attacks to audit state-of-the-art hospital time series models, trained on the public MIMIC-IV dataset, with respect to privacy preservation. Furthermore, the eICU dataset was used to mount a privacy attack against the synthetic data generator trained on the MIMIC-IV dataset. Results show that established privacy attacks are ineffective against generated multivariate clinical time series when synthetic data generators are trained on large enough training datasets. Furthermore, we discuss how the use of existing DP mechanisms for these synthetic data generators would not bring desired improvement in privacy, but only a decrease in utility for machine learning prediction tasks.

[LG-40] Pupillometry and Brain Dynamics for Cognitive Load in Working Memory

链接: https://arxiv.org/abs/2602.10614
作者: Nusaibah Farrukh,Malavika Pradeep,Akshay Sasi,Rahul Venugopal,Elizabeth Sherly
类目: Machine Learning (cs.LG)
*备注: 6 Pages, 3 Figures, 5 Tables, Code Available at: this https URL

点击查看摘要

Abstract:Cognitive load, the mental effort required during working memory, is central to neuroscience, psychology, and human-computer interaction. Accurate assessment is vital for adaptive learning, clinical monitoring, and brain-computer interfaces. Physiological signals such as pupillometry and electroencephalography are established biomarkers of cognitive load, but their comparative utility and practical integration as lightweight, wearable monitoring solutions remain underexplored. EEG provides high temporal resolution of neural activity. Although non-invasive, it is technologically demanding and limited in wearability and cost due to its resource-intensive nature, whereas pupillometry is non-invasive, portable, and scalable. Existing studies often rely on deep learning models with limited interpretability and substantial computational expense. This study integrates feature-based and model-driven approaches to advance time-series analysis. Using the OpenNeuro ‘Digit Span Task’ dataset, this study investigates cognitive load classification from EEG and pupillometry. Feature-based approaches using Catch-22 features and classical machine learning models outperform deep learning in both binary and multiclass tasks. The findings demonstrate that pupillometry alone can compete with EEG, serving as a portable and practical proxy for real-world applications. These results challenge the assumption that EEG is necessary for load detection, showing that pupil dynamics combined with interpretable models and SHAP based feature analysis provide physiologically meaningful insights. This work supports the development of wearable, affordable cognitive monitoring systems for neuropsychiatry, education, and healthcare.

[LG-41] On the Role of Consistency Between Physics and Data in Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2602.10611
作者: Nicolás Becerra-Zuniga,Lucas Lacasa,Eusebio Valero,Gonzalo Rubio
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注: 24 pages, 7 Figures, 3 Tables

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have gained significant attention as a surrogate modeling strategy for partial differential equations (PDEs), particularly in regimes where labeled data are scarce and physical constraints can be leveraged to regularize the learning process. In practice, however, PINNs are frequently trained using experimental or numerical data that are not fully consistent with the governing equations due to measurement noise, discretization errors, or modeling assumptions. The implications of such data-to-PDE inconsistencies on the accuracy and convergence of PINNs remain insufficiently understood. In this work, we systematically analyze how data inconsistency fundamentally limits the attainable accuracy of PINNs. We introduce the concept of a consistency barrier, defined as an intrinsic lower bound on the error that arises from mismatches between the fidelity of the data and the exact enforcement of the PDE residual. To isolate and quantify this effect, we consider the 1D viscous Burgers equation with a manufactured analytical solution, which enables full control over data fidelity and residual errors. PINNs are trained using datasets of progressively increasing numerical accuracy, as well as perfectly consistent analytical data. Results show that while the inclusion of the PDE residual allows PINNs to partially mitigate low-fidelity data and recover the dominant physical structure, the training process ultimately saturates at an error level dictated by the data inconsistency. When high-fidelity numerical data are employed, PINN solutions become indistinguishable from those trained on analytical data, indicating that the consistency barrier is effectively removed. These findings clarify the interplay between data quality and physics enforcement in PINNs providing practical guidance for the construction and interpretation of physics-informed surrogate models.

[LG-42] dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning

链接: https://arxiv.org/abs/2602.10603
作者: Arnav Shah,Junzhe Li,Parsa Idehpour,Adibvafa Fallahpour,Brandon Wang,Sukjun Hwang,Bo Wang,Patrick D. Hsu,Hani Goodarzi,Albert Gu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental tradeoff in their input representation. Standard fixed-vocabulary tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end-to-end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling 3 \times inference speedup over Transformers. On zero-shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structures without supervision. These results establish dnaHNet as a scalable, interpretable framework for next-generation genomic modeling.

[LG-43] Learning Mixture Density via Natural Gradient Expectation Maximization

链接: https://arxiv.org/abs/2602.10602
作者: Yutao Chen,Jasmine Bayrooti,Steven Morad
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture density networks are neural networks that produce Gaussian mixtures to represent continuous multimodal conditional densities. Standard training procedures involve maximum likelihood estimation using the negative log-likelihood (NLL) objective, which suffers from slow convergence and mode collapse. In this work, we improve the optimization of mixture density networks by integrating their information geometry. Specifically, we interpret mixture density networks as deep latent-variable models and analyze them through an expectation maximization framework, which reveals surprising theoretical connections to natural gradient descent. We then exploit such connections to derive the natural gradient expectation maximization (nGEM) objective. We show that empirically nGEM achieves up to 10 \times faster convergence while adding almost zerocomputational overhead, and scales well to high-dimensional data where NLL otherwise fails.

[LG-44] Roughness-Informed Federated Learning

链接: https://arxiv.org/abs/2602.10595
作者: Mohammad Partohaghighi,Roummel Marcia,Bruce J. West,YangQuan Chen
类目: Machine Learning (cs.LG)
*备注: This manuscript is under review in IEEE TPAMI journal

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy, yet faces challenges in non-independent and identically distributed (non-IID) settings due to client drift, which impairs convergence. We propose RI-FedAvg, a novel FL algorithm that mitigates client drift by incorporating a Roughness Index (RI)-based regularization term into the local objective, adaptively penalizing updates based on the fluctuations of local loss landscapes. This paper introduces RI-FedAvg, leveraging the RI to quantify the roughness of high-dimensional loss functions, ensuring robust optimization in heterogeneous settings. We provide a rigorous convergence analysis for non-convex objectives, establishing that RI-FedAvg converges to a stationary point under standard assumptions. Extensive experiments on MNIST, CIFAR-10, and CIFAR-100 demonstrate that RI-FedAvg outperforms state-of-the-art baselines, including FedAvg, FedProx, FedDyn, SCAFFOLD, and DP-FedAvg, achieving higher accuracy and faster convergence in non-IID scenarios. Our results highlight RI-FedAvg’s potential to enhance the robustness and efficiency of federated learning in practical, heterogeneous environments.

[LG-45] Flow-Enabled Generalization to Human Demonstrations in Few-Shot Imitation Learning ICRA2026

链接: https://arxiv.org/abs/2602.10594
作者: Runze Tang,Penny Sweetser
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to ICRA 2026

点击查看摘要

Abstract:Imitation Learning (IL) enables robots to learn complex skills from demonstrations without explicit task modeling, but it typically requires large amounts of demonstrations, creating significant collection costs. Prior work has investigated using flow as an intermediate representation to enable the use of human videos as a substitute, thereby reducing the amount of required robot demonstrations. However, most prior work has focused on the flow, either on the object or on specific points of the robot/hand, which cannot describe the motion of interaction. Meanwhile, relying on flow to achieve generalization to scenarios observed only in human videos remains limited, as flow alone cannot capture precise motion details. Furthermore, conditioning on scene observation to produce precise actions may cause the flow-conditioned policy to overfit to training tasks and weaken the generalization indicated by the flow. To address these gaps, we propose SFCrP, which includes a Scene Flow prediction model for Cross-embodiment learning (SFCr) and a Flow and Cropped point cloud conditioned Policy (FCrP). SFCr learns from both robot and human videos and predicts any point trajectories. FCrP follows the general flow motion and adjusts the action based on observations for precision tasks. Our method outperforms SOTA baselines across various real-world task settings, while also exhibiting strong spatial and instance generalization to scenarios seen only in human videos.

[LG-46] RACE: Theoretical Risk Attribution under Covariate-shift Effects

链接: https://arxiv.org/abs/2602.10588
作者: Hosein Anjidani,S. Yahya S. R. Tehrani,Mohammad Mahdi Mojahedian,Mohammad Hossein Yassaee
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:When a source-trained model Q is replaced by a model \tildeQ trained on shifted data, its performance on the source domain can change unpredictably. To address this, we study the two-model risk change, \Delta R := R_P(\tildeQ) - R_P(Q) , under covariate shift. We introduce TRACE (Theoretical Risk Attribution under Covariate-shift Effects), a framework that decomposes |\Delta R| into an interpretable upper bound. This decomposition disentangles the risk change into four actionable factors: two generalization gaps, a model change penalty, and a covariate shift penalty, transforming the bound into a powerful diagnostic tool for understanding why performance has changed. To make TRACE a fully computable diagnostic, we instantiate each term. The covariate shift penalty is estimated via a model sensitivity factor (from high-quantile input gradients) and a data-shift measure; we use feature-space Optimal Transport (OT) by default and provide a robust alternative using Maximum Mean Discrepancy (MMD). The model change penalty is controlled by the average output distance between the two models on the target sample. Generalization gaps are estimated on held-out data. We validate our framework in an idealized linear regression setting, showing the TRACE bound correctly captures the scaling of the true risk difference with the magnitude of the shift. Across synthetic and vision benchmarks, TRACE diagnostics are valid and maintain a strong monotonic relationship with the true performance degradation. Crucially, we derive a deployment gate score that correlates strongly with |\Delta R| and achieves high AUROC/AUPRC for gating decisions, enabling safe, label-efficient model replacement.

[LG-47] When Gradient Clipping Becomes a Control Mechanism for Differential Privacy in Deep Learning

链接: https://arxiv.org/abs/2602.10584
作者: Mohammad Partohaghighi,Roummel Marcia,Bruce J. West,YangQuan Chen
类目: Machine Learning (cs.LG)
*备注: This manuscript is under review in the Engineering Applications of Artificial Intelligence journal

点击查看摘要

Abstract:Privacy-preserving training on sensitive data commonly relies on differentially private stochastic optimization with gradient clipping and Gaussian noise. The clipping threshold is a critical control knob: if set too small, systematic over-clipping induces optimization bias; if too large, injected noise dominates updates and degrades accuracy. Existing adaptive clipping methods often depend on per-example gradient norm statistics, adding computational overhead and introducing sensitivity to datasets and architectures. We propose a control-driven clipping strategy that adapts the threshold using a lightweight, weight-only spectral diagnostic computed from model parameters. At periodic probe steps, the method analyzes a designated weight matrix via spectral decomposition and estimates a heavy-tailed spectral indicator associated with training stability. This indicator is smoothed over time and fed into a bounded feedback controller that updates the clipping threshold multiplicatively in the log domain. Because the controller uses only parameters produced during privacy-preserving training, the resulting threshold updates are post-processing and do not increase privacy loss beyond that of the underlying DP optimizer under standard composition accounting.

[LG-48] Gauss-Newton Unlearning for the LLM Era

链接: https://arxiv.org/abs/2602.10568
作者: Lev McKinney,Anvith Thudi,Juhan Bae,Tara Rezaei,Nicolas Papernot,Sheila A. McIlraith,Roger Grosse
类目: Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

Abstract:Standard large language model training can create models that produce outputs their trainer deems unacceptable in deployment. The probability of these outputs can be reduced using methods such as LLM unlearning. However, unlearning a set of data (called the forget set) can degrade model performance on other distributions where the trainer wants to retain the model’s behavior. To improve this trade-off, we demonstrate that using the forget set to compute only a few uphill Gauss-Newton steps provides a conceptually simple, state-of-the-art unlearning approach for LLMs. While Gauss-Newton steps adapt Newton’s method to non-linear models, it is non-trivial to efficiently and accurately compute such steps for LLMs. Hence, our approach crucially relies on parametric Hessian approximations such as Kronecker-Factored Approximate Curvature (K-FAC). We call this combined approach K-FADE (K-FAC for Distribution Erasure). Our evaluation on the WMDP and ToFU benchmarks demonstrates that K-FADE suppresses outputs from the forget set and approximates, in output space, the results of retraining without the forget set. Critically, our method does this while altering the outputs on the retain set less than previous methods. This is because K-FADE transforms a constraint on the model’s outputs across the entire retain set into a constraint on the model’s weights, allowing the algorithm to minimally change the model’s behavior on the retain set at each step. Moreover, the unlearning updates computed by K-FADE can be reapplied later if the model undergoes further training, allowing unlearning to be cheaply maintained.

[LG-49] Online Min-Max Optimization: From Individual Regrets to Cumulative Saddle Points

链接: https://arxiv.org/abs/2602.10565
作者: Abhijeet Vyas,Brian Bullins
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We propose and study an online version of min-max optimization based on cumulative saddle points under a variety of performance measures beyond convex-concave settings. After first observing the incompatibility of (static) Nash equilibrium (SNE-Reg _T ) with individual regrets even for strongly convex-strongly concave functions, we propose an alternate \emphstatic duality gap (SDual-Gap _T ) inspired by the online convex optimization (OCO) framework. We provide algorithms that, using a reduction to classic OCO problems, achieve bounds for SDual-Gap _T ~and a novel \emphdynamic saddle point regret (DSP-Reg _T ), which we suggest naturally represents a min-max version of the dynamic regret in OCO. We derive our bounds for SDual-Gap _T ~and DSP-Reg _T ~under strong convexity-strong concavity and a min-max notion of exponential concavity (min-max EC), and in addition we establish a class of functions satisfying min-max EC~that captures a two-player variant of the classic portfolio selection problem. Finally, for a dynamic notion of regret compatible with individual regrets, we derive bounds under a two-sided Polyak-Łojasiewicz (PL) condition.

[LG-50] Bridging the Compression-Precision Paradox: A Hybrid Architecture for Clinical EEG Report Generation with Guaranteed Measurement Accuracy

链接: https://arxiv.org/abs/2602.10544
作者: Wuyang Zhang,Zhen Luo,Chuqiao Gu,Jianming Ma,Yebo Cao,Wangming Yuan,Yinzhi Jin
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 7 pages

点击查看摘要

Abstract:Automated EEG monitoring requires clinician-level precision for seizure detection and reporting. Clinical EEG recordings exceed LLM context windows, requiring extreme compression (400:1+ ratios) that destroys fine-grained temporal precision. A 0.5 Hz error distinguishes absence epilepsy from Lennox-Gastaut syndrome. LLMs lack inherent time-series comprehension and rely on statistical associations from compressed representations. This dual limitation causes systems to hallucinate clinically incorrect measurement values. We separate measurement extraction from text generation. Our hybrid architecture computes exact clinical values via signal processing before compression, employs a cross-modal bridge for EEG-to-language translation, and uses parameter-efficient fine-tuning with constrained decoding around frozen slots. Multirate sampling maintains long-range context while preserving event-level precision. Evaluation on TUH and CHB-MIT datasets achieves 60% fewer false alarms, 50% faster detection, and sub-clinical measurement precision. This is the first system guaranteeing clinical measurement accuracy in automated EEG reports. Comments: 7 pages Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA) Cite as: arXiv:2602.10544 [cs.LG] (or arXiv:2602.10544v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.10544 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-51] Predictive-State Communication: Innovation Coding and Reconciliation under Delay

链接: https://arxiv.org/abs/2602.10542
作者: Ozgur Ercetin,Mohaned Chraiti
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Shannon theory models communication as the reliable transfer of symbol sequences, with performance governed by capacity and rate-distortion limits. When both endpoints possess strong predictors – as in modern large language models and related generative priors – literal symbol transport is no longer the only operational regime. We propose predictive-state communication (PSC), in which the transmitter and receiver maintain an explicit shared predictive state, and the physical channel is used primarily to convey innovations, i.e., corrective information that reconciles the receiver’s provisional trajectory with the transmitter’s realized trajectory. This viewpoint replaces entropy-rate accounting by cross-entropy accounting under model mismatch, and it introduces feasibility constraints that depend jointly on capacity, delay, and perceptual continuity requirements; the resulting operating set is typically a bounded perception-capacity band rather than a one-sided threshold. We outline the protocol and architectural implications (state identifiers, anchors, bounded rollback, and patch-based updates) and provide a stylized illustrative example to visualize the induced feasibility region and its dependence on predictive quality.

[LG-52] Solving PDEs in One Shot via Fourier Features with Exact Analytical Derivatives ICLR2026

链接: https://arxiv.org/abs/2602.10541
作者: Antonin Sulc
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 9 pages, 1 figure, ICLR 2026 Workshop on AI and Partial Differential Equations

点击查看摘要

Abstract:Recent random feature methods for solving partial differential equations (PDEs) reduce computational cost compared to physics-informed neural networks (PINNs) but still rely on iterative optimization or expensive derivative computation. We observe that sinusoidal random Fourier features possess a cyclic derivative structure: the derivative of any order of \sin(\mathbfW\cdot\mathbfx+b) is a single sinusoid with a monomial prefactor, computable in O(1) operations. Alternative activations such as \tanh , used in prior one-shot methods like PIELM, lack this property: their higher-order derivatives grow as O(2^n) terms, requiring automatic differentiation for operator assembly. We propose FastLSQ, which combines frozen random Fourier features with analytical operator assembly to solve linear PDEs via a single least-squares call, and extend it to nonlinear PDEs via Newton–Raphson iteration where each linearized step is a FastLSQ solve. On a benchmark of 17 PDEs spanning 1 to 6 dimensions, FastLSQ achieves relative L^2 errors of 10^-7 in 0.07,s on linear problems, three orders of magnitude more accurate and significantly faster than state-of-the-art iterative PINN solvers, and 10^-8 to 10^-9 on nonlinear problems via Newton iteration in under 9s.

[LG-53] What Makes Value Learning Efficient in Residual Reinforcement Learning?

链接: https://arxiv.org/abs/2602.10539
作者: Guozheng Ma,Lu Li,Haoyu Wang,Zixuan Liu,Pierre-Luc Bacon,Dacheng Tao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Residual reinforcement learning (RL) enables stable online refinement of expressive pretrained policies by freezing the base and learning only bounded corrections. However, value learning in residual RL poses unique challenges that remain poorly understood. In this work, we identify two key bottlenecks: cold start pathology, where the critic lacks knowledge of the value landscape around the base policy, and structural scale mismatch, where the residual contribution is dwarfed by the base action. Through systematic investigation, we uncover the mechanisms underlying these bottlenecks, revealing that simple yet principled solutions suffice: base-policy transitions serve as an essential value anchor for implicit warmup, and critic normalization effectively restores representation sensitivity for discerning value differences. Based on these insights, we propose DAWN (Data-Anchored Warmup and Normalization), a minimal approach targeting efficient value learning in residual RL. By addressing these bottlenecks, DAWN demonstrates substantial efficiency gains across diverse benchmarks, policy architectures, and observation modalities.

[LG-54] Prioritize the Process Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

链接: https://arxiv.org/abs/2602.10520
作者: Williams Jonathan,Tureci Esin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed - standard objectives such as Group Relative Policy Optimization (GRPO) only assign credit to the final latent state, creating a fundamental mismatch with the model’s internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework which distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.

[LG-55] Dont Eliminate Cut: Exponential Separations in LLM -Based Theorem Proving

链接: https://arxiv.org/abs/2602.10512
作者: Sho Sonoda,Shunta Akiyama,Yuya Uezato
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We develop a theoretical analysis of LLM-guided formal theorem proving in interactive proof assistants (e.g., Lean) by modeling tactic proposal as a stochastic policy in a finite-horizon deterministic MDP. To capture modern representation learning, we treat the state and action spaces as general compact metric spaces and assume Lipschitz policies. To explain the gap between worst-case hardness and empirical success, we introduce problem distributions generated by a reference policy q , including a latent-variable model in which proofs exhibit reusable cut/lemma/sketch structure represented by a proof DAG. Under a top- k search protocol and Tsybakov-type margin conditions, we derive lower bounds on finite-horizon success probability that decompose into search and learning terms, with learning controlled by sequential Rademacher/covering complexity. Our main separation result shows that when cut elimination expands a DAG of depth D into a cut-free tree of size \Omega(\Lambda^D) while the cut-aware hierarchical process has size O(\lambda^D) with \lambda\ll\Lambda , a flat (cut-free) learner provably requires exponentially more data than a cut-aware hierarchical learner. This provides a principled justification for subgoal decomposition in recent agentic theorem provers.

[LG-56] Enhancing Ride-Hailing Forecasting at DiDi with Multi-View Geospatial Representation Learning from the Web

链接: https://arxiv.org/abs/2602.10502
作者: Xixuan Hao,Guicheng Li,Daiqiang Wu,Xusen Guo,Yumeng Zhu,Zhichao Zou,Peng Zhen,Yao Yao,Yuxuan Liang
类目: Machine Learning (cs.LG)
*备注: Accepted by The Web Conference 2026

点击查看摘要

Abstract:The proliferation of ride-hailing services has fundamentally transformed urban mobility patterns, making accurate ride-hailing forecasting crucial for optimizing passenger experience and urban transportation efficiency. However, ride-hailing forecasting faces significant challenges due to geospatial heterogeneity and high susceptibility to external events. This paper proposes MVGR-Net(Multi-View Geospatial Representation Learning), a novel framework that addresses these challenges through a two-stage approach. In the pretraining stage, we learn comprehensive geospatial representations by integrating Points-of-Interest and temporal mobility patterns to capture regional characteristics from both semantic attribute and temporal mobility pattern views. The forecasting stage leverages these representations through a prompt-empowered framework that fine-tunes Large Language Models while incorporating external events. Extensive experiments on DiDi’s real-world datasets demonstrate the state-of-the-art performance.

[LG-57] Pricing Query Complexity of Multiplicative Revenue Approximation

链接: https://arxiv.org/abs/2602.10483
作者: Wei Tang,Yifan Wang,Mengxiao Zhang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the pricing query complexity of revenue maximization for a single buyer whose private valuation is drawn from an unknown distribution. In this setting, the seller must learn the optimal monopoly price by posting prices and observing only binary purchase decisions, rather than the realized valuations. Prior work has established tight query complexity bounds for learning a near-optimal price with additive error \varepsilon when the valuation distribution is supported on [0,1] . However, our understanding of how to learn a near-optimal price that achieves at least a (1-\varepsilon) fraction of the optimal revenue remains limited. In this paper, we study the pricing query complexity of the single-buyer revenue maximization problem under such multiplicative error guarantees in several settings. Observe that when pricing queries are the only source of information about the buyer’s distribution, no algorithm can achieve a non-trivial approximation, since the scale of the distribution cannot be learned from pricing queries alone. Motivated by this fundamental impossibility, we consider two natural and well-motivated models that provide “scale hints”: (i) a one-sample hint, in which the algorithm observes a single realized valuation before making pricing queries; and (ii) a value-range hint, in which the valuation support is known to lie within [1, H] . For each type of hint, we establish pricing query complexity guarantees that are tight up to polylogarithmic factors for several classes of distributions, including monotone hazard rate (MHR) distributions, regular distributions, and general distributions. Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG) Cite as: arXiv:2602.10483 [cs.GT] (or arXiv:2602.10483v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2602.10483 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-58] GPU-Fuzz: Finding Memory Errors in Deep Learning Frameworks

链接: https://arxiv.org/abs/2602.10478
作者: Zihao Li,Hongyi Lu,Yanan Guo,Zhenkai Zhang,Shuai Wang,Fengwei Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:GPU memory errors are a critical threat to deep learning (DL) frameworks, leading to crashes or even security issues. We introduce GPU-Fuzz, a fuzzer locating these issues efficiently by modeling operator parameters as formal constraints. GPU-Fuzz utilizes a constraint solver to generate test cases that systematically probe error-prone boundary conditions in GPU kernels. Applied to PyTorch, TensorFlow, and PaddlePaddle, we uncovered 13 unknown bugs, demonstrating the effectiveness of GPU-Fuzz in finding memory errors.

[LG-59] Online Generalized-mean Welfare Maximization: Achieving Near-Optimal Regret from Samples

链接: https://arxiv.org/abs/2602.10469
作者: Zongjun Yang,Rachitesh Kumar,Christian Kroer
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study online fair allocation of T sequentially arriving items among n agents with heterogeneous preferences, with the objective of maximizing generalized-mean welfare, defined as the p -mean of agents’ time-averaged utilities, with p\in (-\infty, 1) . We first consider the i.i.d. arrival model and show that the pure greedy algorithm – which myopically chooses the welfare-maximizing integral allocation – achieves \widetildeO(1/T) average regret. Importantly, in contrast to prior work, our algorithm does not require distributional knowledge and achieves the optimal regret rate using only the online samples. We then go beyond i.i.d. arrivals and investigate a nonstationary model with time-varying independent distributions. In the absence of additional data about the distributions, it is known that every online algorithm must suffer \Omega(1) average regret. We show that only a single historical sample from each distribution is sufficient to recover the optimal \widetildeO(1/T) average regret rate, even in the face of arbitrary non-stationarity. Our algorithms are based on the re-solving paradigm: they assume that the remaining items will be the ones seen historically in those periods and solve the resulting welfare-maximization problem to determine the decision in every period. Finally, we also account for distribution shifts that may distort the fidelity of historical samples and show that the performance of our re-solving algorithms is robust to such shifts. Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC) Cite as: arXiv:2602.10469 [cs.GT] (or arXiv:2602.10469v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2602.10469 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-60] Analyzing Fairness of Neural Network Prediction via Counterfactual Dataset Generation

链接: https://arxiv.org/abs/2602.10457
作者: Brian Hyeongseok Kim,Jacqueline L. Mitchell,Chao Wang
类目: Machine Learning (cs.LG)
*备注: Presented at NLDL 2026

点击查看摘要

Abstract:Interpreting the inference-time behavior of deep neural networks remains a challenging problem. Existing approaches to counterfactual explanation typically ask: What is the closest alternative input that would alter the model’s prediction in a desired way? In contrast, we explore counterfactual datasets. Rather than perturbing the input, our method efficiently finds the closest alternative training dataset, one that differs from the original dataset by changing a few labels. Training a new model on this altered dataset can then lead to a different prediction of a given test instance. This perspective provides a new way to assess fairness by directly analyzing the influence of label bias on training and inference. Our approach can be characterized as probing whether a given prediction depends on biased labels. Since exhaustively enumerating all possible alternate datasets is infeasible, we develop analysis techniques that trace how bias in the training data may propagate through the learning algorithm to the trained network. Our method heuristically ranks and modifies the labels of a bounded number of training examples to construct a counterfactual dataset, retrains the model, and checks whether its prediction on a chosen test case changes. We evaluate our approach on feedforward neural networks across over 1100 test cases from 7 widely-used fairness datasets. Results show that it modifies only a small subset of training labels, highlighting its ability to pinpoint the critical training examples that drive prediction changes. Finally, we demonstrate how our counterfactual datasets reveal connections between training examples and test cases, offering an interpretable way to probe dataset bias.

[LG-61] A Multimodal Conditional Mixture Model with Distribution-Level Physics Priors

链接: https://arxiv.org/abs/2602.10451
作者: Jinkyo Han,Bahador Bahmani
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Many scientific and engineering systems exhibit intrinsically multimodal behavior arising from latent regime switching and non-unique physical mechanisms. In such settings, learning the full conditional distribution of admissible outcomes in a physically consistent and interpretable manner remains a challenge. While recent advances in machine learning have enabled powerful multimodal generative modeling, their integration with physics-constrained scientific modeling remains nontrivial, particularly when physical structure must be preserved or data are limited. This work develops a physics-informed multimodal conditional modeling framework based on mixture density representations. Mixture density networks (MDNs) provide an explicit and interpretable parameterization of multimodal conditional distributions. Physical knowledge is embedded through component-specific regularization terms that penalize violations of governing equations or physical laws. This formulation naturally accommodates non-uniqueness and stochasticity while remaining computationally efficient and amenable to conditioning on contextual inputs. The proposed framework is evaluated across a range of scientific problems in which multimodality arises from intrinsic physical mechanisms rather than observational noise, including bifurcation phenomena in nonlinear dynamical systems, stochastic partial differential equations, and atomistic-scale shock dynamics. In addition, the proposed method is compared with a conditional flow matching (CFM) model, a representative state-of-the-art generative modeling approach, demonstrating that MDNs can achieve competitive performance while offering a simpler and more interpretable formulation.

[LG-62] QTALE: Quantization-Robust Token-Adaptive Layer Execution for LLM s

链接: https://arxiv.org/abs/2602.10431
作者: Kanghyun Noh,Jinheon Choi,Yulwha Kim
类目: Machine Learning (cs.LG)
*备注: 8 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Large language models (LLMs) demand substantial computational and memory resources, posing challenges for efficient deployment. Two complementary approaches have emerged to address these issues: token-adaptive layer execution, which reduces floating-point operations (FLOPs) by selectively bypassing layers, and quantization, which lowers memory footprint by reducing weight precision. However, naively integrating these techniques leads to additional accuracy degradation due to reduced redundancy in token-adaptive models. We propose QTALE (Quantization-Robust Token-Adaptive Layer Execution for LLMs), a novel framework that enables seamless integration of token-adaptive execution with quantization while preserving accuracy. Conventional token-adaptive methods reduce redundancy in two ways: (1) by limiting the diversity of training paths explored during fine-tuning, and (2) by lowering the number of parameters actively involved in inference. To overcome these limitations, QTALE introduces two key components: (1) a training strategy that ensures diverse execution paths are actively explored during fine-tuning, and (2) a post-training mechanism that allows flexible adjustment of the execution ratio at inference to reintroduce redundancy when needed. Experimental results show that QTALE enables seamless integration of token-adaptive layer execution with quantization, showing no noticeable accuracy difference, with the gap to quantization-only models kept below 0.5% on CommonsenseQA benchmarks. By combining token-adaptive execution for FLOPs reduction and quantization for memory savings, QTALE provides an effective solution for efficient LLM deployment.

[LG-63] Binary Flow Matching: Prediction-Loss Space Alignment for Robust Learning

链接: https://arxiv.org/abs/2602.10420
作者: Jiadong Hong,Lei Liu,Xinyu Bian,Wenjie Wang,Zhaoyang Zhang
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: 12 pages, 4 tables, 5 figures

点击查看摘要

Abstract:Flow matching has emerged as a powerful framework for generative modeling, with recent empirical successes highlighting the effectiveness of signal-space prediction ( x -prediction). In this work, we investigate the transfer of this paradigm to binary manifolds, a fundamental setting for generative modeling of discrete data. While x -prediction remains effective, we identify a latent structural mismatch that arises when it is coupled with velocity-based objectives ( v -loss), leading to a time-dependent singular weighting that amplifies gradient sensitivity to approximation errors. Motivated by this observation, we formalize prediction-loss alignment as a necessary condition for flow matching training. We prove that re-aligning the objective to the signal space ( x -loss) eliminates the singular weighting, yielding uniformly bounded gradients and enabling robust training under uniform timestep sampling without reliance on heuristic schedules. Finally, with alignment secured, we examine design choices specific to binary data, revealing a topology-dependent distinction between probabilistic objectives (e.g., cross-entropy) and geometric losses (e.g., mean squared error). Together, these results provide theoretical foundations and practical guidelines for robust flow matching on binary – and related discrete – domains, positioning signal-space alignment as a key principle for robust diffusion learning.

[LG-64] LightGTS-Cov: Covariate-Enhanced Time Series Forecasting

链接: https://arxiv.org/abs/2602.10412
作者: Yong Shang,Zhipeng Yao,Ning Jin,Xiangfei Qiu,Hui Zhang,Bin Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series foundation models are typically pre-trained on large, multi-source datasets; however, they often ignore exogenous covariates or incorporate them via simple concatenation with the target series, which limits their effectiveness in covariate-rich applications such as electricity price forecasting and renewable energy forecasting. We introduce LightGTS-Cov, a covariate-enhanced extension of LightGTS that preserves its lightweight, period-aware backbone while explicitly incorporating both past and future-known covariates. Built on a \sim 1M-parameter LightGTS backbone, LightGTS-Cov adds only a \sim 0.1M-parameter MLP plug-in that integrates time-aligned covariates into the target forecasts by residually refining the outputs of the decoding process. Across covariate-aware benchmarks on electricity price and energy generation datasets, LightGTS-Cov consistently outperforms LightGTS and achieves superior performance over other covariate-aware baselines under both settings, regardless of whether future-known covariates are provided. We further demonstrate its practical value in two real-world energy case applications: long-term photovoltaic power forecasting with future weather forecasts and day-ahead electricity price forecasting with weather and dispatch-plan covariates. Across both applications, LightGTS-Cov achieves strong forecasting accuracy and stable operational performance after deployment, validating its effectiveness in real-world industrial settings.

[LG-65] LUCID: Attention with Preconditioned Representations

链接: https://arxiv.org/abs/2602.10410
作者: Sai Surya Duvvuri,Nirmal Patel,Nilesh Gupta,Inderjit S. Dhillon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, thus allowing the query to focus on important keys among large number of keys accurately with same computational complexity as standard attention. Additionally, LUCID’s preconditioning-based approach to retrieval bypasses the need for low temperature and the learnability problems associated with it. We validate our approach by training ~1 billion parameter language models evaluated on up to 128K tokens. Our results demonstrate significant gains on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS and LongBench. For instance, LUCID achieves up to 18% improvement in BABILong and 14% improvement in RULER multi-needle performance compared to standard attention.

[LG-66] Experimental Demonstration of Online Learning-Based Concept Drift Adaptation for Failure Detection in Optical Networks

链接: https://arxiv.org/abs/2602.10401
作者: Yousuf Moiz Ali,Jaroslaw E. Prilepsky,João Pedro,Antonio Napoli,Sasipim Srivallapanondh,Sergei K. Turitsyn,Pedro Freire
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optics (physics.optics)
*备注: Accepted at Optical Fiber Communications Conference 2026 (OFC 2026)

点击查看摘要

Abstract:We present a novel online learning-based approach for concept drift adaptation in optical network failure detection, achieving up to a 70% improvement in performance over conventional static models while maintaining low latency.

[LG-67] nsor Methods: A Unified and Interpretable Approach for Material Design

链接: https://arxiv.org/abs/2602.10392
作者: Shaan Pakala,Aldair E. Gongora,Brian Giera,Evangelos E. Papalexakis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When designing new materials, it is often necessary to tailor the material design (with respect to its design parameters) to have some desired properties (e.g. Young’s modulus). As the set of design parameters grow, the search space grows exponentially, making the actual synthesis and evaluation of all material combinations virtually impossible. Even using traditional computational methods such as Finite Element Analysis becomes too computationally heavy to search the design space. Recent methods use machine learning (ML) surrogate models to more efficiently determine optimal material designs; unfortunately, these methods often (i) are notoriously difficult to interpret and (ii) under perform when the training data comes from a non-uniform sampling of the design space. We suggest the use of tensor completion methods as an all-in-one approach for interpretability and predictions. We observe classical tensor methods are able to compete with traditional ML in predictions, with the added benefit of their interpretable tensor factors (which are given completely for free, as a result of the prediction). In our experiments, we are able to rediscover physical phenomena via the tensor factors, indicating that our predictions are aligned with the true underlying physics of the problem. This also means these tensor factors could be used by experimentalists to identify potentially novel patterns, given we are able to rediscover existing ones. We also study the effects of both types of surrogate models when we encounter training data from a non-uniform sampling of the design space. We observe more specialized tensor methods that can give better generalization in these non-uniforms sampling scenarios. We find the best generalization comes from a tensor model, which is able to improve upon the baseline ML methods by up to 5% on aggregate R^2 , and halve the error in some out of distribution regions.

[LG-68] Colorful Talks with Graphs: Human-Interpretable Graph Encodings for Large Language Models

链接: https://arxiv.org/abs/2602.10386
作者: Angelo Zangari,Peyman Baghershahi,Sourav Medya
类目: Machine Learning (cs.LG)
*备注: 18 pages, 10 tables, 5 figures

点击查看摘要

Abstract:Graph problems are fundamentally challenging for large language models (LLMs). While LLMs excel at processing unstructured text, graph tasks require reasoning over explicit structure, permutation invariance, and computationally complex relationships, creating a mismatch with the representations of text-based models. Our work investigates how LLMs can be effectively applied to graph problems despite these barriers. We introduce a human-interpretable structural encoding strategy for graph-to-text translation that injects graph structure directly into natural language prompts. Our method involves computing a variant of Weisfeiler-Lehman (WL) similarity classes and maps them to human-like color tokens rather than numeric labels. The key insight is that semantically meaningful and human-interpretable cues may be more effectively processed by LLMs than opaque symbolic encoding. Experimental results on multiple algorithmic and predictive graph tasks show the considerable improvements by our method on both synthetic and real-world datasets. By capturing both local and global-range dependencies, our method enhances LLM performance especially on graph tasks that require reasoning over global graph structure.

[LG-69] Deep learning outperforms traditional machine learning methods in predicting childhood malnutrition: evidence from survey data

链接: https://arxiv.org/abs/2602.10381
作者: Deepak Bastola,Yang Li
类目: Machine Learning (cs.LG)
*备注: 21 pages, 10 figures

点击查看摘要

Abstract:Childhood malnutrition remains a major public health concern in Nepal and other low-resource settings, while conventional case-finding approaches are labor-intensive and frequently unavailable in remote areas. This study provides the first comprehensive assessment of machine learning and deep learning methodologies for identifying malnutrition among children under five years of age in Nepal. We systematically compared 16 algorithms spanning deep learning, gradient boosting, and traditional machine learning families, using data from the Nepal Multiple Indicator Cluster Survey (MICS) 2019. A composite malnutrition indicator was constructed by integrating stunting, wasting, and underweight status, and model performance was evaluated using ten metrics, with emphasis on F1-score and recall to account for substantial class imbalance and the high cost of failing to detect malnourished children. Among all models, TabNet demonstrated the best performance, likely attributable to its attention-based architecture, and outperformed both support vector machine and AdaBoost classifiers. A consensus feature importance analysis identified maternal education, household wealth index, and child age as the primary predictors of malnutrition, followed by geographic characteristics, vaccination status, and meal frequency. Collectively, these results demonstrate a scalable, survey-based screening framework for identifying children at elevated risk of malnutrition and for guiding targeted nutritional interventions. The proposed approach supports Nepal’s progress toward the Sustainable Development Goals and offers a transferable methodological template for similar low-resource settings globally.

[LG-70] Flash-SD-KDE: Accelerating SD-KDE with Tensor Cores

链接: https://arxiv.org/abs/2602.10378
作者: Elliot L. Epstein,Rajat Vadiraj Dwaraknath,John Winnicki
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 11 pages

点击查看摘要

Abstract:Score-debiased kernel density estimation (SD-KDE) achieves improved asymptotic convergence rates over classical KDE, but its use of an empirical score has made it significantly slower in practice. We show that by re-ordering the SD-KDE computation to expose matrix-multiplication structure, Tensor Cores can be used to accelerate the GPU implementation. On a 32k-sample 16-dimensional problem, our approach runs up to 47\times faster than a strong SD-KDE GPU baseline and 3,300\times faster than scikit-learn’s KDE. On a larger 1M-sample 16-dimensional task evaluated on 131k queries, Flash-SD-KDE completes in 2.3 s on a single GPU, making score-debiased density estimation practical at previously infeasible scales.

[LG-71] Simple LLM Baselines are Competitive for Model Diffing

链接: https://arxiv.org/abs/2602.10371
作者: Elias Kempf,Simon Schrodi,Bartosz Cywiński,Thomas Brox,Neel Nanda,Arthur Conmy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Standard LLM evaluations only test capabilities or dispositions that evaluators designed them for, missing unexpected differences such as behavioral shifts between model revisions or emergent misaligned tendencies. Model diffing addresses this limitation by automatically surfacing systematic behavioral differences. Recent approaches include LLM-based methods that generate natural language descriptions and sparse autoencoder (SAE)-based methods that identify interpretable features. However, no systematic comparison of these approaches exists nor are there established evaluation criteria. We address this gap by proposing evaluation metrics for key desiderata (generalization, interestingness, and abstraction level) and use these to compare existing methods. Our results show that an improved LLM-based baseline performs comparably to the SAE-based method while typically surfacing more abstract behavioral differences.

[LG-72] heoretical Analysis of Contrastive Learning under Imbalanced Data: From Training Dynamics to a Pruning Solution

链接: https://arxiv.org/abs/2602.10357
作者: Haixu Liao,Yating Zhou,Songyang Zhang,Meng Wang,Shuai Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrastive learning has emerged as a powerful framework for learning generalizable representations, yet its theoretical understanding remains limited, particularly under imbalanced data distributions that are prevalent in real-world applications. Such an imbalance can degrade representation quality and induce biased model behavior, yet a rigorous characterization of these effects is lacking. In this work, we develop a theoretical framework to analyze the training dynamics of contrastive learning with Transformer-based encoders under imbalanced data. Our results reveal that neuron weights evolve through three distinct stages of training, with different dynamics for majority features, minority features, and noise. We further show that minority features reduce representational capacity, increase the need for more complex architectures, and hinder the separation of ground-truth features from noise. Inspired by these neuron-level behaviors, we show that pruning restores performance degraded by imbalance and enhances feature separation, offering both conceptual insights and practical guidance. Major theoretical findings are validated through numerical experiments.

[LG-73] Identifying Evidence-Based Nudges in Biomedical Literature with Large Language Models

链接: https://arxiv.org/abs/2602.10345
作者: Jaydeep Chauhan,Mark Seidman,Pezhman Raeisian Parvari,Zhi Zheng,Zina Ben-Miled,Cristina Barboi,Andrew Gonzalez,Malaz Boustani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a scalable, AI-powered system that identifies and extracts evidence-based behavioral nudges from unstructured biomedical literature. Nudges are subtle, non-coercive interventions that influence behavior without limiting choice, showing strong impact on health outcomes like medication adherence. However, identifying these interventions from PubMed’s 8 million+ articles is a bottleneck. Our system uses a novel multi-stage pipeline: first, hybrid filtering (keywords, TF-IDF, cosine similarity, and a “nudge-term bonus”) reduces the corpus to about 81,000 candidates. Second, we use OpenScholar (quantized LLaMA 3.1 8B) to classify papers and extract structured fields like nudge type and target behavior in a single pass, validated against a JSON schema. We evaluated four configurations on a labeled test set (N=197). The best setup (Title/Abstract/Intro) achieved a 67.0% F1 score and 72.0% recall, ideal for discovery. A high-precision variant using self-consistency (7 randomized passes) achieved 100% precision with 12% recall, demonstrating a tunable trade-off for high-trust use cases. This system is being integrated into Agile Nudge+, a real-world platform, to ground LLM-generated interventions in peer-reviewed evidence. This work demonstrates interpretable, domain-specific retrieval pipelines for evidence synthesis and personalized healthcare. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2602.10345 [cs.LG] (or arXiv:2602.10345v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.10345 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-74] Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training

链接: https://arxiv.org/abs/2602.10314
作者: Jaeyeon Kim,Jonathan Geuter,David Alvarez-Melis,Sham Kakade,Sitan Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on non-causal tasks. However, this flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train–test mismatch between the random masks used in training and the highly structured masks induced by inference-time unmasking. In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns, thereby focusing optimization on inference-aligned masks and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by \approx 2.5\times and offers complementary advantages on top of common recipes like autoregressive initialization. We open-source our codebase at this https URL.

[LG-75] R2RAG -Flood: A reasoning -reinforced training-free retrieval augmentation generation framework for flood damage nowcasting

链接: https://arxiv.org/abs/2602.10312
作者: Lipai Huang,Kai Yin,Chia-Fu Liu,Ali Mostafavi
类目: Machine Learning (cs.LG)
*备注: 16 pages, 3 figures, 7 tables, submitted to CACAIE journal

点击查看摘要

Abstract:R2RAG-Flood is a reasoning-reinforced, training-free retrieval-augmented generation framework for post-storm property damage nowcasting. Building on an existing supervised tabular predictor, the framework constructs a reasoning-centric knowledge base composed of labeled tabular records, where each sample includes structured predictors, a compact natural language text-mode summary, and a model-generated reasoning trajectory. During inference, R2RAG-Flood issues context-augmented prompts that retrieve and condition on relevant reasoning trajectories from nearby geospatial neighbors and canonical class prototypes, enabling the large language model backbone to emulate and adapt prior reasoning rather than learn new task-specific parameters. Predictions follow a two-stage procedure that first determines property damage occurrence and then refines severity within a three-level Property Damage Extent categorization, with a conditional downgrade step to correct over-predicted severity. In a case study of Harris County, Texas at the 12-digit Hydrologic Unit Code scale, the supervised tabular baseline trained directly on structured predictors achieves 0.714 overall accuracy and 0.859 damage class accuracy for medium and high damage classes. Across seven large language model backbones, R2RAG-Flood attains 0.613 to 0.668 overall accuracy and 0.757 to 0.896 damage class accuracy, approaching the supervised baseline while additionally producing a structured rationale for each prediction. Using a severity-per-cost efficiency metric derived from API pricing and GPU instance costs, lightweight R2RAG-Flood variants demonstrate substantially higher efficiency than both the supervised tabular baseline and larger language models, while requiring no task-specific training or fine-tuning.

[LG-76] ICODEN: Ordinary Differential Equation Neural Networks for Interval-Censored Data

链接: https://arxiv.org/abs/2602.10303
作者: Haoling Wang,Lang Zeng,Tao Sun,Youngjoo Cho,Ying Ding
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Predicting time-to-event outcomes when event times are interval censored is challenging because the exact event time is unobserved. Many existing survival analysis approaches for interval-censored data rely on strong model assumptions or cannot handle high-dimensional predictors. We develop ICODEN, an ordinary differential equation-based neural network for interval-censored data that models the hazard function through deep neural networks and obtains the cumulative hazard by solving an ordinary differential equation. ICODEN does not require the proportional hazards assumption or a prespecified parametric form for the hazard function, thereby permitting flexible survival modeling. Across simulation settings with proportional or non-proportional hazards and both linear and nonlinear covariate effects, ICODEN consistently achieves satisfactory predictive accuracy and remains stable as the number of predictors increases. Applications to data from multiple phases of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and to two Age-Related Eye Disease Studies (AREDS and AREDS2) for age-related macular degeneration (AMD) demonstrate ICODEN’s robust prediction performance. In both applications, predicting time-to-AD or time-to-late AMD, ICODEN effectively uses hundreds to more than 1,000 SNPs and supports data-driven subgroup identification with differential progression risk profiles. These results establish ICODEN as a practical assumption-lean tool for prediction with interval-censored survival data in high-dimensional biomedical settings.

[LG-77] Configuration-to-Performance Scaling Law with Neural Ansatz

链接: https://arxiv.org/abs/2602.10300
作者: Huaqing Zhang,Kaiyue Wen,Tengyu Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Researchers build scaling laws to forecast the training performance of expensive large-scale runs with larger model size N and data size D. These laws assume that other training hyperparameters are optimally chosen, which can require significant effort and, in some cases, be impossible due to external hardware constraints. To improve predictability across a broader set of hyperparameters and enable simpler tuning at scale, we propose learning a \textitConfiguration-to-Performance Scaling Law (CPL): a mapping from the \textitfull training configuration to training performance. Because no simple functional form can express this mapping, we parameterize it with a large language model (LLM), and fit it with diverse open-source pretraining logs across multiple sources, yielding a \textitNeural Configuration-to-Performance Scaling Law (NCPL). NCPL accurately predicts how training configurations influence the final pretraining loss, achieving 20-40% lower prediction error than the configuration-agnostic Chinchilla law and generalizing to runs using up to 10 x more compute than any run in the training set. It further supports joint tuning of multiple hyperparameters with performance comparable to hyperparameter scaling law baselines. Finally, NCPL naturally and effectively extends to richer prediction targets such as loss-curve prediction.

[LG-78] What Does Preference Learning Recover from Pairwise Comparison Data?

链接: https://arxiv.org/abs/2602.10286
作者: Rattana Pukdee,Maria-Florina Balcan,Pradeep Ravikumar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences. A typical dataset consists of triplets (x, y^+, y^-) , where response y^+ is preferred over response y^- for context x . The Bradley–Terry (BT) model is the predominant approach, modeling preference probabilities as a function of latent score differences. Standard practice assumes data follows this model and learns the latent scores accordingly. However, real data may violate this assumption, and it remains unclear what BT learning recovers in such cases. Starting from triplet comparison data, we formalize the preference information it encodes through the conditional preference distribution (CPRD). We give precise conditions for when BT is appropriate for modeling the CPRD, and identify factors governing sample efficiency – namely, margin and connectivity. Together, these results offer a data-centric foundation for understanding what preference learning actually recovers.

[LG-79] Linear-LLM -SCM: Benchmarking LLM s for Coefficient Elicitation in Linear-Gaussian Causal Models

链接: https://arxiv.org/abs/2602.10282
作者: Kanta Yamaoka,Sumantrak Mukherjee,Thomas Gärtner,David Antony Selby,Stefan Konigorski,Eyke Hüllermeier,Viktor Bengs,Sebastian Josef Vollmer
类目: Machine Learning (cs.LG)
*备注: 16 pages, 4 figures, preprint

点击查看摘要

Abstract:Large language models (LLMs) have shown potential in identifying qualitative causal relations, but their ability to perform quantitative causal reasoning – estimating effect sizes that parametrize functional relationships – remains underexplored in continuous domains. We introduce Linear-LLM-SCM, a plug-and-play benchmarking framework for evaluating LLMs on linear Gaussian structural causal model (SCM) parametrization when the DAG is given. The framework decomposes a DAG into local parent-child sets and prompts an LLM to produce a regression-style structural equation per node, which is aggregated and compared against available ground-truth parameters. Our experiments show several challenges in such benchmarking tasks, namely, strong stochasticity in the results in some of the models and susceptibility to DAG misspecification via spurious edges in the continuous domains. Across models, we observe substantial variability in coefficient estimates for some settings and sensitivity to structural and semantic perturbations, highlighting current limitations of LLMs as quantitative causal parameterizers. We also open-sourced the benchmarking framework so that researchers can utilize their DAGs and any off-the-shelf LLMs plug-and-play for evaluation in their domains effortlessly.

[LG-80] Kernel-Based Learning of Chest X-ray Images for Predicting ICU Escalation among COVID-19 Patients

链接: https://arxiv.org/abs/2602.10261
作者: Qiyuan Shi,Jian Kang,Yi Li
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Kernel methods have been extensively utilized in machine learning for classification and prediction tasks due to their ability to capture complex non-linear data patterns. However, single kernel approaches are inherently limited, as they rely on a single type of kernel function (e.g., Gaussian kernel), which may be insufficient to fully represent the heterogeneity or multifaceted nature of real-world data. Multiple kernel learning (MKL) addresses these limitations by constructing composite kernels from simpler ones and integrating information from heterogeneous sources. Despite these advances, traditional MKL methods are primarily designed for continuous outcomes. We extend MKL to accommodate the outcome variable belonging to the exponential family, representing a broader variety of data types, and refer to our proposed method as generalized linear models with integrated multiple additive regression with kernels (GLIMARK). Empirically, we demonstrate that GLIMARK can effectively recover or approximate the true data-generating mechanism. We have applied it to a COVID-19 chest X-ray dataset, predicting binary outcomes of ICU escalation and extracting clinically meaningful features, underscoring the practical utility of this approach in real-world scenarios.

[LG-81] Modeling Programming Skills with Source Code Embeddings for Context-aware Exercise Recommendation

链接: https://arxiv.org/abs/2602.10249
作者: Carlos Eduardo P. Silva,João Pedro M. Sena,Julio C. S. Reis,André G. Santos,Lucas N. Ferreira
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, to be published in LAK26: 16th International Learning Analytics and Knowledge Conference (LAK 2026)

点击查看摘要

Abstract:In this paper, we propose a context-aware recommender system that models students’ programming skills using embeddings of the source code they submit throughout a course. These embeddings predict students’ skills across multiple programming topics, producing profiles that are matched to the skills required by unseen homework problems. To generate recommendations, we compute the cosine similarity between student profiles and problem skill vectors, ranking exercises according to their alignment with each student’s current abilities. We evaluated our approach using real data from students and exercises in an introductory programming course at our university. First, we assessed the effectiveness of our source code embeddings for predicting skills, comparing them with token-based and graph-based alternatives. Results showed that Jina embeddings outperformed TF-IDF, CodeBERT-cpp, and GraphCodeBERT across most skills. Additionally, we evaluated the system’s ability to recommend exercises aligned with weekly course content by analyzing student submissions collected over seven course offerings. Our approach consistently produced more suitable recommendations than baselines based on correctness or solution time, indicating that predicted programming skills provide a stronger signal for problem recommendation.

[LG-82] Risk-Equalized Differentially Private Synthetic Data: Protecting Outliers by Controlling Record-Level Influence

链接: https://arxiv.org/abs/2602.10232
作者: Amir Asiaee,Chao Yan,Zachary B. Abrams,Bradley A. Malin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When synthetic data is released, some individuals are harder to protect than others. A patient with a rare disease combination or a transaction with unusual characteristics stands out from the crowd. Differential privacy provides worst-case guarantees, but empirical attacks – particularly membership inference – succeed far more often against such outliers, especially under moderate privacy budgets and with auxiliary information. This paper introduces risk-equalized DP synthesis, a framework that prioritizes protection for high-risk records by reducing their influence on the learned generator. The mechanism operates in two stages: first, a small privacy budget estimates each record’s “outlierness”; second, a DP learning procedure weights each record inversely to its risk score. Under Gaussian mechanisms, a record’s privacy loss is proportional to its influence on the output – so deliberately shrinking outliers’ contributions yields tighter per-instance privacy bounds for precisely those records that need them most. We prove end-to-end DP guarantees via composition and derive closed-form per-record bounds for the synthesis stage (the scoring stage adds a uniform per-record term). Experiments on simulated data with controlled outlier injection show that risk-weighting substantially reduces membership inference success against high-outlierness records; ablations confirm that targeting – not random downweighting – drives the improvement. On real-world benchmarks (Breast Cancer, Adult, German Credit), gains are dataset-dependent, highlighting the interplay between scorer quality and synthesis pipeline. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2602.10232 [cs.LG] (or arXiv:2602.10232v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.10232 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-83] Frame-Level Internal Tool Use for Temporal Grounding in Audio LMs

链接: https://arxiv.org/abs/2602.10230
作者: Joesph An,Phillip Keung,Jiaqi Wang,Orevaoghene Ahia,Noah A. Smith
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Under review. See this https URL

点击查看摘要

Abstract:Large audio language models are increasingly used for complex audio understanding tasks, but they struggle with temporal tasks that require precise temporal grounding, such as word alignment and speaker diarization. The standard approach, where we generate timestamps as sequences of text tokens, is computationally expensive and prone to hallucination, especially when processing audio lengths outside the model’s training distribution. In this work, we propose frame-level internal tool use, a method that trains audio LMs to use their own internal audio representations to perform temporal grounding directly. We introduce a lightweight prediction mechanism trained via two objectives: a binary frame classifier and a novel inhomogeneous Poisson process (IHP) loss that models temporal event intensity. Across word localization, speaker diarization, and event localization tasks, our approach outperforms token-based baselines. Most notably, it achieves a 50x inference speedup and demonstrates robust length generalization, maintaining high accuracy on out-of-distribution audio durations where standard token-based models collapse completely.

[LG-84] PRISM: Differentially Private Synthetic Data with Structure-Aware Budget Allocation for Prediction

链接: https://arxiv.org/abs/2602.10228
作者: Amir Asiaee,Chao Yan,Zachary B. Abrams,Bradley A. Malin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Differential privacy (DP) provides a mathematical guarantee limiting what an adversary can learn about any individual from released data. However, achieving this protection typically requires adding noise, and noise can accumulate when many statistics are measured. Existing DP synthetic data methods treat all features symmetrically, spreading noise uniformly even when the data will serve a specific prediction task. We develop a prediction-centric approach operating in three regimes depending on available structural knowledge. In the causal regime, when the causal parents of Y are known and distribution shift is expected, we target the parents for robustness. In the graphical regime, when a Bayesian network structure is available and the distribution is stable, the Markov blanket of Y provides a sufficient feature set for optimal prediction. In the predictive regime, when no structural knowledge exists, we select features via differentially private methods without claiming to recover causal or graphical structure. We formalize this as PRISM, a mechanism that (i) identifies a predictive feature subset according to the appropriate regime, (ii) constructs targeted summary statistics, (iii) allocates budget to minimize an upper bound on prediction error, and (iv) synthesizes data via graphical-model inference. We prove end-to-end privacy guarantees and risk bounds. Empirically, task-aware allocation improves prediction accuracy compared to generic synthesizers. Under distribution shift, targeting causal parents achieves AUC \approx 0.73 while correlation-based selection collapses to chance ( \approx 0.49 ). Subjects: Machine Learning (cs.LG) Cite as: arXiv:2602.10228 [cs.LG] (or arXiv:2602.10228v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.10228 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-85] ACE-RTL: When Agent ic Context Evolution Meets RTL-Specialized LLM s

链接: https://arxiv.org/abs/2602.10218
作者: Chenhui Deng,Zhongzhi Yu,Guan-Ting Liu,Nathaniel Pinckney,Haoxing Ren
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have sparked growing interest in applying them to hardware design automation, particularly for accurate RTL code generation. Prior efforts follow two largely independent paths: (i) training domain-adapted RTL models to internalize hardware semantics, (ii) developing agentic systems that leverage frontier generic LLMs guided by simulation feedback. However, these two paths exhibit complementary strengths and weaknesses. In this work, we present ACE-RTL that unifies both directions through Agentic Context Evolution (ACE). ACE-RTL integrates an RTL-specialized LLM, trained on a large-scale dataset of 1.7 million RTL samples, with a frontier reasoning LLM through three synergistic components: the generator, reflector, and coordinator. These components iteratively refine RTL code toward functional correctness. We further introduce a parallel scaling strategy that significantly reduces the number of iterations required to reach correct solutions. On the Comprehensive Verilog Design Problems (CVDP) benchmark, ACE-RTL achieves up to a 44.87% pass rate improvement over 14 competitive baselines while requiring only four iterations on average.

[LG-86] mper-Then-Tilt: Principled Unlearning for Generative Models through Tempering and Classifier Guidance

链接: https://arxiv.org/abs/2602.10217
作者: Jacob L. Block,Mehryar Mohri,Aryan Mokhtari,Sanjay Shakkottai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study machine unlearning in large generative models by framing the task as density ratio estimation to a target distribution rather than supervised fine-tuning. While classifier guidance is a standard approach for approximating this ratio and can succeed in general, we show it can fail to faithfully unlearn with finite samples when the forget set represents a sharp, concentrated data distribution. To address this, we introduce Temper-Then-Tilt Unlearning (T3-Unlearning), which freezes the base model and applies a two-step inference procedure: (i) tempering the base distribution to flatten high-confidence spikes, and (ii) tilting the tempered distribution using a lightweight classifier trained to distinguish retain from forget samples. Our theoretical analysis provides finite-sample guarantees linking the surrogate classifier’s risk to unlearning error, proving that tempering is necessary to successfully unlearn for concentrated distributions. Empirical evaluations on the TOFU benchmark show that T3-Unlearning improves forget quality and generative utility over existing baselines, while training only a fraction of the parameters with a minimal runtime.

[LG-87] ELROND: Exploring and decomposing intrinsic capabilities of diffusion models

链接: https://arxiv.org/abs/2602.10216
作者: Paweł Skierś,Tomasz Trzciński,Kamil Deja
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A single text prompt passed to a diffusion model often yields a wide range of visual outputs determined solely by stochastic process, leaving users with no direct control over which specific semantic variations appear in the image. While existing unsupervised methods attempt to analyze these variations via output features, they omit the underlying generative process. In this work, we propose a framework to disentangle these semantic directions directly within the input embedding space. To that end, we collect a set of gradients obtained by backpropagating the differences between stochastic realizations of a fixed prompt that we later decompose into meaningful steering directions with either Principal Components Analysis or Sparse Autoencoder. Our approach yields three key contributions: (1) it isolates interpretable, steerable directions for precise, fine-grained control over a single concept; (2) it effectively mitigates mode collapse in distilled models by reintroducing lost diversity; and (3) it establishes a novel estimator for concept complexity under a specific model, based on the dimensionality of the discovered subspace.

[LG-88] Rank-Accuracy Trade-off for LoRA: A Gradient-Flow Analysis

链接: https://arxiv.org/abs/2602.10212
作者: Michael Rushka,Diego Klabjan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Previous empirical studies have shown that LoRA achieves accuracy comparable to full-parameter methods on downstream fine-tuning tasks, even for rank-1 updates. By contrast, the theoretical underpinnings of the dependence of LoRA’s accuracy on update rank remain relatively unexplored. In this work, we compare the accuracy of rank-r LoRA updates against full-parameter updates for fine-tuning tasks from a dynamical systems perspective. We perform gradient flow analysis in both full-rank and low-rank regimes to establish explicit relationships between rank and accuracy for two loss functions under LoRA. While gradient flow equations for LoRA are presented in prior work, we rigorously derive their form and show that they are identical for simultaneous and sequential LoRA parameter updates. We then use the resulting dynamical system equations to obtain closed-form relationships between LoRA rank and accuracy for trace-squared and Frobenius-norm low-rank approximation loss functions.

[LG-89] How Much Reasoning Do Retrieval-Augmented Models Add beyond LLM s? A Benchmarking Framework for Multi-Hop Inference over Hybrid Knowledge

链接: https://arxiv.org/abs/2602.10210
作者: Junhong Lin,Bing Zhang,Song Wang,Ziyan Liu,Dan Gutfreund,Julian Shun,Yada Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) continue to struggle with knowledge-intensive questions that require up-to-date information and multi-hop reasoning. Augmenting LLMs with hybrid external knowledge, such as unstructured text and structured knowledge graphs, offers a promising alternative to costly continual pretraining. As such, reliable evaluation of their retrieval and reasoning capabilities becomes critical. However, many existing benchmarks increasingly overlap with LLM pretraining data, which means answers or supporting knowledge may already be encoded in model parameters, making it difficult to distinguish genuine retrieval and reasoning from parametric recall. We introduce HybridRAG-Bench, a framework for constructing benchmarks to evaluate retrieval-intensive, multi-hop reasoning over hybrid knowledge. HybridRAG-Bench automatically couples unstructured text and structured knowledge graph representations derived from recent scientific literature on arXiv, and generates knowledge-intensive question-answer pairs grounded in explicit reasoning paths. The framework supports flexible domain and time-frame selection, enabling contamination-aware and customizable evaluation as models and knowledge evolve. Experiments across three domains (artificial intelligence, governance and policy, and bioinformatics) demonstrate that HybridRAG-Bench rewards genuine retrieval and reasoning rather than parametric recall, offering a principled testbed for evaluating hybrid knowledge-augmented reasoning systems. We release our code and data at this http URL.

[LG-90] Neural Network Quantum Field Theory from Transformer Architectures

链接: https://arxiv.org/abs/2602.10209
作者: Dmitry S. Ageev,Yulia A. Ageeva
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics - Theory (hep-th)
*备注: 14 pages; comments are welcome

点击查看摘要

Abstract:We propose a neural-network construction of Euclidean scalar quantum field theories from transformer attention heads, defining n -point correlators by averaging over random network parameters in the NN-QFT framework. For a single attention head, shared random softmax weights couple different width coordinates and induce non-Gaussian field statistics that persist in the infinite-width limit d_k\to\infty . We compute the two-point function in an attention-weight representation and show how Euclidean-invariant kernels can be engineered via random-feature token embeddings. We then analyze the connected four-point function and identify an “independence-breaking” contribution, expressible as a covariance over query-key weights, which remains finite at infinite width. Finally, we show that summing many independent heads with standard 1/N_h normalization suppresses connected non-Gaussian correlators as 1/N_h , yielding a Gaussian NN-QFT in the large-head limit.

[LG-91] Adaptive Optimization via Momentum on Variance-Normalized Gradients

链接: https://arxiv.org/abs/2602.10204
作者: Francisco Patitucci,Aryan Mokhtari
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 28 pages

点击查看摘要

Abstract:We introduce MVN-Grad (Momentum on Variance-Normalized Gradients), an Adam-style optimizer that improves stability and performance by combining two complementary ideas: variance-based normalization and momentum applied after normalization. MVN-Grad scales each coordinate by an exponential moving average of gradient uncertainty and applies momentum to the resulting normalized gradients, eliminating the cross-time coupling between stale momentum and a stochastic normalizer present in standard Adam-type updates. We prove that this decoupling yields strictly smaller one-step conditional update variance than momentum-then-normalize variance methods under standard noise assumptions, and that MVN-Grad is robust to outliers: it has a uniformly bounded response to single gradient spikes. In low-variance regimes, we further show variance normalization avoids sign-type collapse associated with second-moment scaling and can yield accelerated convergence. Across CIFAR-100 image classification and GPT-style language modeling benchmarks, MVN-Grad matches or outperforms Adam, AdaBelief, and LaProp, delivering smoother training and improved generalization with no added overhead. Comments: 28 pages Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC) Cite as: arXiv:2602.10204 [cs.LG] (or arXiv:2602.10204v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.10204 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-92] Signature-Kernel Based Evaluation Metrics for Robust Probabilistic and Tail-Event Forecasting

链接: https://arxiv.org/abs/2602.10182
作者: Benjamin R. Redhead,Thomas L. Lee,Peng Gu,Víctor Elvira,Amos Storkey
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Main Paper: 8 pages 3 figures Including Appendix and References: 19 pages 7 figures

点击查看摘要

Abstract:Probabilistic forecasting is increasingly critical across high-stakes domains, from finance and epidemiology to climate science. However, current evaluation frameworks lack a consensus metric and suffer from two critical flaws: they often assume independence across time steps or variables, and they demonstrably lack sensitivity to tail events, the very occurrences that are most pivotal in real-world decision-making. To address these limitations, we propose two kernel-based metrics: the signature maximum mean discrepancy (Sig-MMD) and our novel censored Sig-MMD (CSig-MMD). By leveraging the signature kernel, these metrics capture complex inter-variate and inter-temporal dependencies and remain robust to missing data. Furthermore, CSig-MMD introduces a censoring scheme that prioritizes a forecaster’s capability to predict tail events while strictly maintaining properness, a vital property for a good scoring rule. These metrics enable a more reliable evaluation of direct multi-step forecasting, facilitating the development of more robust probabilistic algorithms.

[LG-93] Basic Legibility Protocols Improve Trusted Monitoring

链接: https://arxiv.org/abs/2602.10153
作者: Ashwin Sreevatsa,Sebastian Prasanna,Cody Rushing
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:The AI Control research agenda aims to develop control protocols: safety techniques that prevent untrusted AI systems from taking harmful actions during deployment. Because human oversight is expensive, one approach is trusted monitoring, where weaker, trusted models oversee stronger, untrusted models \unicodex2013 but this often fails when the untrusted model’s actions exceed the monitor’s comprehension. We introduce legibility protocols, which encourage the untrusted model to take actions that are easier for a monitor to evaluate. We perform control evaluations in the APPS coding setting, where an adversarial agent attempts to write backdoored code without detection. We study legibility protocols that allow the untrusted model to thoroughly document its code with comments \unicodex2013 in contrast to prior work, which removed comments to prevent deceptive ones. We find that: (i) commenting protocols improve safety without sacrificing task performance relative to comment-removal baselines; (ii) commenting disproportionately benefits honest code, which typically has a natural explanation that resolves monitor suspicion, whereas backdoored code frequently lacks an easy justification; (iii) gains from commenting increase with monitor strength, as stronger monitors better distinguish genuine justifications from only superficially plausible ones. Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE) Cite as: arXiv:2602.10153 [cs.CR] (or arXiv:2602.10153v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2602.10153 Focus to learn more arXiv-issued DOI via DataCite

[LG-94] Renet: Principled and Efficient Relaxation for the Elastic Net via Dynamic Objective Selection

链接: https://arxiv.org/abs/2602.11107
作者: Albert Dorador
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce Renet, a principled generalization of the Relaxed Lasso to the Elastic Net family of estimators. While, on the one hand, \ell_1 -regularization is a standard tool for variable selection in high-dimensional regimes and, on the other hand, the \ell_2 penalty provides stability and solution uniqueness through strict convexity, the standard Elastic Net nevertheless suffers from shrinkage bias that frequently yields suboptimal prediction accuracy. We propose to address this limitation through a framework called \textitrelaxation. Existing relaxation implementations rely on naive linear interpolations of penalized and unpenalized solutions, which ignore the non-linear geometry that characterizes the entire regularization path and risk violating the Karush-Kuhn-Tucker conditions. Renet addresses these limitations by enforcing sign consistency through an adaptive relaxation procedure that dynamically dispatches between convex blending and efficient sub-path refitting. Furthermore, we identify and formalize a unique synergy between relaxation and the ``One-Standard-Error’’ rule: relaxation serves as a robust debiasing mechanism, allowing practitioners to leverage the parsimony of the 1-SE rule without the traditional loss in predictive fidelity. Our theoretical framework incorporates automated stability safeguards for ultra-high dimensional regimes and is supported by a comprehensive benchmarking suite across 20 synthetic and real-world datasets, demonstrating that Renet consistently outperforms the standard Elastic Net and provides a more robust alternative to the Adaptive Elastic Net in high-dimensional, low signal-to-noise ratio and high-multicollinearity regimes. By leveraging an adaptive solver backend, Renet delivers these statistical gains while offering a computational profile that remains competitive with state-of-the-art coordinate descent implementations.

[LG-95] A Gibbs posterior sampler for inverse problem based on prior diffusion model

链接: https://arxiv.org/abs/2602.11059
作者: Jean-François Giovannelli
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This paper addresses the issue of inversion in cases where (1) the observation system is modeled by a linear transformation and additive noise, (2) the problem is ill-posed and regularization is introduced in a Bayesian framework by an a prior density, and (3) the latter is modeled by a diffusion process adjusted on an available large set of examples. In this context, it is known that the issue of posterior sampling is a thorny one. This paper introduces a Gibbs algorithm. It appears that this avenue has not been explored, and we show that this approach is particularly effective and remarkably simple. In addition, it offers a guarantee of convergence in a clearly identified situation. The results are clearly confirmed by numerical simulations.

[LG-96] Characterizing Trainability of Instantaneous Quantum Polynomial Circuit Born Machines

链接: https://arxiv.org/abs/2602.11042
作者: Kevin Shen,Susanne Pielawa,Vedran Dunjko,Hao Wang
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 14 pages, 1 figure

点击查看摘要

Abstract:Instantaneous quantum polynomial quantum circuit Born machines (IQP-QCBMs) have been proposed as quantum generative models with a classically tractable training objective based on the maximum mean discrepancy (MMD) and a potential quantum advantage motivated by sampling-complexity arguments, making them an exciting model worth deeper investigation. While recent works have further proven the universality of a (slightly generalized) model, the next immediate question pertains to its trainability, i.e., whether it suffers from the exponentially vanishing loss gradients, known as the barren plateau issue, preventing effective use, and how regimes of trainability overlap with regimes of possible quantum advantage. Here, we provide significant strides in these directions. To study the trainability at initialization, we analytically derive closed-form expressions for the variances of the partial derivatives of the MMD loss function and provide general upper and lower bounds. With uniform initialization, we show that barren plateaus depend on the generator set and the spectrum of the chosen kernel. We identify regimes in which low-weight-biased kernels avoid exponential gradient suppression in structured topologies. Also, we prove that a small-variance Gaussian initialization ensures polynomial scaling for the gradient under mild conditions. As for the potential quantum advantage, we further argue, based on previous complexity-theoretic arguments, that sparse IQP families can output a probability distribution family that is classically intractable, and that this distribution remains trainable at initialization at least at lower-weight frequencies.

[LG-97] Variational Optimality of Föllm er Processes in Generative Diffusions

链接: https://arxiv.org/abs/2602.10989
作者: Yifan Chen,Eric Vanden-Eijnden
类目: atistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We construct and analyze generative diffusions that transport a point mass to a prescribed target distribution over a finite time horizon using the stochastic interpolant framework. The drift is expressed as a conditional expectation that can be estimated from independent samples without simulating stochastic processes. We show that the diffusion coefficient can be tuned \empha~posteriori without changing the time-marginal distributions. Among all such tunings, we prove that minimizing the impact of estimation error on the path-space Kullback–Leibler divergence selects, in closed form, a Föllmer process – a diffusion whose path measure minimizes relative entropy with respect to a reference process determined by the interpolation schedules alone. This yields a new variational characterization of Föllmer processes, complementing classical formulations via Schrödinger bridges and stochastic control. We further establish that, under this optimal diffusion coefficient, the path-space Kullback–Leibler divergence becomes independent of the interpolation schedule, rendering different schedules statistically equivalent in this variational sense.

[LG-98] Optimal Initialization in Depth: Lyapunov Initialization and Limit Theorems for Deep Leaky ReLU Networks

链接: https://arxiv.org/abs/2602.10949
作者: Constantin Kogler,Tassilo Schwarz,Samuel Kittle
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Probability (math.PR)
*备注: 45 pages

点击查看摘要

Abstract:The development of effective initialization methods requires an understanding of random neural networks. In this work, a rigorous probabilistic analysis of deep unbiased Leaky ReLU networks is provided. We prove a Law of Large Numbers and a Central Limit Theorem for the logarithm of the norm of network activations, establishing that, as the number of layers increases, their growth is governed by a parameter called the Lyapunov exponent. This parameter characterizes a sharp phase transition between vanishing and exploding activations, and we calculate the Lyapunov exponent explicitly for Gaussian or orthogonal weight matrices. Our results reveal that standard methods, such as He initialization or orthogonal initialization, do not guarantee activation stabilty for deep networks of low width. Based on these theoretical insights, we propose a novel initialization method, referred to as Lyapunov initialization, which sets the Lyapunov exponent to zero and thereby ensures that the neural network is as stable as possible, leading empirically to improved learning.

[LG-99] Deep Learning of Compositional Targets with Hierarchical Spectral Methods

链接: https://arxiv.org/abs/2602.10867
作者: Hugo Tabanelli,Yatin Dandi,Luca Pesce,Florent Krzakala
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Why depth yields a genuine computational advantage over shallow methods remains a central open question in learning theory. We study this question in a controlled high-dimensional Gaussian setting, focusing on compositional target functions. We analyze their learnability using an explicit three-layer fitting model trained via layer-wise spectral estimators. Although the target is globally a high-degree polynomial, its compositional structure allows learning to proceed in stages: an intermediate representation reveals structure that is inaccessible at the input level. This reduces learning to simpler spectral estimation problems, well studied in the context of multi-index models, whereas any shallow estimator must resolve all components simultaneously. Our analysis relies on Gaussian universality, leading to sharp separations in sample complexity between two and three-layer learning strategies.

[LG-100] Self-Supervised Learning for Speaker Recognition: A study and review

链接: https://arxiv.org/abs/2602.10829
作者: Theo Lepage,Reda Dehak
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: accepted for publication in Speech Communication

点击查看摘要

Abstract:Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these frameworks, are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of SSL components is studied (e.g., data-augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data, using a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Specifically, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements, identifying current challenges in the field.

[LG-101] Bayesian Signal Component Decomposition via Diffusion-within-Gibbs Sampling

链接: https://arxiv.org/abs/2602.10792
作者: Yi Zhang,Rui Guo,Yonina C. Eldar
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 13 pages, 2 figures. Submitted to journal

点击查看摘要

Abstract:In signal processing, the data collected from sensing devices is often a noisy linear superposition of multiple components, and the estimation of components of interest constitutes a crucial pre-processing step. In this work, we develop a Bayesian framework for signal component decomposition, which combines Gibbs sampling with plug-and-play (PnP) diffusion priors to draw component samples from the posterior distribution. Unlike many existing methods, our framework supports incorporating model-driven and data-driven prior knowledge into the diffusion prior in a unified manner. Moreover, the proposed posterior sampler allows component priors to be learned separately and flexibly combined without retraining. Under suitable assumptions, the proposed DiG sampler provably produces samples from the posterior distribution. We also show that DiG can be interpreted as an extension of a class of recently proposed diffusion-based samplers, and that, for suitable classes of sensing operators, DiG better exploits the structure of the measurement model. Numerical experiments demonstrate the superior performance of our method over existing approaches.

[LG-102] Robust Assortment Optimization from Observational Data

链接: https://arxiv.org/abs/2602.10696
作者: Miao Lu,Yuxuan Han,Han Zhong,Zhengyuan Zhou,Jose Blanchet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注: 65 pages, 9 figures

点击查看摘要

Abstract:Assortment optimization is a fundamental challenge in modern retail and recommendation systems, where the goal is to select a subset of products that maximizes expected revenue under complex customer choice behaviors. While recent advances in data-driven methods have leveraged historical data to learn and optimize assortments, these approaches typically rely on strong assumptions – namely, the stability of customer preferences and the correctness of the underlying choice models. However, such assumptions frequently break in real-world scenarios due to preference shifts and model misspecification, leading to poor generalization and revenue loss. Motivated by this limitation, we propose a robust framework for data-driven assortment optimization that accounts for potential distributional shifts in customer choice behavior. Our approach models potential preference shift from a nominal choice model that generates data and seeks to maximize worst-case expected revenue. We first establish the computational tractability of robust assortment planning when the nominal model is known, then advance to the data-driven setting, where we design statistically optimal algorithms that minimize the data requirements while maintaining robustness. Our theoretical analysis provides both upper bounds and matching lower bounds on the sample complexity, offering theoretical guarantees for robust generalization. Notably, we uncover and identify the notion of ``robust item-wise coverage’’ as the minimal data requirement to enable sample-efficient robust assortment learning. Our work bridges the gap between robustness and statistical efficiency in assortment learning, contributing new insights and tools for reliable assortment optimization under uncertainty.

[LG-103] Convergence Rates for Distribution Matching with Sliced Optimal Transport

链接: https://arxiv.org/abs/2602.10691
作者: Gauthier Thurin(ENS-PSL),Claire Boyer(LMO, IUF),Kimia Nadjahi(ENS-PSL)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the slice-matching scheme, an efficient iterative method for distribution matching based on sliced optimal transport. We investigate convergence to the target distribution and derive quantitative non-asymptotic rates. To this end, we establish __ojasiewicz-type inequalities for the Sliced-Wasserstein objective. A key challenge is to control along the trajectory the constants in these inequalities. We show that this becomes tractable for Gaussian distributions. Specifically, eigenvalues are controlled when matching along random orthonormal bases at each iteration. We complement our theory with numerical experiments and illustrate the predicted dependence on dimension and step-size, as well as the stabilizing effect of orthonormal-basis sampling.

[LG-104] A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization

链接: https://arxiv.org/abs/2602.10680
作者: Vicente Conde Mendes,Lorenzo Bardone,Cédric Koller,Jorge Medina Moreira,Vittorio Erba,Emanuele Troiani,Lenka Zdeborová
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many real-world datasets contain hidden structure that cannot be detected by simple linear correlations between input features. For example, latent factors may influence the data in a coordinated way, even though their effect is invisible to covariance-based methods such as PCA. In practice, nonlinear neural networks often succeed in extracting such hidden structure in unsupervised and self-supervised learning. However, constructing a minimal high-dimensional model where this advantage can be rigorously analyzed has remained an open theoretical challenge. We introduce a tractable high-dimensional spiked model with two latent factors: one visible to covariance, and one statistically dependent yet uncorrelated, appearing only in higher-order moments. PCA and linear autoencoders fail to recover the latter, while a minimal nonlinear autoencoder provably extracts both. We analyze both the population risk, and empirical risk minimization. Our model also provides a tractable example where self-supervised test loss is poorly aligned with representation quality: nonlinear autoencoders recover latent structure that linear methods miss, even though their reconstruction loss is higher.

[LG-105] From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks ICASSP

链接: https://arxiv.org/abs/2602.10666
作者: Riccardo Miccini,Clément Laroche,Tobias Piechowiak,Xenofon Fafoutis,Luca Pezzarossa
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted for publication at the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

点击查看摘要

Abstract:Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust context-aware behavior and seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an R2 of 0.86 on F0 estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose and re-propose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.

[LG-106] Beyond Kemeny Medians: Consensus Ranking Distributions Definition Properties and Statistical Learning

链接: https://arxiv.org/abs/2602.10640
作者: Stephan Clémençon,Ekhine Irurozki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this article we develop a new method for summarizing a ranking distribution, \textiti.e. a probability distribution on the symmetric group \mathfrakS_n , beyond the classical theory of consensus and Kemeny medians. Based on the notion of \textitlocal ranking median, we introduce the concept of \textitconsensus ranking distribution ( \crd ), a sparse mixture model of Dirac masses on \mathfrakS_n , in order to approximate a ranking distribution with small distortion from a mass transportation perspective. We prove that by choosing the popular Kendall \tau distance as the cost function, the optimal distortion can be expressed as a function of pairwise probabilities, paving the way for the development of efficient learning methods that do not suffer from the lack of vector space structure on \mathfrakS_n . In particular, we propose a top-down tree-structured statistical algorithm that allows for the progressive refinement of a CRD based on ranking data, from the Dirac mass at a Kemeny median at the root of the tree to the empirical ranking data distribution itself at the end of the tree’s exhaustive growth. In addition to the theoretical arguments developed, the relevance of the algorithm is empirically supported by various numerical experiments.

[LG-107] Highly Adaptive Principal Component Regression

链接: https://arxiv.org/abs/2602.10613
作者: Mingxun Wang,Alejandro Schuler,Mark van der Laan,Carlos García Meixide
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Highly Adaptive Lasso (HAL) is a nonparametric regression method that achieves almost dimension-free convergence rates under minimal smoothness assumptions, but its implementation can be computationally prohibitive in high dimensions due to the large basis matrix it requires. The Highly Adaptive Ridge (HAR) has been proposed as a scalable alternative. Building on both procedures, we introduce the Principal Component based Highly Adaptive Lasso (PCHAL) and Principal Component based Highly Adaptive Ridge (PCHAR). These estimators constitute an outcome-blind dimension reduction which offer substantial gains in computational efficiency and match the empirical performances of HAL and HAR. We also uncover a striking spectral link between the leading principal components of the HAL/HAR Gram operator and a discrete sinusoidal basis, revealing an explicit Fourier-type structure underlying the PC truncation.

[LG-108] Bayesian Inference of Contextual Bandit Policies via Empirical Likelihood

链接: https://arxiv.org/abs/2602.10608
作者: Jiangrong Ouyang,Mingming Gong,Howard Bondell
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted for publication in JMLR

点击查看摘要

Abstract:Policy inference plays an essential role in the contextual bandit problem. In this paper, we use empirical likelihood to develop a Bayesian inference method for the joint analysis of multiple contextual bandit policies in finite sample regimes. The proposed inference method is robust to small sample sizes and is able to provide accurate uncertainty measurements for policy value evaluation. In addition, it allows for flexible inferences on policy comparison with full uncertainty quantification. We demonstrate the effectiveness of the proposed inference method using Monte Carlo simulations and its application to an adolescent body mass index data set.

[LG-109] Deep Bootstrap

链接: https://arxiv.org/abs/2602.10587
作者: Jinyuan Chang,Yuling Jiao,Lican Kang,Junjie Shi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we propose a novel deep bootstrap framework for nonparametric regression based on conditional diffusion models. Specifically, we construct a conditional diffusion model to learn the distribution of the response variable given the covariates. This model is then used to generate bootstrap samples by pairing the original covariates with newly synthesized responses. We reformulate nonparametric regression as conditional sample mean estimation, which is implemented directly via the learned conditional diffusion model. Unlike traditional bootstrap methods that decouple the estimation of the conditional distribution, sampling, and nonparametric regression, our approach integrates these components into a unified generative framework. With the expressive capacity of diffusion models, our method facilitates both efficient sampling from high-dimensional or multimodal distributions and accurate nonparametric estimation. We establish rigorous theoretical guarantees for the proposed method. In particular, we derive optimal end-to-end convergence rates in the Wasserstein distance between the learned and target conditional distributions. Building on this foundation, we further establish the convergence guarantees of the resulting bootstrap procedure. Numerical studies demonstrate the effectiveness and scalability of our approach for complex regression tasks.

[LG-110] Why Agent ic Theorem Prover Works: A Statistical Provability Theory of Mathematical Reasoning Models

链接: https://arxiv.org/abs/2602.10538
作者: Sho Sonoda,Shunta Akiyama,Yuya Uezato
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Agentic theorem provers – pipelines that couple a mathematical reasoning model with library retrieval, subgoal-decomposition/search planner, and a proof assistant verifier – have recently achieved striking empirical success, yet it remains unclear which components drive performance and why such systems work at all despite classical hardness of proof search. We propose a distributional viewpoint and introduce statistical provability, defined as the finite-horizon success probability of reaching a verified proof, averaged over an instance distribution, and formalize modern theorem-proving pipelines as time-bounded MDPs. Exploiting Bellman structure, we prove existence of optimal policies under mild regularity, derive provability certificates via sub-/super-solution inequalities, and bound the performance gap of score-guided planning (greedy/top-(k)/beam/rollouts) in terms of approximation error, sequential statistical complexity, representation geometry (metric entropy/doubling structure), and action-gap margin tails. Together, our theory provides a principled, component-sensitive explanation of when and why agentic theorem provers succeed on biased real-world problem distributions, while clarifying limitations in worst-case or adversarial regimes.

[LG-111] Statistical Inference and Learning for Shapley Additive Explanations (SHAP)

链接: https://arxiv.org/abs/2602.10532
作者: Justin Whitehouse,Ayush Sawarni,Vasilis Syrgkanis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 48 pages, 1 figure

点击查看摘要

Abstract:The SHAP (short for Shapley additive explanation) framework has become an essential tool for attributing importance to variables in predictive tasks. In model-agnostic settings, SHAP uses the concept of Shapley values from cooperative game theory to fairly allocate credit to the features in a vector X based on their contribution to an outcome Y . While the explanations offered by SHAP are local by nature, learners often need global measures of feature importance in order to improve model explainability and perform feature selection. The most common approach for converting these local explanations into global ones is to compute either the mean absolute SHAP or mean squared SHAP. However, despite their ubiquity, there do not exist approaches for performing statistical inference on these quantities. In this paper, we take a semi-parametric approach for calibrating confidence in estimates of the p th powers of Shapley additive explanations. We show that, by treating the SHAP curve as a nuisance function that must be estimated from data, one can reliably construct asymptotically normal estimates of the p th powers of SHAP. When p \geq 2 , we show a de-biased estimator that combines U-statistics with Neyman orthogonal scores for functionals of nested regressions is asymptotically normal. When 1 \leq p 2 (and the hence target parameter is not twice differentiable), we construct de-biased U-statistics for a smoothed alternative. In particular, we show how to carefully tune the temperature parameter of the smoothing function in order to obtain inference for the true, unsmoothed p th power. We complement these results by presenting a Neyman orthogonal loss that can be used to learn the SHAP curve via empirical risk minimization and discussing excess risk guarantees for commonly used function classes. Comments: 48 pages, 1 figure Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2602.10532 [stat.ML] (or arXiv:2602.10532v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2602.10532 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-112] From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources

Link: https://arxiv.org/abs/2602.10531
Authors: Soham Bakshi, Sunrit Chakraborty
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:The problem of model collapse has presented new challenges for the iterative training of generative models, where training on synthetic data leads to an overall degradation of performance. This paper looks at the problem from a statistical viewpoint, illustrating that one can actually hope for improvement when models are trained on data contaminated with synthetic samples, as long as some fresh information from the true target distribution is present. In particular, we consider iterative training on samples sourced from a mixture of the true target and synthetic distributions. We analyze the entire iterative evolution in a next-token-prediction language model, capturing how the interplay between the mixture weights and the sample size controls long-term performance. With a non-trivial mixture weight on the true distribution, even one that decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and even recover the true target distribution under certain conditions. Simulation studies support our findings and show that this behavior extends to other classes of models.
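
The mixture setup is easy to simulate in miniature. A toy one-dimensional stand-in (Gaussian mean estimation rather than the paper's next-token model; `lam`, the sample sizes, and the fitting rule are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, n, lam, T = 3.0, 200, 0.3, 50  # lam: mixture weight on the true source

mu_hat = 0.0  # generation-0 model
for t in range(T):
    n_true = int(lam * n)
    fresh = rng.normal(true_mu, 1.0, size=n_true)         # true target samples
    synthetic = rng.normal(mu_hat, 1.0, size=n - n_true)  # previous generation's model
    # contamination-agnostic training: fit on the pooled sample
    mu_hat = np.concatenate([fresh, synthetic]).mean()

print(f"estimate after {T} generations: {mu_hat:.3f} (true mean {true_mu})")
```

In expectation the recursion is mu_{t+1} = lam * true_mu + (1 - lam) * mu_t, which contracts toward the true mean for any fixed lam > 0; with lam = 0 the estimate drifts and error accumulates, the collapse regime the abstract contrasts against.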

[LG-113] Generalized Robust Adaptive-Bandwidth Multi-View Manifold Learning in High Dimensions with Noise

Link: https://arxiv.org/abs/2602.10530
Authors: Xiucai Ding, Chao Shen, Hau-Tieng Wu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 4 figures

Abstract:Multiview datasets are common in scientific and engineering applications, yet existing fusion methods offer limited theoretical guarantees, particularly in the presence of heterogeneous and high-dimensional noise. We propose Generalized Robust Adaptive-Bandwidth Multiview Diffusion Maps (GRAB-MDM), a new kernel-based diffusion geometry framework for integrating multiple noisy data sources. The key innovation of GRAB-MDM is a view-dependent bandwidth selection strategy that adapts to the geometry and noise level of each view, enabling a stable and principled construction of multiview diffusion operators. Under a common-manifold model, we establish asymptotic convergence results and show that the adaptive bandwidths lead to provably robust recovery of the shared intrinsic structure, even when noise levels and sensor dimensions differ across views. Numerical experiments demonstrate that GRAB-MDM significantly improves robustness and embedding quality compared with fixed-bandwidth and equal-bandwidth baselines, and usually outperforms existing algorithms. The proposed framework offers a practical and theoretically grounded solution for multiview sensor fusion in high-dimensional noisy environments.
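
The core idea, choosing each view's kernel bandwidth from that view's own distance scale before fusing the views into one diffusion operator, can be sketched generically; the median-heuristic bandwidth and the operator-product fusion below are illustrative stand-ins, not GRAB-MDM's actual construction:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def view_operator(X, quantile=0.5):
    """Row-stochastic diffusion operator with a view-dependent bandwidth."""
    d = squareform(pdist(X))
    eps = np.quantile(d[d > 0], quantile) ** 2  # adapts to this view's scale/noise
    W = np.exp(-d ** 2 / eps)
    return W / W.sum(axis=1, keepdims=True)

def multiview_embedding(views, n_components=2):
    # one simple fusion choice: alternate diffusion through each view
    P = view_operator(views[0])
    for X in views[1:]:
        P = P @ view_operator(X)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)[1:n_components + 1]  # drop trivial eigenvector
    return vals.real[order] * vecs.real[:, order]

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 300)  # shared intrinsic coordinate
view1 = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(300, 2))
view2 = 10 * np.c_[np.sin(theta), np.cos(theta)] + rng.normal(size=(300, 2))
emb = multiview_embedding([view1, view2])  # 2-D embedding of the shared structure
```

Note how each view gets its own eps: the second view lives on a 10x larger, noisier scale, and a single shared bandwidth would over- or under-smooth one of the two.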

[LG-114] Privacy-Utility Tradeoffs in Quantum Information Processing

Link: https://arxiv.org/abs/2602.10510
Authors: Theshani Nuradha, Sujeet Bhalerao, Felix Leditzky
Subjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 23 pages, 2 figures

Abstract:When sensitive information is encoded in data, it is important to ensure the privacy of that information when attempting to learn from the data. There is a natural tradeoff whereby increasing privacy requirements may decrease the utility of a learning protocol. In the quantum setting of differential privacy, such tradeoffs between privacy and utility have so far remained largely unexplored. In this work, we study optimal privacy-utility tradeoffs for both generic and application-specific utility metrics when privacy is quantified by (\varepsilon,\delta)-quantum local differential privacy. In the generic setting, we focus on optimizing fidelity and trace distance between the original state and the privatized state. We show that the depolarizing mechanism achieves the optimal utility for given privacy requirements. We then study the specific application of learning the expectation of an observable with respect to an input state when only given access to privatized states. We derive a lower bound on the number of samples of privatized data required to achieve a fixed accuracy guarantee with high probability. To prove this result, we employ existing lower bounds on private quantum hypothesis testing, thus showcasing the first operational use of them. We also devise private mechanisms that achieve optimal sample complexity with respect to the privacy and accuracy parameters, demonstrating that utility can be significantly improved for specific tasks in contrast to the generic setting. In addition, we show that the number of samples required to privately learn observable expectation values scales as \Theta((\varepsilon\beta)^{-2}), where \varepsilon \in (0,1) is the privacy parameter and \beta is the accuracy tolerance. We conclude by initiating the study of private classical shadows, which promise useful applications for private learning tasks.
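
The generic-utility portion of the paper singles out the depolarizing mechanism. A numerical sketch of that channel (the mapping from the privacy parameters (\varepsilon, \delta) to the depolarizing probability is the paper's result and is not reproduced; `p_dep` is left as a free knob):

```python
import numpy as np

def depolarize(rho, p_dep):
    """Depolarizing channel: keep the state w.p. 1 - p_dep, else output I/d."""
    d = rho.shape[0]
    return (1 - p_dep) * rho + p_dep * np.eye(d) / d

def fidelity_with_pure(psi, rho):
    """Fidelity <psi| rho |psi> between a pure state and a density matrix."""
    return float(np.real(np.conj(psi) @ rho @ psi))

psi = np.array([1.0, 0.0], dtype=complex)  # qubit state |0>
rho = np.outer(psi, np.conj(psi))
for p in (0.1, 0.5, 0.9):
    print(p, fidelity_with_pure(psi, depolarize(rho, p)))  # utility falls as p grows
```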

[LG-115] Unlocked Backpropagation using Wave Scattering

Link: https://arxiv.org/abs/2602.10461
Authors: Christian Pehle, Jean-Jacques Slotine
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 8 pages

Abstract:Both the backpropagation algorithm in machine learning and the maximum principle in optimal control theory are posed as two-point boundary value problems, resulting in a “forward-backward” lock. We derive a reformulation of the maximum principle of optimal control theory as a hyperbolic initial value problem by introducing an additional “optimization time” dimension. We introduce counter-propagating wave variables with finite propagation speed and recast the optimization problem in terms of scattering relationships between them. This relaxation of the original problem can be interpreted as a physical system that equilibrates and changes its physical properties in order to minimize reflections. We discretize this continuum theory to derive a family of fully unlocked algorithms suitable for training neural networks. Different parameter dynamics, including gradient descent, can be derived by demanding dissipation and minimization of reflections at parameter ports. These results also imply that any physical substrate supporting the scattering and dissipation of waves can be interpreted as solving an optimization problem.

[LG-116] Distributed Online Convex Optimization with Nonseparable Costs and Constraints

Link: https://arxiv.org/abs/2602.10452
Authors: Zhaoye Pan, Haozhe Lei, Fan Zuo, Zilin Bian, Tao Li
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:This paper studies distributed online convex optimization with time-varying coupled constraints, motivated by distributed online control in network systems. Most prior work assumes a separability condition: the global objective and coupled constraint functions are sums of local costs and individual constraints. In contrast, we study a group of agents, networked via a communication graph, that collectively select actions to minimize a sequence of nonseparable global cost functions and to satisfy nonseparable long-term constraints, based on full-information feedback and inter-agent communication. We propose a distributed online primal-dual belief consensus algorithm, in which each agent maintains and updates a local belief about the global collective decisions, repeatedly exchanged with neighboring agents. Unlike previous consensus primal-dual algorithms under separability, which ask agents to communicate only their local decisions, our belief-sharing protocol eliminates the coupling between the primal consensus disagreement and the dual constraint violation, yielding sublinear regret and cumulative constraint violation (CCV) bounds, both in O(T^{1/2}), where T denotes the time horizon. This result breaks the long-standing O(T^{3/4}) barrier for CCV and matches the lower bound of online constrained convex optimization, indicating online learning efficiency at the cost of communication overhead.
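
A schematic of the belief-sharing idea: each agent maintains and mixes a local copy of the full stacked decision, then takes primal-dual steps on it. Step sizes, the update order, and the toy cost/constraint below are placeholders, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, d, T, eta = 4, 3, 200, 0.05
W = np.full((n_agents, n_agents), 1.0 / n_agents)  # doubly stochastic mixing matrix

beliefs = np.zeros((n_agents, n_agents * d))  # each row: one agent's belief of ALL decisions
duals = np.zeros(n_agents)

def grad_cost(z, t):   # placeholder nonseparable global cost gradient at time t
    return z - np.sin(0.1 * t)

def constraint(z):     # placeholder nonseparable coupled constraint g_t(z) <= 0
    return z.sum() - 1.0

for t in range(T):
    beliefs = W @ beliefs  # consensus: exchange full beliefs, not just local decisions
    for i in range(n_agents):
        lagrangian_grad = grad_cost(beliefs[i], t) + duals[i] * np.ones(n_agents * d)
        beliefs[i] -= eta * lagrangian_grad                           # primal descent
        duals[i] = max(0.0, duals[i] + eta * constraint(beliefs[i]))  # dual ascent
```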

[LG-117] Causal Effect Estimation with Learned Instrument Representations

Link: https://arxiv.org/abs/2602.10370
Authors: Frances Dean, Jenna Fields, Radhika Bhalerao, Marie Charpignon, Ahmed Alaa
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Instrumental variable (IV) methods mitigate bias from unobserved confounding in observational causal inference but rely on the availability of a valid instrument, which can often be difficult or infeasible to identify in practice. In this paper, we propose a representation learning approach that constructs instrumental representations from observed covariates, which enable IV-based estimation even in the absence of an explicit instrument. Our model (ZNet) achieves this through an architecture that mirrors the structural causal model of IVs; it decomposes the ambient feature space into confounding and instrumental components, and is trained by enforcing empirical moment conditions corresponding to the defining properties of valid instruments (i.e., relevance, exclusion restriction, and instrumental unconfoundedness). Importantly, ZNet is compatible with a wide range of downstream two-stage IV estimators of causal effects. Our experiments demonstrate that ZNet can (i) recover ground-truth instruments when they already exist in the ambient feature space and (ii) construct latent instruments in the embedding space when no explicit IVs are available. This suggests that ZNet can be used as a “plug-and-play” module for causal inference in general observational settings, regardless of whether the (untestable) assumption of unconfoundedness is satisfied.
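
The "downstream two-stage IV estimator" the abstract plugs into is standard 2SLS. Once a learned representation plays the instrument role, the pipeline looks like the sketch below; the `learned_instrument` stand-in replaces ZNet's trained instrumental head, which is not implemented here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)   # unobserved confounder
z = rng.normal(size=n)   # instrumental signal hidden in the covariates
treatment = 0.8 * z + u + rng.normal(size=n)
outcome = 1.5 * treatment + 2.0 * u + rng.normal(size=n)  # true causal effect: 1.5

def learned_instrument(x):
    # stand-in for ZNet's instrumental component g(x); here it is simply observed
    return x.reshape(-1, 1)

Z = learned_instrument(z)
t_hat = LinearRegression().fit(Z, treatment).predict(Z)         # stage 1: relevance
iv_fit = LinearRegression().fit(t_hat.reshape(-1, 1), outcome)  # stage 2
ols_fit = LinearRegression().fit(treatment.reshape(-1, 1), outcome)
print("2SLS:", iv_fit.coef_[0], "vs confounded OLS:", ols_fit.coef_[0])
```

The naive OLS coefficient is biased upward by the confounder u; the two-stage estimate recovers the true effect because the instrument carries no information about u.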

[LG-118] Efficient reduction of stellar contamination and noise in planetary transmission spectra using neural networks

Link: https://arxiv.org/abs/2602.10330
Authors: David S. Duque-Castaño, Lauren Flor-Torres, Jorge I. Zuluaga
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 16 pages, 11 figures. Submitted to Astronomy & Astrophysics

Abstract:Context: JWST has enabled transmission spectroscopy at unprecedented precision, but stellar heterogeneities (spots and faculae) remain a dominant contamination source that can bias atmospheric retrievals if uncorrected. Aims: We present a fast, unsupervised methodology to reduce stellar contamination and instrument-specific noise in exoplanet transmission spectra using denoising autoencoders, improving the reliability of retrieved atmospheric parameters. Methods: We design and train denoising autoencoder architectures on large synthetic datasets of terrestrial (TRAPPIST-1e analogues) and sub-Neptune (K2-18b analogues) planets. Reconstruction quality is evaluated with the \chi^2 statistic over a wide range of signal-to-noise ratios, and atmospheric retrieval experiments on contaminated spectra are used to compare against standard correction approaches in accuracy and computational cost. Results: The autoencoders reconstruct uncontaminated spectra while preserving key molecular features, even at low S/N. In retrieval tests, pre-processing with denoising autoencoders reduces bias in inferred abundances relative to uncorrected baselines and matches the accuracy of simultaneous stellar-contamination fitting while reducing computational time by a factor of three to six. Conclusions: Denoising autoencoders provide an efficient alternative to conventional correction strategies and are promising components of future atmospheric characterization pipelines for both rocky and gaseous exoplanets.
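
A minimal denoising-autoencoder skeleton of the kind the paper trains, for 1-D spectra (layer sizes, the toy contamination model, and the training loop are placeholders; the paper uses large physically simulated datasets of contaminated spectra):

```python
import torch
import torch.nn as nn

n_bins = 256  # wavelength bins of the transmission spectrum

class SpectrumDAE(nn.Module):
    def __init__(self, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_bins, 128), nn.ReLU(),
                                 nn.Linear(128, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                 nn.Linear(128, n_bins))

    def forward(self, x):
        return self.dec(self.enc(x))

model = SpectrumDAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.randn(512, n_bins)               # placeholder for simulated clean spectra
noisy = clean + 0.1 * torch.randn_like(clean)  # placeholder spot/facula + noise model

for epoch in range(10):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(noisy), clean)  # denoising objective
    loss.backward()
    opt.step()
```

The speedup reported in the abstract comes from this division of labor: the network removes contamination in a single forward pass, so the downstream retrieval no longer has to fit a stellar-heterogeneity model jointly with the atmosphere.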

[LG-119] Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning

Link: https://arxiv.org/abs/2602.10273
Authors: Seyedarmin Azizi, Erfan Baghaei Potraghloo, Minoo Ahmadi, Souvik Kundu, Massoud Pedram
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Many recent reasoning gains in large language models can be explained as distribution sharpening: biasing generation toward high-likelihood trajectories already supported by the pretrained model, rather than modifying its weights. A natural formalization is the sequence-level power distribution \pi_\alpha(y\mid x) \propto p_\theta(y\mid x)^\alpha (\alpha > 1), which concentrates mass on whole sequences instead of adjusting token-level temperature. Prior work shows that Metropolis–Hastings (MH) sampling from this distribution recovers strong reasoning performance, but at order-of-magnitude inference slowdowns. We introduce Power-SMC, a training-free Sequential Monte Carlo scheme that targets the same objective while remaining close to standard decoding latency. Power-SMC advances a small particle set in parallel, corrects importance weights token-by-token, and resamples when necessary, all within a single GPU-friendly batched decode. We prove that temperature \tau = 1/\alpha is the unique prefix-only proposal minimizing incremental weight variance, interpret residual instability via prefix-conditioned Rényi entropies, and introduce an exponent-bridging schedule that improves particle stability without altering the target. On MATH500, Power-SMC matches or exceeds MH power sampling while reducing latency from 16–28\times to 1.4–3.3\times over baseline decoding.
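
The token-level mechanics follow directly from the abstract: propose at temperature 1/\alpha, weight by the correction p^\alpha/q, and resample when the effective sample size drops. A toy sketch (a hashed stand-in replaces the LLM; particle count, horizon, and the ESS threshold are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, N, alpha = 8, 10, 16, 4.0  # vocab size, horizon, particles, power

def toy_logits(prefix):
    # stand-in for an LLM's next-token logits, deterministic per prefix
    return np.random.default_rng(hash(tuple(prefix)) % 2**32).normal(size=V)

prefixes, logw = [[] for _ in range(N)], np.zeros(N)
for t in range(T):
    for i in range(N):
        p = np.exp(toy_logits(prefixes[i])); p /= p.sum()  # base model p(.|prefix)
        q = p ** alpha / (p ** alpha).sum()  # proposal = temperature 1/alpha
        tok = rng.choice(V, p=q)
        # incremental weight p(tok)^alpha / q(tok) = sum_v p(v)^alpha: it does
        # not depend on the sampled token, the variance-minimizing property
        logw[i] += np.log((p ** alpha).sum())
        prefixes[i].append(int(tok))
    w = np.exp(logw - logw.max()); w /= w.sum()
    if 1.0 / (w ** 2).sum() < N / 2:  # resample when ESS falls below N/2
        idx = rng.choice(N, size=N, p=w)
        prefixes, logw = [list(prefixes[j]) for j in idx], np.zeros(N)
```

Because all N particles decode one token per step, the whole scheme fits in a single batched forward pass per step, which is where the near-baseline latency comes from.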

[LG-120] Dissecting Performative Prediction: A Comprehensive Survey

Link: https://arxiv.org/abs/2602.10176
Authors: Thomas Kehrenberg, Javier Sanguino, Jose A. Lozano, Novi Quadrianto
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The field of performative prediction began in 2020 with the seminal paper “Performative Prediction” by Perdomo et al., which established a novel machine learning setup in which deploying a predictive model causes a distribution shift in the environment, creating a mismatch between the distribution the model expects and the real one. This shift is described by a so-called distribution map. In the half-decade since, a literature has emerged that has, among other things, introduced new solution concepts for the original setup, extended the setup, offered new theoretical analyses, and examined the intersection of performative prediction with other established fields. In this survey, we first lay out the performative prediction setting and explain the different optimization targets: performative stability and performative optimality. We introduce a new way of classifying performative prediction settings based on how much information is available about the distribution map. We survey existing implementations of distribution maps and existing methods for the performative prediction problem, examining different ways to categorize them. Finally, we point out known and previously unknown connections to other fields, in the hope of stimulating future research.
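
The two optimization targets mentioned above, stated as in Perdomo et al., with distribution map \mathcal{D}(\theta) and loss \ell:

```latex
% Performative stability: a fixed point of retraining on the distribution
% the deployed model itself induces.
\theta_{PS} \in \arg\min_{\theta} \;
  \mathbb{E}_{z \sim \mathcal{D}(\theta_{PS})} \, \ell(z; \theta)

% Performative optimality: the best loss achievable once the shift caused
% by deployment is taken into account.
\theta_{PO} \in \arg\min_{\theta} \;
  \mathbb{E}_{z \sim \mathcal{D}(\theta)} \, \ell(z; \theta)
```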

[LG-121] STRAND: Sequence-Conditioned Transport for Single-Cell Perturbations

Link: https://arxiv.org/abs/2602.10156
Authors: Boyang Fu, George Dasoulas, Sameer Gabbita, Xiang Lin, Shanghua Gao, Xiaorui Su, Soumya Ghosh, Marinka Zitnik
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
Comments: 8 pages for main draft, 6 main figures

Abstract:Predicting how genetic perturbations change cellular state is a core problem for building controllable models of gene regulation. Perturbations targeting the same gene can produce different transcriptional responses depending on their genomic locus, including different transcription start sites and regulatory elements. Gene-level perturbation models collapse these distinct interventions into the same representation. We introduce STRAND, a generative model that predicts single-cell transcriptional responses by conditioning on regulatory DNA sequence. STRAND represents a perturbation by encoding the sequence at its genomic locus and uses this representation to parameterize a conditional transport process from control to perturbed cell states. Representing perturbations by sequence, rather than by a fixed set of gene identifiers, supports zero-shot inference at loci not seen during training and expands inference-time genomic coverage from ~1.5% for gene-level single-cell foundation models to ~95% of the genome. We evaluate STRAND on CRISPR perturbation datasets in K562, Jurkat, and RPE1 cells. STRAND improves discrimination scores by up to 33% in low-sample regimes, achieves the best average rank on unseen gene perturbation benchmarks, and improves transfer to novel cell lines by up to 0.14 in Pearson correlation. Ablations isolate the gains to sequence conditioning and transport, and case studies show that STRAND resolves functionally alternative transcription start sites missed by gene-level models.

[LG-122] Validating Interpretability in siRNA Efficacy Prediction: A Perturbation-Based Dataset-Aware Protocol

Link: https://arxiv.org/abs/2602.10152
Authors: Zahra Khodagholi, Niloofar Yousefi
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)

Abstract:Saliency maps are increasingly used as design guidance in siRNA efficacy prediction, yet attribution methods are rarely validated before motivating sequence edits. We introduce a pre-synthesis gate: a protocol for counterfactual sensitivity faithfulness that tests whether mutating high-saliency positions changes model output more than composition-matched controls. Cross-dataset transfer reveals two failure modes that would otherwise go undetected: faithful-but-wrong (saliency valid, predictions fail) and inverted saliency (top-saliency edits less impactful than random). Strikingly, models trained on mRNA-level assays collapse on a luciferase reporter dataset, demonstrating that protocol shifts can silently invalidate deployment. Across four benchmarks, 19/20 fold instances pass; the single failure shows inverted saliency. A biology-informed regularizer (BioPrior) strengthens saliency faithfulness with modest, dataset-dependent predictive trade-offs. Our results establish saliency validation as essential pre-deployment practice for explanation-guided therapeutic design. Code is available at this https URL.
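
The gate is a paired perturbation comparison, and its skeleton is short. In the sketch below the model, the saliency function, and the control rule are toy placeholders (controls are matched only in the number of edited positions, not full nucleotide composition as in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
BASES = np.array(list("ACGU"))

def mutate(seq, positions):
    out = seq.copy()
    for p in positions:
        out[p] = rng.choice(BASES[BASES != seq[p]])  # substitute a different base
    return out

def sensitivity_gate(model, saliency, seqs, k=3, n_controls=20):
    """Pass iff editing top-saliency positions moves predictions more than
    random controls; 'inverted saliency' shows up as a failed gate."""
    top, ctrl = [], []
    for seq in seqs:
        base = model(seq)
        top_pos = np.argsort(-saliency(seq))[:k]
        top.append(abs(model(mutate(seq, top_pos)) - base))
        ctrl.append(np.mean([
            abs(model(mutate(seq, rng.choice(len(seq), k, replace=False))) - base)
            for _ in range(n_controls)]))
    return float(np.mean(top)) > float(np.mean(ctrl))

# toy model whose output depends only on positions 5..8, with a saliency map
# that correctly highlights them, so the gate should pass
model = lambda s: float(np.sum(s[5:9] == "G"))
saliency = lambda s: ((np.arange(len(s)) >= 5) & (np.arange(len(s)) < 9)).astype(float)
seqs = [rng.choice(BASES, size=21) for _ in range(30)]
print("gate passed:", sensitivity_gate(model, saliency, seqs))
```

Running the same gate with a saliency map that highlights only non-functional positions would return False, which is exactly the inverted-saliency failure mode the abstract reports.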

Attachments

Click to download the full list of today's papers