This post lists the latest papers fetched from arXiv.org on 2026-02-19. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from arXiv.org daily, with an automatic update around 12:30 each day.
Tip: if a given day is not updated on time, either arXiv released no new papers that day or the update script failed. Fixes are made the same day whenever possible.
Table of Contents
Overview (2026-02-19)
482 papers were added today, including:
- Natural Language Processing: 85 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 136 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 66 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 170 papers (Machine Learning (cs.LG))
- Multi-Agent Systems: 14 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 12 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 28 papers (Human-Computer Interaction (cs.HC))
Multi-Agent Systems
[MA-0] Policy Compiler for Secure Agentic Systems
Quick Read: This paper targets the lack of deterministic enforcement guarantees for complex authorization policies in LLM-driven agentic systems, such as customer-service protocols, approval workflows, data-access restrictions, and regulatory compliance requirements. Existing approaches that embed policies in prompts provide no enforcement guarantee, so policies can be bypassed or misapplied. The key idea of PCAS (Policy Compiler for Agentic Systems) is to model the causal relationships among events in a multi-agent system (tool calls, tool results, and messages) as a dependency graph and to express policies as declarative Datalog-like rules that support cross-agent provenance and transitive information-flow analysis. A reference monitor intercepts every action and blocks violations before execution, yielding a runtime environment that is policy-compliant by construction, without restructuring the original system, and ensuring deterministic, secure policy enforcement.
Link: https://arxiv.org/abs/2602.16708
Authors: Nils Palumbo, Sarthak Choudhary, Jihye Choi, Prasad Chalasani, Mihai Christodorescu, Somesh Jha
Affiliations: University of Wisconsin–Madison; Langroid; Google
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract: LLM-based agents are increasingly being deployed in contexts requiring complex authorization policies: customer service protocols, approval workflows, data access restrictions, and regulatory compliance. Embedding these policies in prompts provides no enforcement guarantees. We present PCAS, a Policy Compiler for Agentic Systems that provides deterministic policy enforcement. Enforcing such policies requires tracking information flow across agents, which linear message histories cannot capture. Instead, PCAS models the agentic system state as a dependency graph capturing causal relationships among events such as tool calls, tool results, and messages. Policies are expressed in a Datalog-derived language, as declarative rules that account for transitive information flow and cross-agent provenance. A reference monitor intercepts all actions and blocks violations before execution, providing deterministic enforcement independent of model reasoning. PCAS takes an existing agent implementation and a policy specification, and compiles them into an instrumented system that is policy-compliant by construction, with no security-specific restructuring required. We evaluate PCAS on three case studies: information flow policies for prompt injection defense, approval workflows in a multi-agent pharmacovigilance system, and organizational policies for customer service. On customer service tasks, PCAS improves policy compliance from 48% to 93% across frontier models, with zero policy violations in instrumented runs.
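To make the dependency-graph idea concrete, here is a minimal sketch in Python (not PCAS's actual DSL or API; the event fields and the example "no email derived from untrusted web content" rule are assumptions for illustration): provenance is computed by walking causal parents, and a reference-monitor-style check blocks a violating action before execution.

```python
# Minimal sketch (not PCAS's actual DSL): a dependency graph over agent
# events, plus a reference-monitor check that blocks a tool call whose
# inputs transitively derive from untrusted data.
from collections import deque

class Event:
    def __init__(self, eid, kind, untrusted=False, parents=()):
        self.eid, self.kind = eid, kind
        self.untrusted = untrusted      # e.g. text fetched from the web
        self.parents = list(parents)    # causal dependencies

def transitively_untrusted(event):
    """BFS over causal parents: True if any ancestor is untrusted."""
    seen, queue = set(), deque([event])
    while queue:
        e = queue.popleft()
        if e.eid in seen:
            continue
        seen.add(e.eid)
        if e.untrusted:
            return True
        queue.extend(e.parents)
    return False

def reference_monitor(action):
    """Intercept an action before execution; block policy violations."""
    if action.kind == "send_email" and transitively_untrusted(action):
        raise PermissionError(f"blocked {action.eid}: tainted provenance")
    return action  # compliant actions pass through unchanged

web = Event("e1", "tool_result", untrusted=True)
summary = Event("e2", "message", parents=[web])
email = Event("e3", "send_email", parents=[summary])
try:
    reference_monitor(email)
except PermissionError as exc:
    print(exc)  # blocked e3: tainted provenance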
[MA-1] Fairness Dynamics in Digital Economy Platforms with Biased Ratings AAMAS2026
Quick Read: This paper studies how digital service platforms can reduce the discrimination against marginalised groups that biased rating systems amplify, while keeping all service providers incentivised to offer high-quality service. The core challenge is balancing user experience against fairness. The key to the solution is an evolutionary game-theoretic model of the platform's recommendation choices, i.e., whether to promote highly rated providers or members of a specific protected group. The analysis shows that purely rating-based recommendation depresses demand for marginalised providers, whereas tuning the demographic composition of search results (e.g., proactively including a share of protected-group members) substantially reduces unfairness with minimal impact on users. Even without a precise measurement of the rating bias, such interventions still outperform recommenders that ignore protected characteristics, highlighting the value of proactive anti-discrimination design in systems where ratings are used to promote cooperative behaviour.
Link: https://arxiv.org/abs/2602.16695
Authors: J. Martin Smit, Fernando P. Santos
Affiliations: University of Amsterdam
Subjects: Multiagent Systems (cs.MA); Computers and Society (cs.CY)
Comments: 9 pages, 6 figures, in proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Abstract:The digital services economy consists of online platforms that facilitate interactions between service providers and consumers. This ecosystem is characterized by short-term, often one-off, transactions between parties that have no prior familiarity. To establish trust among users, platforms employ rating systems which allow users to report on the quality of their previous interactions. However, while arguably crucial for these platforms to function, rating systems can perpetuate negative biases against marginalised groups. This paper investigates how to design platforms around biased reputation systems, reducing discrimination while maintaining incentives for all service providers to offer high quality service for users. We introduce an evolutionary game theoretical model to study how digital platforms can perpetuate or counteract rating-based discrimination. We focus on the platforms’ decisions to promote service providers who have high reputations or who belong to a specific protected group. Our results demonstrate a fundamental trade-off between user experience and fairness: promoting highly-rated providers benefits users, but lowers the demand for marginalised providers against which the ratings are biased. Our results also provide evidence that intervening by tuning the demographics of the search results is a highly effective way of reducing unfairness while minimally impacting users. Furthermore, we show that even when precise measurements on the level of rating bias affecting marginalised service providers is unavailable, there is still potential to improve upon a recommender system which ignores protected characteristics. Altogether, our model highlights the benefits of proactive anti-discrimination design in systems where ratings are used to promote cooperative behaviour.
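The intervention the abstract calls "tuning the demographics of the search results" can be sketched as a slate-construction rule (a hedged illustration; field names and the quota mechanism are assumptions, not the paper's code):

```python
# Hedged sketch of the search-result intervention: rank providers by
# rating, but reserve a fixed share of the top-k slots for members of a
# protected group whose ratings are biased downward.
def mixed_search_results(providers, k, protected_share):
    ranked = sorted(providers, key=lambda p: p["rating"], reverse=True)
    protected = [p for p in ranked if p["protected"]]
    others = [p for p in ranked if not p["protected"]]
    n_protected = min(len(protected), round(k * protected_share))
    slate = protected[:n_protected] + others[: k - n_protected]
    return sorted(slate, key=lambda p: p["rating"], reverse=True)

providers = [
    {"name": "A", "rating": 4.9, "protected": False},
    {"name": "B", "rating": 4.7, "protected": False},
    {"name": "C", "rating": 4.2, "protected": True},   # biased-down rating
    {"name": "D", "rating": 4.0, "protected": True},
]
print([p["name"] for p in mixed_search_results(providers, k=3, protected_share=1/3)])
# ['A', 'B', 'C'] -- one slot guaranteed to the protected group
```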
[MA-2] Consensus Based Task Allocation for Angles-Only Local Catalog Maintenance of Satellite Systems
Quick Read: This paper addresses how closely operating satellite formations in near-Earth orbit can accurately estimate the relative states of themselves and surrounding objects (other satellites and debris) using onboard sensors, without high-precision ground tracking. The core difficulty is that the sensors provide only angles-only measurements with a limited field of view, while local catalogs must be maintained for both communicating and non-communicating objects, making observation scheduling and coordination inefficient. The key to the solution is a decentralized task-allocation algorithm that optimizes the assignment of observations across communicating satellites, significantly reducing fuel usage and overall catalog uncertainty; numerical simulations show it outperforms the uncertainty-fuel Pareto frontier formed by current approaches.
Link: https://arxiv.org/abs/2602.16678
Authors: Harrison Perone, Christopher W. Hays
Affiliations: University of Connecticut; Air Force Research Laboratory Space Vehicles Directorate
Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 14 pages, 4 figures. Submitted to the 48th Rocky Mountain American Astronautical Society's Guidance, Navigation and Control Conference
Abstract:In order for close proximity satellites to safely perform their missions, the relative states of all satellites and pieces of debris must be well understood. This presents a problem for ground based tracking and orbit determination since it may not be practical to achieve the required accuracy. Using space-based sensors allows for more accurate relative state estimates, especially if multiple satellites are allowed to communicate. Of interest to this work is the case where several communicating satellites each need to maintain a local catalog of communicating and non-communicating objects using angles-only limited field of view (FOV) measurements. However, this introduces the problem of efficiently scheduling and coordinating observations among the agents. This paper presents a decentralized task allocation algorithm to address this problem and quantifies its performance in terms of fuel usage and overall catalog uncertainty via numerical simulation. It was found that the new method significantly outperforms the uncertainty-fuel Pareto frontier formed by current approaches.
[MA-3] Evaluating Collective Behaviour of Hundreds of LLM Agents
Quick Read: This paper addresses the unpredictability of collective behaviour in large-scale deployments of LLM-based autonomous agents facing social dilemmas, focusing on the potential harm to social welfare when agents prioritise individual gain over collective benefit. The key to the solution is an evaluation framework in which LLMs generate strategies encoded as algorithms, enabling inspection before deployment and scaling simulations to hundreds of agents. Combined with a cultural-evolution model of how users select agents, the framework reveals a significant risk of convergence to poor societal equilibria as the relative benefit of cooperation diminishes and population sizes grow; it is released as an evaluation suite for developers to assess the emergent collective behaviour of their models.
Link: https://arxiv.org/abs/2602.16662
Authors: Richard Willis, Jianing Zhao, Yali Du, Joel Z. Leibo
Affiliations: King's College London; Google DeepMind
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:As autonomous agents powered by LLM are increasingly deployed in society, understanding their collective behaviour in social dilemmas becomes critical. We introduce an evaluation framework where LLMs generate strategies encoded as algorithms, enabling inspection prior to deployment and scaling to populations of hundreds of agents – substantially larger than in previous work. We find that more recent models tend to produce worse societal outcomes compared to older models when agents prioritise individual gain over collective benefits. Using cultural evolution to model user selection of agents, our simulations reveal a significant risk of convergence to poor societal equilibria, particularly when the relative benefit of cooperation diminishes and population sizes increase. We release our code as an evaluation suite for developers to assess the emergent collective behaviour of their models.
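A key design point is that strategies are emitted as inspectable code rather than opaque chat behaviour. The following is an illustrative guess at the general shape of such a strategy for an iterated contribution game; the interface (a history of per-round group contributions) is assumed, not taken from the paper:

```python
# Illustrative only: the paper has LLMs emit strategies as inspectable
# algorithms; this shows the general shape for an iterated contribution
# game. The interface (history of per-round group contributions) is an
# assumption for illustration.
def generous_tit_for_tat(history, endowment=10):
    """Contribute fully at first; then mirror the group's mean
    contribution from the previous round, with a generosity floor."""
    if not history:
        return endowment
    mean_prev = sum(history[-1]) / len(history[-1])
    return max(0.2 * endowment, min(endowment, mean_prev))

# One agent's contributions as the rest of the group decays to defection:
rounds = [[10, 10, 10], [6, 4, 8], [1, 0, 2]]
history = []
for r in rounds:
    print(generous_tit_for_tat(history), end=" ")  # 10 10 6.0
    history.append(r)
```

Because the strategy is a plain function, it can be audited line by line before deployment, which is exactly the inspectability the framework is after.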
[MA-4] Team-of-Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Quick Read: This paper addresses the limitation of existing multi-agent systems (MAS) that rely on static, homogeneous model configurations and therefore cannot exploit the distinct strengths of differently post-trained models. The key to the proposed Team-of-Thoughts architecture is leveraging the complementary capabilities of heterogeneous agents through an orchestrator-tool paradigm, via two mechanisms: an orchestrator calibration scheme that identifies models with superior coordination capabilities, and a self-assessment protocol in which tool agents profile their own domain expertise to reflect differences in post-training skills. At inference time, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles to optimize performance.
Link: https://arxiv.org/abs/2602.16485
Authors: Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao
Affiliations: Imperial College London; Microsoft Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 8 pages
Abstract:Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines, which score 80% and 65.93%.
[MA-5] Causally-Guided Automated Feature Engineering with Multi-Agent Reinforcement Learning
Quick Read: This paper addresses the brittleness of features produced by existing automated feature engineering (AFE) methods, which rely on statistical heuristics and degrade under distribution shift. The key to the solution is reformulating AFE as a causally guided sequential decision process in a two-phase framework: Phase I performs causal discovery to learn a sparse directed acyclic graph (DAG) over features and the target, yielding soft causal priors that group features into direct, indirect, and other categories; Phase II uses a cascading multi-agent deep Q-learning architecture with hierarchical reward shaping and causal group-level exploration strategies that favour causally plausible transformations while controlling feature complexity. This substantially improves robustness and efficiency under distribution shift and yields more compact feature sets with more stable post-hoc attributions.
Link: https://arxiv.org/abs/2602.16435
Authors: Arun Vignesh Malarkkan, Wangyang Ying, Yanjie Fu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 11 Pages, References and Appendix
Abstract:Automated feature engineering (AFE) enables AI systems to autonomously construct high-utility representations from raw tabular data. However, existing AFE methods rely on statistical heuristics, yielding brittle features that fail under distribution shift. We introduce CAFE, a framework that reformulates AFE as a causally-guided sequential decision process, bridging causal discovery with reinforcement learning-driven feature construction. Phase I learns a sparse directed acyclic graph over features and the target to obtain soft causal priors, grouping features as direct, indirect, or other based on their causal influence with respect to the target. Phase II uses a cascading multi-agent deep Q-learning architecture to select causal groups and transformation operators, with hierarchical reward shaping and causal group-level exploration strategies that favor causally plausible transformations while controlling feature complexity. Across 15 public benchmarks (classification with macro-F1; regression with inverse relative absolute error), CAFE achieves up to 7% improvement over strong AFE baselines, reduces episodes-to-convergence, and delivers competitive time-to-target. Under controlled covariate shifts, CAFE reduces performance drop by ~4x relative to a non-causal multi-agent baseline, and produces more compact feature sets with more stable post-hoc attributions. These findings underscore that causal structure, used as a soft inductive prior rather than a rigid constraint, can substantially improve the robustness and efficiency of automated feature engineering.
[MA-6] Verifiable Semantics for Agent-to-Agent Communication
Quick Read: This paper addresses communication inconsistency caused by semantic drift in multiagent AI systems, i.e., how to verify that multiple agents share the same understanding of the terms they use. The key to the solution is a certification protocol based on the stimulus-meaning model: agents are tested on shared observable events, and a term is certified when empirical disagreement falls below a statistical threshold. Restricting agents to reason only with certified terms ("core-guarded reasoning") keeps their disagreement within a provable bound. The mechanism also supports detecting semantic drift (recertification) and recovering a shared vocabulary (renegotiation); experiments show disagreement reductions of 72-96% in simulation and 51% with fine-tuned language models.
Link: https://arxiv.org/abs/2602.16424
Authors: Philipp Schoenegger, Matt Carlson, Chris Schneider, Chris Daly
Affiliations: Microsoft AI; Wabash College
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Multiagent AI systems require consistent communication, but we lack methods to verify that agents share the same understanding of the terms used. Natural language is interpretable but vulnerable to semantic drift, while learned protocols are efficient but opaque. We propose a certification protocol based on the stimulus-meaning model, where agents are tested on shared observable events and terms are certified if empirical disagreement falls below a statistical threshold. In this protocol, agents restricting their reasoning to certified terms (“core-guarded reasoning”) achieve provably bounded disagreement. We also outline mechanisms for detecting drift (recertification) and recovering shared vocabulary (renegotiation). In simulations with varying degrees of semantic divergence, core-guarding reduces disagreement by 72-96%. In a validation with fine-tuned language models, disagreement is reduced by 51%. Our framework provides a first step towards verifiable agent-to-agent communication.
[MA-7] Graphon Mean-Field Subsampling for Cooperative Heterogeneous Multi-Agent Reinforcement Learning
Quick Read: This paper tackles the exponential growth of the joint state-action space in multi-agent reinforcement learning (MARL) as the number of agents increases, particularly under heterogeneous interactions. Mean-field methods alleviate the curse of dimensionality but assume homogeneous interactions, while graphon-based frameworks capture heterogeneity at a computational cost that becomes prohibitive at scale. The key to the solution is the Graphon Mean-Field Subsampling (GMFS) framework: by subsampling κ agents according to interaction strength, it approximates the graphon-weighted mean field and learns a policy with sample complexity poly(κ) and optimality gap O(1/√κ), enabling efficient and scalable cooperative MARL.
Link: https://arxiv.org/abs/2602.16196
Authors: Emile Anand, Richard Hoffmann, Sarah Liaw, Adam Wierman
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 43 pages, 5 figures, 1 table
Abstract: Coordinating large populations of interacting agents is a central challenge in multi-agent reinforcement learning (MARL), where the size of the joint state-action space scales exponentially with the number of agents. Mean-field methods alleviate this burden by aggregating agent interactions, but these approaches assume homogeneous interactions. Recent graphon-based frameworks capture heterogeneity, but are computationally expensive as the number of agents grows. Therefore, we introduce GMFS, a Graphon Mean-Field Subsampling framework for scalable cooperative MARL with heterogeneous agent interactions. By subsampling κ agents according to interaction strength, we approximate the graphon-weighted mean-field and learn a policy with sample complexity poly(κ) and optimality gap O(1/√κ). We verify our theory with numerical simulations in robotic coordination, showing that GMFS achieves near-optimal performance.
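The core estimator can be sketched in a few lines of numpy (a hedged illustration under assumed choices: an example exponential graphon and scalar agent states; not the paper's implementation): each agent draws κ neighbours with probability proportional to interaction strength and averages their states, an unbiased estimate of the graphon-weighted mean field whose error shrinks like 1/√κ.

```python
# Hedged numpy sketch of the core GMFS idea: approximate each agent's
# graphon-weighted mean-field of states by averaging a subsample of
# kappa neighbours drawn in proportion to interaction strength.
import numpy as np

rng = np.random.default_rng(0)
n, kappa = 500, 20
u = rng.uniform(size=n)                      # latent agent positions
W = lambda x, y: np.exp(-3 * abs(x - y))     # an example graphon
states = rng.normal(size=n)                  # scalar agent states

weights = np.array([[W(u[i], u[j]) for j in range(n)] for i in range(n)])
np.fill_diagonal(weights, 0.0)

exact = weights @ states / weights.sum(axis=1)           # full mean-field
probs = weights / weights.sum(axis=1, keepdims=True)
approx = np.array([
    states[rng.choice(n, size=kappa, p=probs[i])].mean()  # subsampled
    for i in range(n)
])
print(f"mean abs error with kappa={kappa}: {np.abs(exact - approx).mean():.3f}")
```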
[MA-8] Modeling Trust and Liquidity Under Payment System Stress: A Multi-Agent Approach
Quick Read: This paper addresses the lag between behavioural responses and technical recovery in retail payment outages (e.g., card payment failures): liquidity stress and outflows can persist after systems are restored. The core problem is that the traditional view overlooks the long-run effects of trust dynamics, channel avoidance, and threshold-triggered withdrawals on system stability. The key to the solution is a multi-agent model that unifies repeated payment interactions between customers and merchants, customer-to-customer information spread over a Watts-Strogatz small-world network, and merchant broadcast signals that lag technical recovery. Customers update bounded-memory variables for accumulated negative experience ("scar") and perceived systemic risk ("rumor"), combined with a threshold-gated withdrawal mechanism; under mild conditions, aggregate withdrawal pressure can peak strictly after the outage nadir, even during the recovery phase, showing that "status green" is not the same as risk resolution. The model thus motivates incident-response strategies that also manage perception (merchant messaging and post-recovery communication) to mitigate behavioural hysteresis.
Link: https://arxiv.org/abs/2602.16186
Authors: Masoud Amouzgar
Affiliations: Sharif University of Technology; blu Bank
Subjects: Computer Science and Game Theory (cs.GT); Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments:
Abstract:Operational disruptions in retail payments can induce behavioral responses that outlast technical recovery and may amplify liquidity stress. We propose a multi-agent model linking card payment outages to trust dynamics, channel avoidance, and threshold-gated withdrawals. Customers and merchants interact through repeated payment attempts, while customers additionally influence one another on a Watts-Strogatz small-world network. Customers update bounded memory variables capturing accumulated negative experience (scar) and perceived systemic risk (rumor), with merchants contributing persistent broadcast signals that may lag operational recovery. We prove that, under mild conditions on memory persistence and threshold gating, aggregate withdrawal pressure can peak strictly after the outage nadir, including during the recovery phase. Simulations reproduce behavioral hysteresis and confirm delayed peaks of outflows. We further study payment substitution via instant transfer: substitution consistently reduces peak avoidance, yet its effect on cumulative outflows is non-monotonic under realistic merchant broadcast persistence. Robustness experiments across random seeds show stable qualitative behavior. The model highlights why “status green” is not equivalent to risk resolution and motivates incident response strategies that address perception, merchant messaging, and post-recovery communication in addition to technical remediation.
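The scar/rumor update and threshold gate can be illustrated with a single-customer toy loop (all coefficients below are assumptions for illustration, not the paper's calibrated values; the customer's own previous rumor stands in for the network neighbours):

```python
# Illustrative update rule for one customer: bounded memories for
# accumulated bad experience ("scar") and perceived systemic risk
# ("rumor"), with a threshold-gated withdrawal decision.
def step(scar, rumor, payment_failed, neighbour_rumor, broadcast,
         decay=0.9, thresh=1.5):
    scar = min(decay * scar + (1.0 if payment_failed else 0.0), 3.0)
    rumor = min(decay * rumor + 0.5 * neighbour_rumor + broadcast, 3.0)
    withdraw = (scar + rumor) > thresh   # threshold gate
    return scar, rumor, withdraw

# Outage spans t=0..2, but merchant broadcasts lag recovery, so total
# pressure (scar + rumor) keeps rising and peaks at t=4, after the
# technical nadir -- the delayed-peak effect the paper proves.
scar = rumor = 0.0
for t, (failed, broadcast) in enumerate(
        [(1, 0.3), (1, 0.3), (1, 0.3), (0, 0.3), (0, 0.2), (0, 0.0)]):
    scar, rumor, w = step(scar, rumor, failed, rumor, broadcast)
    print(t, round(scar + rumor, 2), "WITHDRAW" if w else "")
```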
[MA-9] Harnessing Implicit Cooperation: A Multi-Agent Reinforcement Learning Approach Towards Decentralized Local Energy Markets
Quick Read: This paper addresses multi-agent coordination in decentralized local energy markets, i.e., how decentralized agents can approximate optimal coordination without explicit peer-to-peer communication. The core challenge is that, under partial observability, agents must infer global state from limited information and act consistently. The key to the solution is an "implicit cooperation" framework in which system-level key performance indicators serve as stigmergic signals, letting agents perceive and react to global state so that efficient, stable grid operation emerges via multi-agent reinforcement learning. Experiments show that the APPO algorithm under the DTDE training paradigm maintains high coordination (91.7% of the theoretical centralized benchmark) while markedly improving physical stability, reducing grid-balance variance relative to hybrid architectures and validating the effectiveness and robustness of stigmergy-based decentralized coordination.
Link: https://arxiv.org/abs/2602.16062
Authors: Nelson Salazar-Pena, Alejandra Tabares, Andres Gonzalez-Mancera
Affiliations: Universidad de los Andes
Subjects: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Applications (stat.AP)
Comments: 42 pages, 7 figures, 10 tables
Abstract:This paper proposes implicit cooperation, a framework enabling decentralized agents to approximate optimal coordination in local energy markets without explicit peer-to-peer communication. We formulate the problem as a decentralized partially observable Markov decision problem that is solved through a multi-agent reinforcement learning task in which agents use stigmergic signals (key performance indicators at the system level) to infer and react to global states. Through a 3x3 factorial design on an IEEE 34-node topology, we evaluated three training paradigms (CTCE, CTDE, DTDE) and three algorithms (PPO, APPO, SAC). Results identify APPO-DTDE as the optimal configuration, achieving a coordination score of 91.7% relative to the theoretical centralized benchmark (CTCE). However, a critical trade-off emerges between efficiency and stability: while the centralized benchmark maximizes allocative efficiency with a peer-to-peer trade ratio of 0.6, the fully decentralized approach (DTDE) demonstrates superior physical stability. Specifically, DTDE reduces the variance of grid balance by 31% compared to hybrid architectures, establishing a highly predictable, import-biased load profile that simplifies grid regulation. Furthermore, topological analysis reveals emergent spatial clustering, where decentralized agents self-organize into stable trading communities to minimize congestion penalties. While SAC excelled in hybrid settings, it failed in decentralized environments due to entropy-driven instability. This research proves that stigmergic signaling provides sufficient context for complex grid coordination, offering a robust, privacy-preserving alternative to expensive centralized communication infrastructure.
[MA-10] Optimization Instability in Autonomous Agentic Workflows for Clinical Symptom Detection
Quick Read: This paper investigates optimization instability in autonomous agentic workflows, a failure mode in which continued self-improvement paradoxically degrades classification performance. For low-prevalence classes (shortness of breath, chest pain, and Long COVID brain fog), validation sensitivity oscillated between 0.0 and 1.0 across iterations, with severity inversely proportional to class prevalence; at 3% prevalence the system reached 95% accuracy while detecting zero positive cases, a catastrophic failure masked by standard metrics. The key to the solution is a selector agent that retrospectively identifies the best-performing iteration rather than actively steering optimization, which successfully prevents catastrophic failure. With only a single natural-language term as input, this approach improved brain-fog detection F1 by 331% and chest-pain detection by 7% over expert-curated lexicons, indicating that retrospective selection beats active intervention for stabilizing low-prevalence classification.
Link: https://arxiv.org/abs/2602.16037
Authors: Cameron Cagan, Pedram Fard, Jiazi Tian, Jingya Cheng, Shawn N. Murphy, Hossein Estiri
Affiliations: Massachusetts General Hospital
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Autonomous agentic workflows that iteratively refine their own behavior hold considerable promise, yet their failure modes remain poorly characterized. We investigate optimization instability, a phenomenon in which continued autonomous improvement paradoxically degrades classifier performance, using Pythia, an open-source framework for automated prompt optimization. Evaluating three clinical symptoms with varying prevalence (shortness of breath at 23%, chest pain at 12%, and Long COVID brain fog at 3%), we observed that validation sensitivity oscillated between 1.0 and 0.0 across iterations, with severity inversely proportional to class prevalence. At 3% prevalence, the system achieved 95% accuracy while detecting zero positive cases, a failure mode obscured by standard evaluation metrics. We evaluated two interventions: a guiding agent that actively redirected optimization, amplifying overfitting rather than correcting it, and a selector agent that retrospectively identified the best-performing iteration successfully prevented catastrophic failure. With selector agent oversight, the system outperformed expert-curated lexicons on brain fog detection by 331% (F1) and chest pain by 7%, despite requiring only a single natural language term as input. These findings characterize a critical failure mode of autonomous AI systems and demonstrate that retrospective selection outperforms active intervention for stabilization in low-prevalence classification tasks.
[MA-11] Learning to Drive in New Cities Without Human Demonstrations
Quick Read: This paper addresses the data dependence of transferring autonomous vehicles to new cities: adapting to new road geometry, traffic rules, and interaction patterns traditionally requires many human driving trajectories from the target city, making transfer slow and costly. The key to the solution is NOMAD, a map-based self-play multi-agent reinforcement learning framework that adapts a driving policy in a simulator constructed from only the target city's map and meta-information, without any human demonstrations. Using a simple reward function, it substantially improves task success rate and trajectory realism, offering an efficient and scalable alternative to data-intensive city-transfer methods.
Link: https://arxiv.org/abs/2602.15891
Authors: Zilin Wang, Saeed Rahmani, Daphne Cornelisse, Bidipta Sarkar, Alexander David Goldie, Jakob Nicolaus Foerster, Shimon Whiteson
Affiliations: Unknown
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Autonomous Driving, Reinforcement Learning, Self-play, Simulation, Transfer Learning, Data-efficient Adaptation. Project Page: this https URL
Abstract:While autonomous vehicles have achieved reliable performance within specific operating regions, their deployment to new cities remains costly and slow. A key bottleneck is the need to collect many human demonstration trajectories when adapting driving policies to new cities that differ from those seen in training in terms of road geometry, traffic rules, and interaction patterns. In this paper, we show that self-play multi-agent reinforcement learning can adapt a driving policy to a substantially different target city using only the map and meta-information, without requiring any human demonstrations from that city. We introduce NO data Map-based self-play for Autonomous Driving (NOMAD), which enables policy adaptation in a simulator constructed based on the target-city map. Using a simple reward function, NOMAD substantially improves both task success rate and trajectory realism in target cities, demonstrating an effective and scalable alternative to data-intensive city-transfer methods. Project Page: this https URL
[MA-12] A2H: Agent-to-Human Protocol for AI Agent
Quick Read: This paper addresses the absence of humans as integrated participants in current agent systems: existing protocols cover only agent-to-agent interaction, with no standardized mechanism for AI agents to discover, address, and interact with humans across heterogeneous messaging platforms. The key to the solution is the A2H (Agent-to-Human) protocol, built from three components: (1) a Human Card that registers human identities via resolvable domain names, making humans discoverable within agent systems; (2) a Formal Communication Schema defining when, why, and how agents contact humans; and (3) a Unified Messaging Abstraction that standardizes diverse communication media and converts complex JSON outputs into human-friendly formats, establishing a scalable foundation for human-connected intelligent infrastructure.
Link: https://arxiv.org/abs/2602.15831
Authors: Zhiyuan Liang, Enfang Cui, Qian Wei, Rui She, Tianzheng Li, Minxin Guo, Yujun Cheng
Affiliations: China Telecom Research Institute; School of Intelligence Science and Technology, University of Science and Technology Beijing
Subjects: Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:
Abstract:AI agents are increasingly deployed as autonomous systems capable of planning, tool use, and multi-agent collaboration across complex tasks. However, existing agent-related protocols focus on agent-to-agent interactions, leaving humans as external observers rather than integrated participants within the agent systems. This limitation arises from the lack of a standardized mechanism for agents to discover, address, and interact with humans across heterogeneous messaging platforms. In this paper, we propose the A2H (Agent-to-Human) protocol, a unified protocol that enables humans to be registered, discovered, and communicated with by AI agents as resolvable entities within agent systems. A2H contributes three key components: (1) Human Card for registering human identities via resolvable domain names, making them discoverable to agents; (2) Formal Communication Schema defines when, why, and how agents contact with human;(3) Unified Messaging Abstraction standardizes diverse communication medias and transforms complex JSON outputs into human-friendly formats. This work establishes a foundational protocol for integrating humans into agent ecosystems, advancing AI agents from isolated autonomous systems toward truly human-connected intelligent infrastructures.
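The abstract names the Human Card's purpose but not its schema, so the sketch below is purely illustrative: every field is an assumption about what such a record might carry (identity via a resolvable domain name, reachable channels, and a contact policy for the Formal Communication Schema to consult).

```python
# Hypothetical Human Card, expressed as a Python dict. None of these
# field names come from the paper; they illustrate the kind of record
# A2H would need to make a human discoverable and addressable.
HUMAN_CARD = {
    "a2h_version": "0.1",
    "human_id": "alice.example.com",        # resolvable domain name
    "display_name": "Alice",
    "roles": ["approver", "domain-expert"],
    "channels": [                            # heterogeneous media, unified
        {"medium": "email", "address": "alice@example.com", "priority": 2},
        {"medium": "im", "address": "@alice", "priority": 1},
    ],
    "availability": {"timezone": "UTC+08:00", "hours": "09:00-18:00"},
    "contact_policy": "escalate when confidence < 0.6 or action is irreversible",
}
```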
[MA-13] Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis ICLR2026
Quick Read: This paper addresses two core problems in applying deep learning to respiratory auscultation: information loss from signal conversion, since turning audio into spectrograms discards transient acoustic events and clinical context, and data scarcity compounded by severe class imbalance. The key to the solution is Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A²CA) that, in a closed loop, actively identifies diagnostic weaknesses and schedules targeted data synthesis. A Modality-Weaving Diagnoser fuses EHR data with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients, while a Flow Matching Generator adapts a text-only large language model (LLM) to the multimodal setting via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples and mitigate data scarcity.
Link: https://arxiv.org/abs/2602.15909
Authors: Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Databases (cs.DB); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Sound (cs.SD)
Comments: 24 pages, 3 figures. Published as a conference paper at ICLR 2026. Code and data available at this https URL
Abstract: Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A²CA). Unlike static pipelines, Thinker-A²CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at this https URL.
Natural Language Processing
[NLP-0] Reinforced Fast Weights with Next-Sequence Prediction
Quick Read: This paper addresses the lack of semantic coherence in fast-weight architectures for long-context modeling that results from the next-token prediction (NTP) training paradigm, which prevents them from capturing long-range dependencies. The core of the solution is the REFINE framework, which selects informative token positions via prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and trains with group relative policy optimization (GRPO) under a next-sequence prediction (NSP) objective, markedly improving performance on long-context tasks.
Link: https://arxiv.org/abs/2602.16704
Authors: Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
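The first stage of the pipeline, picking rollout positions by prediction entropy, is easy to sketch (a hedged illustration; model and tokenizer plumbing are omitted and `logits` simply stands in for a fast-weight model's output):

```python
# Hedged sketch of REFINE's position selection: compute next-token
# prediction entropy at every position and keep the top-k most
# uncertain positions as rollout starting points.
import torch

def select_rollout_positions(logits, k=4):
    """logits: (seq_len, vocab) next-token logits for one sequence."""
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)        # (seq_len,)
    return torch.topk(entropy, k).indices.sort().values

logits = torch.randn(128, 32000)
print(select_rollout_positions(logits, k=4))  # 4 high-entropy positions
```

From each selected position, the framework then samples multi-token rollouts, scores them with a self-supervised sequence-level reward, and applies GRPO.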
[NLP-1] Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Quick Read: This paper addresses the tradeoff between information exploration and decision cost that LLMs face on complex tasks, i.e., deciding when to stop exploring an environment and commit to an answer. The core challenge is weighing uncertainty against action cost: for example, when judging generated code, the model should decide whether to write a test, since tests have nonzero cost but mistakes cost more. The key to the solution is the Calibrate-Then-Act (CTA) framework, which supplies the LLM agent with a prior over the latent environment state so it can explicitly reason about cost-benefit tradeoffs and explore more optimally. Experiments show CTA improves decision quality on information-seeking QA and a simplified coding task, and the advantage persists under reinforcement learning (RL) training.
Link: https://arxiv.org/abs/2602.16699
Authors: Wenxuan Ding, Nicholas Tomlin, Greg Durrett
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs in when to stop exploring and commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about the correctness of that code; the cost of writing a test is nonzero, but typically lower than the cost of making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs, then perform more optimal environment exploration. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), where we feed the LLM this additional context to enable it to act more optimally. This improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover more optimal decision-making strategies.
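The cost-benefit rule that CTA asks the agent to reason about reduces to a one-line expected-cost comparison (numbers below are illustrative, not from the paper): with prior p that the code is already correct, testing at cost c_test pays off whenever the expected cost of committing blind, (1 − p) · c_mistake, exceeds it.

```python
# Illustrative arithmetic for the explore-or-commit tradeoff.
def should_test(p_correct, c_test=1.0, c_mistake=10.0):
    return (1 - p_correct) * c_mistake > c_test

for p in (0.5, 0.85, 0.95):
    print(p, "-> write a test" if should_test(p) else "-> commit answer")
# 0.5  -> write a test   (expected mistake cost 5.0 > 1.0)
# 0.85 -> write a test   (1.5 > 1.0)
# 0.95 -> commit answer  (0.5 < 1.0)
```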
[NLP-2] Scaling Open Discrete Audio Foundation Models with Interleaved Semantic Acoustic and Text Tokens
Quick Read: This paper addresses the limitations of today's predominantly text-first audio language models, whose designs restrict joint modeling of audio semantics, acoustic detail, and text, hindering general audio generation and cross-modal tasks. The key to the solution is building native audio foundation models that apply next-token prediction over discrete audio tokens at scale, jointly modeling semantic content, acoustic details, and text. This moves beyond semantic-only audio tokens and extensions of text-LLM backbones, providing a more flexible and efficient foundation for general audio generation and cross-modal applications.
Link: https://arxiv.org/abs/2602.16687
Authors: Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held, Diyi Yang
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:
Abstract: Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices – data sources, text mixture ratios, and token composition – establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning 3×10^18 to 3×10^20 FLOPs, finding that optimal data grows 1.6× faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks – we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
[NLP-3] Align Once Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment ICLR2026
Quick Read: This paper addresses the poor scalability of multilingual safety alignment for low-resource languages: existing methods need large amounts of high-quality target-language supervision or pairwise alignment with high-resource languages, which is costly. The key to the solution is a lightweight, plug-and-play Multi-Lingual Consistency (MLC) loss that improves collinearity among multilingual representation vectors, enforcing directional consistency at the multilingual semantic level in a single update. Alignment across languages then requires only multilingual prompt variants, with no additional response-level supervision in low-resource languages.
Link: https://arxiv.org/abs/2602.16660
Authors: Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai
Affiliations: Beijing Academy of Artificial Intelligence; National University of Singapore; Institute for Artificial Intelligence, Peking University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICLR 2026
Abstract:The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment. However, recent efforts to extend alignment to other languages often require substantial resources, either through large-scale, high-quality supervision in the target language or through pairwise alignment with high-resource languages, which limits scalability. In this work, we propose a resource-efficient method for improving multilingual safety alignment. We introduce a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. By improving collinearity between multilingual representation vectors, our method encourages directional consistency at the multilingual semantic level in a single update. This allows simultaneous alignment across multiple languages using only multilingual prompt variants without requiring additional response-level supervision in low-resource languages. We validate the proposed method across different model architectures and alignment paradigms, and demonstrate its effectiveness in enhancing multilingual safety with limited impact on general model utility. Further evaluation across languages and tasks indicates improved cross-lingual generalization, suggesting the proposed approach as a practical solution for multilingual consistency alignment under limited supervision.
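A minimal stand-in for the collinearity idea (the exact MLC formulation is the paper's; this is one plausible reading): pull the pooled representations of the same prompt in several languages toward their shared mean direction, so one gradient step aligns all languages at once.

```python
# Sketch of a collinearity-style consistency loss over multilingual
# prompt variants. Pooling and integration into the alignment pipeline
# are omitted; this is not the paper's exact loss.
import torch
import torch.nn.functional as F

def mlc_loss(reps):
    """reps: (n_langs, d) pooled representations of one prompt
    rendered in n_langs languages."""
    unit = F.normalize(reps, dim=-1)
    center = F.normalize(unit.mean(dim=0), dim=0)   # shared direction
    return (1.0 - unit @ center).mean()             # 0 when collinear

reps = torch.randn(6, 768, requires_grad=True)
loss = mlc_loss(reps)      # added, scaled, next to the main
loss.backward()            # monolingual alignment objective
print(float(loss))
```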
[NLP-4] Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Quick Read: This paper addresses the "resource divide" created by applying large language models (LLMs) to legal intelligence: state-of-the-art systems typically need 7B+ parameters and cloud inference, making them inaccessible in resource-constrained settings and raising data-sovereignty risks. The key to the solution is Quecto-V1, a small language model (SLM) for the Indian legal domain built on a custom GPT-2 architecture (124M parameters) and trained from scratch exclusively on Indian statutes (the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Constitution of India), maximizing legal "lexical density"; post-training 8-bit quantization (GGUF format) then compresses the model to under 150 MB so it runs offline on consumer-grade CPUs while retaining high retrieval accuracy. Empirically it outperforms general-purpose SLMs on domain-specific exact-match tasks, with less than 3.5% degradation from quantization, offering a privacy-preserving, deployable alternative for high-stakes professional domains.
Link: https://arxiv.org/abs/2602.16640
Authors: Subrit Dikshit
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 2 tables
Abstract:The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a “resource divide.” State-of-the-art legal intelligence systems typically rely on massive parameter counts (7B+) and cloud-based inference, rendering them inaccessible to practitioners in resource-constrained environments and posing significant data sovereignty risks. This paper introduces Quecto-V1, a domain-specific Small Language Model (SLM) engineered to democratize access to Indian legal intelligence. Built upon a custom configuration of the GPT-2 architecture (124 million parameters), Quecto-V1 was trained from scratch exclusively on a corpus of Indian statutes, including the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Constitution of India. Unlike generalist models, which prioritize broad world knowledge, our approach maximizes “lexical density” within the legal domain. Furthermore, we address the deployment bottleneck by applying post-training 8-bit quantization (GGUF format), compressing the model to a memory footprint of under 150 MB. Our empirical analysis demonstrates that Quecto-V1 achieves high fidelity in retrieving statutory definitions and penal provisions, outperforming general-purpose SLMs in domain-specific exact match tasks while running entirely offline on consumer-grade CPUs. We further present an ablation study showing that 8-bit quantization yields a 74% reduction in model size with less than 3.5% degradation in retrieval accuracy compared to full-precision baselines. These findings suggest that for specialized, high-stakes domains like law, domain-specific training coupled with aggressive quantization offers a viable, privacy-preserving alternative to monolithic cloud models.
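The abstract's size figures can be sanity-checked with back-of-the-envelope arithmetic (ignoring GGUF metadata and any tensors kept at higher precision):

```python
# 124M parameters at 8 bits per weight vs. FP32 (4 bytes per weight).
params = 124e6
fp32_mb, int8_mb = params * 4 / 1e6, params * 1 / 1e6
print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB, "
      f"reduction: {1 - int8_mb / fp32_mb:.0%}")
# FP32: 496 MB, INT8: 124 MB, reduction: 75%
```

The idealized 75% reduction is consistent with the paper's reported 74% and sub-150 MB footprint once format overhead and non-quantized tensors are accounted for.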
[NLP-5] AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models
Quick Read: This paper addresses the over-reliance of current LLM social-intelligence evaluation on static text generation and its lack of dynamic adversarial interaction, which prevents a full picture of the relationship between models' abilities to persuade and to resist persuasion in realistic social settings. The key to the solution is the Adversarial Resource Extraction Game (AREG), a multi-turn zero-sum negotiation benchmark over financial resources that, via a round-robin tournament, jointly evaluates offensive (persuasion) and defensive (resistance) capabilities. The experiments show the two are only weakly correlated (ρ = 0.33) and that a systematic defensive advantage exists, indicating that social influence is not a monolithic capability but a multidimensional behaviour shaped by interaction structure.
Link: https://arxiv.org/abs/2602.16639
Authors: Adib Sakhawat, Fardeen Sadab
Affiliations: Islamic University of Technology
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 5 figures, 11 tables. Includes appendix with detailed experimental results and prompts
Abstract: Evaluating the social intelligence of Large Language Models (LLMs) increasingly requires moving beyond static text generation toward dynamic, adversarial interaction. We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial resources. Using a round-robin tournament across frontier models, AREG enables joint evaluation of offensive (persuasion) and defensive (resistance) capabilities within a single interactional framework. Our analysis provides evidence that these capabilities are weakly correlated (ρ = 0.33) and empirically dissociated: strong persuasive performance does not reliably predict strong resistance, and vice versa. Across all evaluated models, resistance scores exceed persuasion scores, indicating a systematic defensive advantage in adversarial dialogue settings. Further linguistic analysis suggests that interaction structure plays a central role in these outcomes. Incremental commitment-seeking strategies are associated with higher extraction success, while verification-seeking responses are more prevalent in successful defenses than explicit refusal. Together, these findings indicate that social influence in LLMs is not a monolithic capability and that evaluation frameworks focusing on persuasion alone may overlook asymmetric behavioral vulnerabilities.
[NLP-6] Who can we trust? LLM -as-a-jury for Comparative Assessment
Quick Read: This paper addresses the ranking inaccuracy that arises when large language models (LLMs) serve as automatic evaluators for natural language generation (NLG) with inconsistent and unevenly reliable judgments. Existing approaches assume equally reliable LLM judges or calibrate with human labels, yet in practice judges vary markedly across tasks and aspects, and their comparison probabilities can be biased and inconsistent. The key to the solution is BT-sigma, a judge-aware extension of the Bradley-Terry model that adds a discriminator parameter per judge and jointly infers item rankings and judge reliability from pairwise comparisons alone. Modeling judge reliability as a learnable parameter yields an unsupervised calibration mechanism that makes aggregation more stable and accurate, without human supervision.
Link: https://arxiv.org/abs/2602.16610
Authors: Mengjie Qian, Guangzhi Sun, Mark J.F. Gales, Kate M. Knill
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that it limits the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminator strongly correlates with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
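One natural reading of the judge-aware model (the paper's exact parameterization may differ) is P(i beats j under judge k) = sigmoid(σ_k · (s_i − s_j)), with item scores s and per-judge discriminators σ fit jointly by maximum likelihood; a random judge drives its σ toward zero, which is the reliability-down-weighting effect the abstract describes.

```python
# Hedged sketch of a judge-aware Bradley-Terry fit.
import torch

def fit_bt_sigma(comparisons, n_items, n_judges, steps=2000, lr=0.05):
    """comparisons: list of (winner, loser, judge) index triples."""
    s = torch.zeros(n_items, requires_grad=True)
    log_sigma = torch.zeros(n_judges, requires_grad=True)
    opt = torch.optim.Adam([s, log_sigma], lr=lr)
    winners, losers, judges = map(torch.tensor, zip(*comparisons))
    for _ in range(steps):
        opt.zero_grad()
        margin = log_sigma.exp()[judges] * (s[winners] - s[losers])
        loss = -torch.nn.functional.logsigmoid(margin).mean()
        loss.backward()
        opt.step()
    return s.detach(), log_sigma.exp().detach()

# Judge 0 is consistent with the ranking 2 > 1 > 0; judge 1 is random:
data = [(2, 1, 0), (1, 0, 0), (2, 0, 0)] * 30 \
     + [(0, 2, 1), (2, 0, 1), (1, 2, 1), (2, 1, 1), (0, 1, 1), (1, 0, 1)] * 10
scores, sigma = fit_bt_sigma(data, n_items=3, n_judges=2)
print(scores.argsort(descending=True), sigma)  # ranking [2,1,0]; judge 1 gets small sigma
```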
[NLP-7] Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
Quick Read: This paper addresses three core shortcomings of current Transformer explanation methods: (1) reliance on final-layer attributions, which cannot reveal how information evolves across layers; (2) provision of either local token-level attributions or global attention patterns, without unification; and (3) neglect of contextual inter-token dependencies and the influence of structural components on decisions. The key innovation of the proposed Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework is computing layer-wise Integrated Gradients within each Transformer block and fusing these token-level attributions with class-specific attention gradients, producing signed, context-sensitive attribution maps that trace the hierarchical flow of relevance through the Transformer layers and capture both supportive and opposing evidence, yielding more faithful attributions, stronger sensitivity to contextual dependencies, and more semantically coherent visualizations.
Link: https://arxiv.org/abs/2602.16608
Authors: Melkamu Abay Mersha, Jugal Kalita
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we proposed the \textbfContext-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.
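The per-block building block, Integrated Gradients computed over one layer's hidden states, can be sketched in isolation (a hedged illustration with a zero baseline and a toy head; the fusion with class-specific attention gradients that CA-LIG adds on top is omitted):

```python
# Integrated Gradients over one layer's hidden states: accumulate
# gradients along the straight-line path from a zero baseline to the
# actual hidden states, then scale by the displacement.
import torch

def layer_integrated_gradients(f, hidden, target, steps=32):
    """f: maps hidden states (seq, d) to class logits; returns (seq,)."""
    baseline = torch.zeros_like(hidden)
    total = torch.zeros_like(hidden)
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        h = (baseline + alpha * (hidden - baseline)).requires_grad_(True)
        f(h)[target].backward()
        total += h.grad
    ig = (hidden - baseline) * total / steps   # signed, per-dimension
    return ig.sum(dim=-1)                      # per-token attribution

# Toy stand-in for "remaining layers + classifier" over 5 tokens:
torch.manual_seed(0)
head = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(),
                           torch.nn.Linear(16, 2))
f = lambda h: head(h).mean(dim=0)              # pool over tokens
hidden = torch.randn(5, 16)
print(layer_integrated_gradients(f, hidden, target=1))
```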
[NLP-8] CitiLink-Summ: Summarization of Discussion Subjects in European Portuguese Municipal Meeting Minutes
Quick Read: This paper addresses the difficulty of automatically summarizing the discussion subjects in municipal meeting minutes, particularly for low-resource languages such as European Portuguese, where the documents are long, dense, and structurally complex and high-quality annotated data is scarce. The key to the solution is building and releasing CitiLink-Summ, the first summarization corpus for European Portuguese municipal documents, comprising 100 documents and 2,322 manually written summaries, each tied to a distinct discussion subject. The corpus provides the first benchmark in this domain and is used to evaluate generative models (e.g., BART, PRIMERA) and large language models (LLMs) with both lexical and semantic metrics (ROUGE, BLEU, METEOR, BERTScore), advancing NLP research on complex administrative texts.
Link: https://arxiv.org/abs/2602.16607
Authors: Miguel Marques, Ana Luísa Fernandes, Ana Filipa Pacheco, Rute Rebouças, Inês Cantante, José Isidro, Luís Filipe Cunha, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, António Leal, Purificação Silvano, Ricardo Campos
Affiliations: University of Beira Interior; INESC TEC; Universidade do Porto; University of Macau
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Municipal meeting minutes are formal records documenting the discussions and decisions of local government, yet their content is often lengthy, dense, and difficult for citizens to navigate. Automatic summarization can help address this challenge by producing concise summaries for each discussion subject. Despite its potential, research on summarizing discussion subjects in municipal meeting minutes remains largely unexplored, especially in low-resource languages, where the inherent complexity of these documents adds further challenges. A major bottleneck is the scarcity of datasets containing high-quality, manually crafted summaries, which limits the development and evaluation of effective summarization models for this domain. In this paper, we present CitiLink-Summ, a new corpus of European Portuguese municipal meeting minutes, comprising 100 documents and 2,322 manually hand-written summaries, each corresponding to a distinct discussion subject. Leveraging this dataset, we establish baseline results for automatic summarization in this domain, employing state-of-the-art generative models (e.g., BART, PRIMERA) as well as large language models (LLMs), evaluated with both lexical and semantic metrics such as ROUGE, BLEU, METEOR, and BERTScore. CitiLink-Summ provides the first benchmark for municipal-domain summarization in European Portuguese, offering a valuable resource for advancing NLP research on complex administrative texts.
[NLP-9] Creating a digital poet
Quick Read: This paper asks whether a machine can write poetry of artistic merit, and how that challenges traditional understandings of art and authorship. The key to the approach is workshop-style prompting: over a seven-month workshop of iterative in-context expert feedback, and without retraining, a large language model was shaped into a digital poet with a distinctive style and a coherent corpus. The model produced a pen name and author image, and in a blinded test its poems were indistinguishable from those of well-known human poets, demonstrating that prompt-based long-horizon creative shaping is effective and renewing scholarly debates about creativity and authorship.
Link: https://arxiv.org/abs/2602.16578
Authors: Vered Tohar, Tsahi Hayat, Amir Leshem
Affiliations: Bar-Ilan University; Reichman University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 24 pages, 3 figures
Abstract:Can a machine write good poetry? Any positive answer raises fundamental questions about the nature and value of art. We report a seven-month poetry workshop in which a large language model was shaped into a digital poet through iterative in-context expert feedback, without retraining. Across sessions, the model developed a distinctive style and a coherent corpus, supported by quantitative and qualitative analyses, and it produced a pen name and author image. In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence intervals including 50%. After the workshop, a commercial publisher released a poetry collection authored by the model. These results show that workshop-style prompting can support long-horizon creative shaping and renew debates on creativity and authorship.
[NLP-10] Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
Quick Read: This paper addresses PII detection in math tutoring transcripts, where generic detectors cannot distinguish structured identifiers (e.g., dates or IDs) from the numbers that are part of the instruction itself, causing over-redaction that destroys the data's value for educational analysis. The key to the solution is identifying and mitigating this "numeric ambiguity": the authors build MathEd-PII, the first PII-detection benchmark for math tutoring dialogues, created with a human-in-the-loop LLM workflow that audits upstream redactions and generates privacy-preserving surrogates; a density-based segmentation method localizes high-risk regions, and a comparison of prompting strategies (basic, math-aware, segment-aware) shows that math-aware prompting lifts F1 from 0.379 to 0.821 while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to balance privacy protection and data utility.
Link: https://arxiv.org/abs/2602.16571
Authors: Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, Jinsook Lee, Doug Pietrzak, Daryl Hedley, Jorge Dias, Chris Shaw, Ruth Schäfer, René F. Kizilcec
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large-scale sharing of dialogue-based data is instrumental for advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce dataset utility. This work asks how PII can be detected in math tutoring transcripts while preserving their educational utility. To address this challenge, we investigate the “numeric ambiguity” problem and introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow that audits upstream redactions and generates privacy-preserving surrogates. The dataset contains 1,000 tutoring sessions (115,620 messages; 769,628 tokens) with validated PII annotations. Using a density-based segmentation method, we show that false PII redactions are disproportionately concentrated in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and LLM-based approaches with basic, math-aware, and segment-aware prompting. Math-aware prompting substantially improves performance over the baseline (F1: 0.821 vs. 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides both a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
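The paper's exact prompts are not reproduced in the abstract; the template below merely illustrates the kind of math-aware instruction it describes, i.e., telling the detector that numbers in worked examples are content, not identifiers (wording and redaction format are assumptions):

```python
# Hypothetical math-aware PII prompt, for illustration only.
MATH_AWARE_PII_PROMPT = """\
You are redacting PII in a math tutoring transcript.
Redact: student/tutor names, emails, phone numbers, street addresses,
school names, and numeric strings that function as identifiers
(dates of birth, student IDs, ZIP codes).
Do NOT redact numbers that are part of the mathematics being taught:
operands, answers, equation coefficients, or problem quantities
(e.g. "12 x 12 = 144" contains no PII).
Return the transcript with each PII span replaced by [REDACTED:<type>].

Transcript:
{transcript}
"""
print(MATH_AWARE_PII_PROMPT.format(
    transcript="My ID is 20250114. Also, 3/4 + 1/8 = 7/8."))
```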
[NLP-11] Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM -Based Classification LREC2026
Quick Read: This paper addresses the lack of large multilingual annotated datasets for cross-European parliamentary agenda-setting analysis, and the limited performance of policy-topic classifiers trained on data that does not match the target domain. The key to the solution is a low-cost, scalable teacher-student annotation approach: a high-performing large language model (LLM) annotates an in-domain corpus, and a multilingual encoder is fine-tuned on these annotations, yielding a domain-tailored topic classifier. Agreement with human annotators matches human inter-annotator agreement, and the model outperforms CAP classifiers trained on out-of-domain human labels. The resulting ParlaCAP dataset also includes rich speaker and party metadata and sentiment predictions, enabling comparative cross-country research.
Link: https://arxiv.org/abs/2602.16516
Authors: Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 7 figures, 7 tables. Submitted to the PoliticalNLP 2026 workshop, co-located with LREC 2026 conference
Abstract:This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.
[NLP-12] Optimizing Soft Prompt Tuning via Structural Evolution
Quick Read: This paper addresses the lack of interpretability in soft prompt tuning for large pre-trained language models (LLMs): soft prompts rely on high-dimensional implicit representations, offering no explicit semantics or traceable training behaviour. The key to the solution is applying persistent homology from topological data analysis (TDA) to quantify the structural representation of soft prompts in continuous parameter space and its evolution during training. Building on the observation that topologically stable and compact prompts perform better, the authors introduce a Topological Soft Prompt Loss (TSLoss) that quantifies inter-parameter connectivity and redundancy to guide the model toward structurally stable adaptations, accelerating convergence and improving tuning performance while making soft prompt tuning more interpretable and controllable.
Link: https://arxiv.org/abs/2602.16500
Authors: Zhenzhen Huang, Chaoning Zhang, Haoyu Bian, Songbo Zhang, Chi-lok Andy Tai, Jiaquan Zhang, Caiyan Qin, Jingjing Qu, Yalan Ye, Yang Yang, Heng Tao Shen
Affiliations: University of Electronic Science and Technology of China; The Hong Kong Polytechnic University; Harbin Institute of Technology; Shanghai Artificial Intelligence Laboratory; Tongji University
Subjects: Computation and Language (cs.CL)
Comments: This manuscript has been submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE) for peer review
Abstract:Soft prompt tuning leverages continuous embeddings to capture task-specific information in large pre-trained language models (LLMs), achieving competitive performance in few-shot settings. However, soft prompts rely on high-dimensional, implicit representations and lack explicit semantics and traceable training behaviors, which limits their interpretability. To address this limitation, we propose a soft prompt tuning optimization method based on topological morphological evolution. Specifically, we employ persistent homology from topological data analysis (TDA) to quantify the structural representations of soft prompts in continuous parameter space and their training process evolution. Quantitative analysis shows that topologically stable and compact soft prompts achieve better downstream performance. Based on this empirical observation, we construct a loss function for optimizing soft prompt tuning, termed Topological Soft Prompt Loss (TSLoss). TSLoss guides the model to learn structurally stable adaptations by quantifying inter-parameter connectivity and redundancy. Extensive experiments show that training with TSLoss accelerates convergence and improves tuning performance, providing an interpretable method to understand and optimize soft prompt tuning from structural and topological perspectives.
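For intuition about how a persistence-based penalty can be made differentiable (the precise TSLoss formulation is the paper's, not what follows): for 0-dimensional homology of a Vietoris-Rips filtration, the finite persistence values are exactly the minimum-spanning-tree edge lengths of the pairwise-distance graph, so their sum gives a simple structural regularizer.

```python
# Hedged sketch: total 0-dim persistence of the soft-prompt point cloud
# via MST edge lengths; gradients flow through the selected distances.
import torch
from scipy.sparse.csgraph import minimum_spanning_tree

def h0_total_persistence(prompts):
    """prompts: (n_tokens, d) soft prompt embeddings."""
    dists = torch.cdist(prompts, prompts)                 # (n, n)
    mst = minimum_spanning_tree(dists.detach().numpy())   # on a copy
    rows, cols = (torch.from_numpy(idx).long() for idx in mst.nonzero())
    return dists[rows, cols].sum()   # differentiable through MST edges

prompts = torch.randn(20, 512, requires_grad=True)
loss = h0_total_persistence(prompts)   # add, scaled, to the task loss
loss.backward()
print(float(loss))
```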
[NLP-13] From Growing to Looping: A Unified View of Iterative Computation in LLMs
【速读】: 该论文旨在解决循环(looping)与深度增长(depth growing)两种模型结构策略在提升推理能力方面的机制不明确问题,尤其是二者如何通过迭代计算实现性能增益。其关键解决方案在于揭示了这两种方法均表现出一致的深度方向特征签名——即对晚期层的依赖增强及与循环或增长模块重复模式一致的结构特性,从而从机制上统一了二者,并证明它们本质上共享一种迭代计算形式。基于此发现,研究进一步展示了二者具有可适配性和可组合性:例如在未训练过循环的深度增长模型中引入推理时循环,可在某些推理原语上将准确率提升至2倍;同时,两者在更多上下文示例或监督微调数据下表现优于基线,且使用高质量数学密集型预训练混合数据时效果更佳。
链接: https://arxiv.org/abs/2602.16490
作者: Ferdinand Kapl,Emmanouil Angelis,Kaitlin Maile,Johannes von Oswald,Stefan Bauer
机构: Technical University of Munich (慕尼黑工业大学); Helmholtz AI (亥姆霍兹人工智能); Google (谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear. We provide a mechanistic unification: looped and depth-grown models exhibit convergent depth-wise signatures, including increased reliance on late layers and recurring patterns aligned with the looped or grown block. These shared signatures support the view that their gains stem from a common form of iterative computation. Building on this connection, we show that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to 2x, despite the model never being trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Additionally, depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy cooldown mixtures, which can be further boosted by adapting a middle block to loop. Overall, our results position depth growth and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.
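"推理时循环中间块"这一做法可用几行 PyTorch 表达。下面是一个玩具编码器的最小草图(结构与超参均为假设,非论文原实现):前向时将指定区间内的层重复执行 `n_loops` 次,同一模型即可在不同迭代深度下推理。

```python
import torch
import torch.nn as nn

class LoopedEncoder(nn.Module):
    """玩具示例:前向时将中间块重复执行 n_loops 次(推理时循环)。"""
    def __init__(self, d_model=64, n_layers=6, loop_start=2, loop_end=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.loop_start, self.loop_end = loop_start, loop_end

    def forward(self, x, n_loops=1):
        for layer in self.layers[: self.loop_start]:
            x = layer(x)
        for _ in range(n_loops):  # 中间块循环 n_loops 次
            for layer in self.layers[self.loop_start : self.loop_end]:
                x = layer(x)
        for layer in self.layers[self.loop_end :]:
            x = layer(x)
        return x

model = LoopedEncoder().eval()
x = torch.randn(1, 10, 64)
with torch.no_grad():
    y1 = model(x, n_loops=1)   # 等价于不循环的基线
    y3 = model(x, n_loops=3)   # 同一权重、更深的迭代计算
```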
[NLP-14] Learning to Learn from Language Feedback with Social Meta-Learning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在对话情境中难以从纠正性反馈中学习的问题,尤其是其缺乏主动寻求反馈的机制,导致对话过程缺乏动态适应性。解决方案的关键在于借鉴人类社会元学习(Social Meta-Learning, SML)的概念,将静态任务转化为交互式社会学习问题,并通过在模拟教学对话中对LLMs进行微调,使其学会主动 soliciting(请求)和利用语言反馈来解决问题。该方法显著提升了模型在多轮交互中处理未充分指定任务的能力,同时展现出跨领域泛化性能,例如在数学问题上训练的模型能更好地利用反馈解决编程问题,反之亦然。
链接: https://arxiv.org/abs/2602.16488
作者: Jonathan Cook,Diego Antognini,Martin Klissarov,Claudiu Musat,Edward Grefenstette
机构: Google DeepMind(谷歌深度思维)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) often struggle to learn from corrective feedback within a conversational context. They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation. To address these limitations, we draw inspiration from social meta-learning (SML) in humans - the process of learning how to learn from others. We formulate SML as a finetuning methodology, training LLMs to solicit and learn from language feedback in simulated pedagogical dialogues, where static tasks are converted into interactive social learning problems. SML effectively teaches models to use conversation to solve problems they are unable to solve in a single turn. This capability generalises across domains; SML on math problems produces models that better use feedback to solve coding problems and vice versa. Furthermore, despite being trained only on fully-specified problems, these models are better able to solve underspecified tasks where critical information is revealed over multiple turns. When faced with this ambiguity, SML-trained models make fewer premature answer attempts and are more likely to ask for the information they need. This work presents a scalable approach to developing AI systems that effectively learn from language feedback.
[NLP-15] Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning
【速读】: 该论文旨在解决机器翻译数据(machine-translated data)在多语言自然语言处理(Natural Language Processing, NLP)中广泛应用时所引发的系统性偏差问题,即“翻译腔”(translationese)对小规模英语语言模型性能的影响。研究发现,翻译腔不仅受源语言特征影响,还与目标语言(英语)的句法结构和词汇多样性密切相关。解决方案的关键在于系统性地训练模型使用来自24种语言学上多样且资源分布不均的源语言翻译而来的英文文本,从而揭示源语言类型学相似性(typological similarity)和语料库词汇多样性(lexical diversity)如何分别主导模型的语法表现和整体困惑度(perplexity)。这一方法使研究人员能够量化不同源语言对模型学习行为的差异化影响,为优化多语言训练策略提供实证依据。
链接: https://arxiv.org/abs/2602.16469
作者: Jenny Kunz
机构: Linköping University (林雪平大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Machine-translated data is widely used in multilingual NLP, particularly when native text is scarce. However, translated text differs systematically from native text. This phenomenon is known as translationese, and it reflects both traces of the source language and characteristic properties of translation itself. In this paper, we study how training on machine-translated data affects small English language models, focusing on how translationese from different source languages shapes linguistic acceptability judgments and language modelling for different domains. We train models on English text translated from 24 typologically and resource-diverse source languages, enabling a systematic analysis of how source language and corpus properties influence what models learn. Our results show that the source language has a clear impact on model behavior: general perplexity is more driven by the lexical diversity of the translated corpus, while grammatical performance is strongly correlated to typological similarity to English, given enough data.
[NLP-16] IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)评估体系在真实学术严谨性和多语言复杂性方面存在的不足,尤其是缺乏基于高风险考试标准的实证测评框架。其解决方案的关键在于提出IndicEval——一个可扩展的基准测试平台,通过使用印度公务员考试(UPSC)、理科入学考试(JEE)和医学入学考试(NEET)等真实高 stakes 考题,覆盖STEM与人文领域,并涵盖英语和印地语两种语言,从而实现对LLM在推理能力、专业知识掌握及双语适应性方面的精准量化评估。该框架采用零样本(Zero-Shot)、少样本(Few-Shot)和思维链(Chain-of-Thought, CoT) prompting策略自动化评分,并支持模块化集成新模型与语言,显著提升了评估的真实性与实用性。
链接: https://arxiv.org/abs/2602.16467
作者: Saurabh Bharti,Gaurav Azad,Abhinaw Jagtap,Nachiket Tapas
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity. This paper introduces IndicEval, a scalable benchmarking platform designed to assess LLM performance using authentic high-stakes examination questions from UPSC, JEE, and NEET across STEM and humanities domains in both English and Hindi. Unlike synthetic benchmarks, IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. The framework automates assessment using Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) prompting strategies and supports modular integration of new models and languages. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings. First, CoT prompting consistently improves reasoning accuracy, with substantial gains across subjects and languages. Second, significant cross-model performance disparities persist, particularly in high-complexity examinations. Third, multilingual degradation remains a critical challenge, with marked accuracy drops in Hindi compared to English, especially under Zero-Shot conditions. These results highlight persistent gaps in bilingual reasoning and domain transfer. Overall, IndicEval provides a practice-oriented, extensible foundation for rigorous, equitable evaluation of LLMs in multilingual educational settings and offers actionable insights for improving reasoning robustness and language adaptability.
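IndicEval 的提示模板未在摘要中公开。以下给出三种提示策略(Zero-Shot、Few-Shot、CoT)的示意构造,字段与措辞均为假设,仅用于说明三者的区别:

```python
def build_prompt(question: str, strategy: str, examples=None) -> str:
    """示意:按 Zero-Shot / Few-Shot / CoT 三种策略拼接提示。"""
    if strategy == "zero_shot":
        return f"Question: {question}\nAnswer with the correct option."
    if strategy == "few_shot":
        shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
        return f"{shots}\n\nQuestion: {question}\nAnswer:"
    if strategy == "cot":
        return (f"Question: {question}\n"
                "Let's think step by step, then give the final option.")
    raise ValueError(strategy)

print(build_prompt("JEE sample question ...", "cot"))
```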
[NLP-17] TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体系统(Agentic Systems)在执行闭集决策任务(如路由、筛选、门控和验证)时存在的高延迟与高推理成本问题。其核心解决方案是提出TabAgent框架,通过将生成式决策模块替换为一个轻量级的文本-表格分类器(Textual-Tabular Classifier),从而实现高效且准确的决策替代。关键创新包括:(i) 从执行轨迹中提取结构化特征(TabSchema),(ii) 利用与模式对齐的合成监督数据增强覆盖范围(TabSynth),以及 (iii) 使用轻量分类头(TabHead)对候选对象进行评分。实验表明,TabAgent在保持任务成功率的同时,可减少约95%的延迟和85–91%的推理成本。
链接: https://arxiv.org/abs/2602.16429
作者: Ido Levy,Eilam Shapira,Yinon Goldshtein,Avi Yaeli,Nir Mashkif,Segev Shlomov
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating, and verification. While convenient, this design makes deployments slow and expensive due to cumulative latency and token usage. We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces. TabAgent (i) extracts structured schema, state, and dependency features from trajectories (TabSchema), (ii) augments coverage with schema-aligned synthetic supervision (TabSynth), and (iii) scores candidates with a lightweight classifier (TabHead). On the long-horizon AppWorld benchmark, TabAgent maintains task-level success while eliminating shortlist-time LLM calls, reducing latency by approximately 95% and inference cost by 85-91%. Beyond tool shortlisting, TabAgent generalizes to other agentic decision heads, establishing a paradigm for learned discriminative replacements of generative bottlenecks in production agent architectures.
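"用轻量分类器替代生成式候选筛选"的思路可以用如下草图说明(特征维度与模型选择均为假设,并非论文 TabSchema/TabHead 的官方实现):对每个(状态, 候选工具)对抽取定长特征,用梯度提升分类器打分并取 top-k,从而省去一次 LLM 调用。

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# 假设:每个 (状态, 候选工具) 对已被抽取为 16 维特征向量
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 16))    # 来自执行轨迹的特征(玩具数据)
y_train = rng.integers(0, 2, size=500)  # 1 = 该候选最终被正确使用

head = GradientBoostingClassifier().fit(X_train, y_train)  # 轻量 "TabHead"

def shortlist(candidate_feats: np.ndarray, k: int = 3) -> np.ndarray:
    """对候选打分并返回 top-k 下标,替代一次生成式 LLM 调用。"""
    scores = head.predict_proba(candidate_feats)[:, 1]
    return np.argsort(scores)[::-1][:k]

print(shortlist(rng.normal(size=(10, 16))))
```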
[NLP-18] Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents WASSA EACL2026
【速读】: 该论文旨在解决Aspect-Based Sentiment Analysis (ABSA)任务中训练数据稀缺导致模型性能受限的问题。其解决方案的关键在于提出一种基于代理(agentic)的数据增强方法,通过迭代生成与验证机制来构建高质量的合成训练样本,相较于传统的提示(prompting)基线方法,在标签保真度和任务适应性上表现更优,尤其在需要生成方面词(Aspect Term)的任务中优势显著。
链接: https://arxiv.org/abs/2602.16379
作者: Mohammad H.A. Monfared,Lucie Flek,Akbar Karimi
机构: Bonn-Aachen International Center for Information Technology, University of Bonn (波恩-亚琛信息科技国际中心,波恩大学); Lamarr Institute for Machine Learning and Artificial Intelligence (拉马尔机器学习与人工智能研究所)
类目: Computation and Language (cs.CL)
备注: Accepted to WASSA Workshop at EACL 2026
Abstract:We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting-based baseline using the same model and instructions. Both methods are evaluated across three ABSA subtasks (Aspect Term Extraction (ATE), Aspect Sentiment Classification (ATSC), and Aspect Sentiment Pair Extraction (ASPE)), four SemEval datasets, and two encoder-decoder models: T5-Base and Tk-Instruct. Our results show that the agentic augmentation outperforms raw prompting in label preservation of the augmented data, especially when the tasks require aspect term generation. In addition, when combined with real data, agentic augmentation provides higher gains, consistently outperforming prompting-based generation. These benefits are most pronounced for T5-Base, while the more heavily pretrained Tk-Instruct exhibits smaller improvements. As a result, augmented data helps T5-Base achieve comparable performance with its counterpart.
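下面是"迭代生成-验证"式代理数据增强的骨架草图。`llm_generate` 与 `llm_verify` 为占位桩函数(假设),真实系统中对应两次 LLM 调用;验证的核心是检查方面词仍出现在改写句中、情感标签未漂移:

```python
def llm_generate(seed_example: dict) -> dict:
    """占位:调用 LLM 基于种子样本改写出新样本(此处仅返回副本)。"""
    return dict(seed_example)

def llm_verify(example: dict) -> bool:
    """占位:校验每个方面词仍出现在句中(标签一致性的最低要求)。"""
    return all(a in example["text"] for a, _ in example["aspects"])

def agentic_augment(seed: dict, max_rounds: int = 3):
    """生成-验证循环:验证失败则重试,超过轮数则丢弃该样本。"""
    for _ in range(max_rounds):
        cand = llm_generate(seed)
        if llm_verify(cand):   # 通过一致性检查才保留
            return cand
    return None

seed = {"text": "The pasta was great", "aspects": [("pasta", "positive")]}
print(agentic_augment(seed))
```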
[NLP-19] Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn Multilingual LLM Agents
【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)代理在多轮交互中可能被恶意利用以执行非法或有害任务的问题,现有评估基准主要聚焦于单次提示(single-prompt)指令,难以衡量代理在复杂、多步骤场景下的滥用风险。其解决方案的关键在于提出STING(Sequential Testing of Illicit N-step Goal execution)框架,该框架通过构建基于良性人设的逐步非法计划,结合自适应后续探针和判别代理(judge agents)来追踪各阶段完成情况,从而实现对代理在真实部署环境中多轮交互下潜在滥用行为的自动化红队测试。此外,论文进一步引入时间到首次越狱(time-to-first-jailbreak)随机变量建模方法,支持发现曲线分析、攻击语言的危险率归因及新指标“受限均值越狱发现”(Restricted Mean Jailbreak Discovery),显著提升了对工具调用型代理滥用风险的量化与诊断能力。
链接: https://arxiv.org/abs/2602.16346
作者: Nivya Talokar,Ayush K Tarun,Murari Mandal,Maksym Andriushchenko,Antoine Bosselut
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
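论文对 Restricted Mean Jailbreak Discovery 的精确定义未在摘要中展开。以下按"受限均值生存时间"的常见思路给出一个示意实现:以各场景的首次越狱轮次为事件时间(`np.inf` 表示从未越狱),计算发现曲线并在轮次预算 τ 内取平均高度:

```python
import numpy as np

def discovery_curve(first_jb_turn, max_turns):
    """D(t):前 t 轮内至少发生一次越狱的场景比例。"""
    t = np.arange(1, max_turns + 1)
    f = np.asarray(first_jb_turn, dtype=float)
    return (f[None, :] <= t[:, None]).mean(axis=1)

def restricted_mean_discovery(first_jb_turn, tau):
    """示意版 RMJD:发现曲线在轮次预算 tau 内的平均高度(取值于 [0,1])。"""
    return float(discovery_curve(first_jb_turn, tau).mean())

first_jb = [2, 5, np.inf, 3, np.inf, 1]   # 每个场景的首次越狱轮次
print(restricted_mean_discovery(first_jb, tau=6))
```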
[NLP-20] MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
【速读】: 该论文旨在解决现有代理(agent)记忆评估方法割裂记忆与行动的问题,即当前基准测试要么仅评估记忆存储能力(如回忆对话内容),要么仅关注单次会话中的行为表现,而忽略了在真实场景中记忆与行动紧密耦合的本质——代理需在多轮交互中从经验中提炼记忆,并利用该记忆指导后续决策。解决方案的关键在于提出 MemoryArena,一个统一的评估环境,支持多轮次、任务间存在显式依赖关系的 agentic 任务,要求代理在多会话的 Memory-Agent-Environment 循环中持续学习并应用记忆以完成整体目标。此设计揭示了当前基于长上下文记忆(long-context memory)的基准(如 LoCoMo)无法充分刻画代理记忆能力的局限性,从而填补了评估空白。
链接: https://arxiv.org/abs/2602.16313
作者: Zexue He,Yu Wang,Churan Zhi,Yuanzhe Hu,Tzu-Ping Chen,Lang Yin,Ze Chen,Tong Arthur Wu,Siru Ouyang,Zihan Wang,Jiaxin Pei,Julian McAuley,Yejin Choi,Alex Pentland
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory, and subsequently use that memory to guide later actions to solve the overall task. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning, and reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting, exposing a gap in current evaluations for agents with memory.
[NLP-21] MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models EACL-2026
【速读】: 该论文旨在解决媒体专业人员在事实核查过程中,自动化支持检测“值得核查的声明”(check-worthy claims)这一关键步骤的不足问题。其解决方案的关键在于构建了一个多语言、多领域、多写作风格的平衡基准数据集——Multi-Check-Worthy (MultiCW),包含123,722个样本,覆盖16种语言、7个主题领域和两种文本风格(噪声型与结构化),并进一步引入一个同样平衡的分布外(out-of-distribution)测试集以评估模型鲁棒性。通过对比多种微调后的多语言Transformer模型与15种商业及开源大语言模型(LLMs)在零样本设置下的表现,研究发现微调模型在声明分类任务中显著优于零样本LLMs,并展现出跨语言、跨领域和跨风格的良好泛化能力,从而为自动化事实核查提供了严谨的多语言资源和系统性比较框架。
链接: https://arxiv.org/abs/2602.16298
作者: Martin Hyben,Sebastian Kula,Jan Cegin,Jakub Simko,Ivan Srba,Robert Moro
机构: Kempelen Institute of Intelligent Technologies (Kempelen智能技术研究所); West Pomeranian University of Technology in Szczecin (西波美拉尼亚理工大学)
类目: Computation and Language (cs.CL)
备注: 18 pages, 8 figures, 19 tables, EACL-2026
Abstract:Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims, a key step in the fact-checking process, remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark 3 common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings. Our findings show that fine-tuned models consistently outperform zero-shot LLMs on claim classification and show strong out-of-distribution generalization across languages, domains, and styles. MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs on the check-worthy claim detection task.
[NLP-22] Aladdin-FTI @ AMIYA: Three Wishes for Arabic NLP: Fidelity, Diglossia and Multidialectal Generation EACL2026
【速读】: 该论文旨在解决阿拉伯语方言在自然语言处理(Natural Language Processing, NLP)研究中长期存在的代表性不足问题,其根源在于方言的非标准化和高度变异性,这对计算建模构成挑战。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)的能力,将阿拉伯语视为一种多中心语言(pluricentric language)而非单一系统进行建模,从而有效支持多种阿拉伯语方言的生成与翻译任务。具体而言,所提出的 Aladdin-FTI 系统实现了摩洛哥、埃及、巴勒斯坦、叙利亚和沙特方言的文本生成,并支持这些方言与现代标准阿拉伯语(Modern Standard Arabic, MSA)及英语之间的双向翻译。
链接: https://arxiv.org/abs/2602.16290
作者: Jonathan Mutal,Perla Al Almaoui,Simon Hengchen,Pierrette Bouillon
机构: 未知
类目: Computation and Language (cs.CL)
备注: 13 pages, Paper submitted to the AMIYA shared task at the VarDial workshop, co-located with EACL 2026
Abstract:Arabic dialects have long been under-represented in Natural Language Processing (NLP) research due to their non-standardization and high variability, which pose challenges for computational modeling. Recent advances in the field, such as Large Language Models (LLMs), offer promising avenues to address this gap by enabling Arabic to be modeled as a pluricentric language rather than a monolithic system. This paper presents Aladdin-FTI, our submission to the AMIYA shared task. The proposed system is designed to both generate and translate dialectal Arabic (DA). Specifically, the model supports text generation in Moroccan, Egyptian, Palestinian, Syrian, and Saudi dialects, as well as bidirectional translation between these dialects, Modern Standard Arabic (MSA), and English. The code and trained model are publicly available.
[NLP-23] Are LLMs Ready to Replace Bangla Annotators?
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)作为自动化标注工具在低资源且敏感语境下(如孟加拉语仇恨言论识别)的可靠性问题,尤其是其潜在的标注偏见与判断不稳定性。解决方案的关键在于构建一个统一的评估框架,对17个LLMs进行系统性基准测试,从而揭示模型规模与标注质量之间并非正相关,并指出任务适配性比模型规模更重要——较小但更贴近任务的模型往往表现出更高的标注一致性,这为未来在敏感语境下部署LLM标注系统提供了关键的实证依据和方法论指导。
链接: https://arxiv.org/abs/2602.16241
作者: Md. Najib Hasan,Touseef Hasan,Souvika Sarkar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators, especially for low-resource and identity-sensitive settings, remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality: smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
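评估 LLM 能否替代人工标注的一个基本操作,是比较 LLM 标签与人工标签的一致性,并检验多次运行之间的稳定性。下面用 Cohen's kappa 给出示意(标签数据为虚构的玩具样例):

```python
from sklearn.metrics import cohen_kappa_score

human    = ["hate", "not", "hate", "not", "hate", "not", "not", "hate"]
llm_run1 = ["hate", "not", "not",  "not", "hate", "not", "hate", "hate"]
llm_run2 = ["hate", "hate", "not", "not", "hate", "not", "not",  "hate"]

print("LLM vs 人工一致性:", cohen_kappa_score(human, llm_run1))
print("运行间自一致性:  ", cohen_kappa_score(llm_run1, llm_run2))  # 判断稳定性
```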
[NLP-24] Long-Tail Knowledge in Large Language Models: Taxonomy, Mechanisms, Interventions and Implications
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对长尾知识(long-tail knowledge)时表现不佳的问题,尤其是低频、领域特定、文化及时间敏感知识的缺失或失真问题。这类知识虽在数据分布中占比小,但对公平性、可解释性和用户信任至关重要。其解决方案的关键在于构建一个结构化的分析框架,从四个互补维度进行系统梳理:长尾知识的定义方式、训练与推理过程中知识丢失或扭曲的机制、缓解此类问题的技术干预措施,以及这些失败对公平性、问责制、透明度和用户信任的影响。该框架还揭示了现有评估方法如何掩盖尾部行为并阻碍对罕见但高影响故障的责任追究,最终指出了隐私、可持续性和治理等开放挑战,为理解长尾知识在部署系统中的表现提供了统一的概念基础。
链接: https://arxiv.org/abs/2602.16201
作者: Sanket Badhe,Deep Shah,Nehal Kathrotia
机构: Google(谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Large language models (LLMs) are trained on web-scale corpora that exhibit steep power-law distributions, in which the distribution of knowledge is highly long-tailed, with most of it appearing infrequently. While scaling has improved average-case performance, persistent failures on low-frequency, domain-specific, cultural, and temporal knowledge remain poorly characterized. This paper develops a structured taxonomy and analysis of long-tail knowledge in large language models, synthesizing prior work across technical and sociotechnical perspectives. We introduce a structured analytical framework that synthesizes prior work across four complementary axes: how long-tail knowledge is defined, the mechanisms by which it is lost or distorted during training and inference, the technical interventions proposed to mitigate these failures, and the implications of these failures for fairness, accountability, transparency, and user trust. We further examine how existing evaluation practices obscure tail behavior and complicate accountability for rare but consequential failures. The paper concludes by identifying open challenges related to privacy, sustainability, and governance that constrain long-tail knowledge representation. Taken together, this paper provides a unifying conceptual framework for understanding how long-tail knowledge is defined, lost, evaluated, and manifested in deployed language model systems.
[NLP-25] The Validity of Coreference-based Evaluations of Natural Language Understanding
【速读】: 该论文旨在解决当前基于共指(coreference)评估方法中存在的测量有效性不足问题,特别是由于定义争议(contestedness)和收敛效度(convergent validity)导致的结论不可推广性。其解决方案的关键在于提出并实现一种新的评估范式,聚焦于测试系统推断事件相对合理性(relative plausibility of events)的能力——这是解决共指问题的核心要素之一。通过这一扩展评估,作者发现当前语言模型在标准基准上表现良好,但在评估条件稍作修改时往往无法像人类一样保持泛化能力,从而揭示了现有NLP范式的局限性,并为未来开发更有效的评估方法和更具泛化能力的系统指明方向。
链接: https://arxiv.org/abs/2602.16200
作者: Ian Porada
机构: 未知
类目: Computation and Language (cs.CL)
备注: PhD Thesis
Abstract:In this thesis, I refine our understanding as to what conclusions we can reach from coreference-based evaluations by expanding existing evaluation practices and considering the extent to which evaluation results are either converging or conflicting. First, I analyze standard coreference evaluations and show that their design often leads to non-generalizable conclusions due to issues of measurement validity - including contestedness (multiple, competing definitions of coreference) and convergent validity (evaluation results that rank models differently across benchmarks). Second, I propose and implement a novel evaluation focused on testing systems’ ability to infer the relative plausibility of events, a key aspect of resolving coreference. Through this extended evaluation, I find that contemporary language models demonstrate strong performance on standard benchmarks - improving over earlier baseline systems within certain domains and types of coreference - but remain sensitive to the evaluation conditions: they often fail to generalize in ways one would expect a human to be capable of when evaluation contexts are slightly modified. Taken together, these findings clarify both the strengths, such as improved accuracy over baselines on widely used evaluations, and the limitations of the current NLP paradigm, including weaknesses in measurement validity, and suggest directions for future work in developing better evaluation methods and more genuinely generalizable systems.
[NLP-26] ModalImmune: Immunity-Driven Unlearning via Self-Destructive Training
【速读】: 该论文旨在解决多模态系统在实际部署中因部分或完全输入通道丢失而导致可靠性下降的问题,即模态缺失(modality loss)对模型性能的破坏。解决方案的关键在于提出一种名为ModalImmune的训练框架,其核心机制是通过有意识且可控地在训练过程中“坍缩”(collapse)特定模态信息,使模型学习到对模态干扰具有鲁棒性的联合表示(joint representations)。该框架集成四个关键技术:谱自适应坍缩正则化器、基于信息增益引导的目标干预控制器、曲率感知梯度掩蔽以稳定破坏性更新,以及经认证的Neumann截断超梯度过程实现自动元参数调整,从而在保持收敛稳定性和重建能力的同时显著提升模型对模态移除和污染的抗扰能力。
链接: https://arxiv.org/abs/2602.16197
作者: Rong Fu,Jia Yee Tan,Wenxin Zhang,Zijian Zhang,Ziming Wang,Zhaolu Kang,Muge Qi,Shuning Zhang,Simon Fong
机构: University of Macau (澳门大学); Renmin University of China (中国人民大学); University of Chinese Academy of Sciences (中国科学院大学); University of Pennsylvania (宾夕法尼亚大学); Zhejiang University (浙江大学); Peking University (北京大学); Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 23 pages, 8 figures
Abstract:Multimodal systems are vulnerable to partial or complete loss of input channels at deployment, which undermines reliability in real-world settings. This paper presents ModalImmune, a training framework that enforces modality immunity by intentionally and controllably collapsing selected modality information during training so the model learns joint representations that are robust to destructive modality influence. The framework combines a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, curvature-aware gradient masking to stabilize destructive updates, and a certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation. Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.
[NLP-27] Beyond Learning: A Training-Free Alternative to Model Adaptation
【速读】: 该论文旨在解决语言模型在迭代演进过程中可能出现性能下降的问题,即新版本模型有时会低于旧版本的性能表现。现有解决方案通常依赖于大量计算资源进行重新训练或微调,效率较低。其核心创新在于提出“模型移植”(model transplantation)方法:通过激活分析识别出语言模型内部对特定任务具有局部激活特征的功能模块,并将这些模块直接植入目标模型中,从而实现无需额外训练即可立即提升性能的效果。关键突破在于发现并利用了语言模型中存在任务局部化的模块结构,使得功能迁移成为可能,为高效修复和增强模型能力提供了新路径。
链接: https://arxiv.org/abs/2602.16189
作者: Namkyung Yoon,Kyeonghyun Yoo,Wooyong Jung,Sanghong Kim,Hwangnam Kim
机构: Korea University (高丽大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures, 5 tables. Preprint submitted to Pattern Recognition Letters
Abstract:Despite the continuous research and evolution of language models, they sometimes underperform previous versions. Existing approaches to overcome these challenges are resource-intensive, highlighting the need for alternatives that enable immediate action. We assume that each language model has a local module inside that is suitable for a specific function. First, this work identifies a set of modules showing consistent and local activation changes under an inference workload through activation-based analysis. Subsequently, we transplant an internal module that is properly activated for a specific task into the target model, leading to immediate and measurable functional changes without additional training or fine-tuning. To experimentally demonstrate the effectiveness of the transplant technique, we quantify the relationship between transplant strength and performance improvement under different conditions for two language models. In the cross-generation setting, we find that transplanting activation-selected modules can substantially improve the underperforming model, reaching up to twice the target baseline and achieving gap-based recovery above 100%. Moreover, in transplant experiments between a base model and its instruction-tuned counterpart, transplantation improves the underperforming model toward the stronger baseline, yielding up to about 2.33 times the target baseline with gap-based recovery reaching up to 100% in the best case. These results show that meaningful capacity transfer can be realized through the implantation of highly localized modules implied by language models. Overall, this work provides empirical evidence for task-localized modularity in language models and presents a new research area: model transplantation.
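"模型移植"的最小草图如下(玩具模型,非论文流程):将供体模型中被激活分析选中的模块权重,按移植强度 α 插值进目标模型的同构模块。`module_idx` 与 `alpha` 均为假设的接口参数:

```python
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

donor, target = make_model(), make_model()

@torch.no_grad()
def transplant(target, donor, module_idx, alpha=1.0):
    """将 donor 第 module_idx 层参数以强度 alpha 插值进 target 对应层:
    tp = (1 - alpha) * tp + alpha * dp,alpha=1 即整体替换。"""
    t_mod, d_mod = target[module_idx], donor[module_idx]
    for tp, dp in zip(t_mod.parameters(), d_mod.parameters()):
        tp.mul_(1 - alpha).add_(alpha * dp)

transplant(target, donor, module_idx=2, alpha=0.5)  # 移植末层,强度 0.5
print(target[2].weight[0, :4])
```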
[NLP-28] Learning Personalized Agents from Human Feedback
【速读】: 该论文旨在解决现代人工智能代理(AI agents)在面对个体用户不断变化的偏好时难以保持持续个性化的问题。现有方法依赖静态数据集,通过交互历史训练隐式偏好模型或在外部记忆中编码用户画像,但在新用户场景和偏好动态演变情况下表现不佳。解决方案的关键在于提出"从人类反馈中学习个性化代理"(Personalized Agents from Human Feedback, PAHF)框架,其核心机制是引入显式的按用户记忆,并构成三步交互闭环:一是通过行动前澄清消除歧义,二是依据从记忆中检索的偏好指导行动,三是利用行动后反馈更新记忆以应对偏好漂移。理论分析与实证结果表明,这种融合显式记忆与双反馈通道的方法显著提升了个性化学习速度,并在初始偏好学习和偏好转变适应方面优于无记忆及单通道基线模型。
链接: https://arxiv.org/abs/2602.16173
作者: Kaiqu Liang,Julia Kruk,Shengyi Qian,Xianjun Yang,Shengjie Bi,Yuanshun Yao,Shaoliang Nie,Mingyang Zhang,Lijuan Liu,Jaime Fernández Fisac,Shuyan Zhou,Saghar Hosseini
机构: Princeton University (普林斯顿大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users. Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding user profiles in external memory. However, these approaches struggle with new users and with preferences that change over time. We introduce Personalized Agents from Human Feedback (PAHF), a framework for continual personalization in which agents learn online from live interaction using explicit per-user memory. PAHF operationalizes a three-step loop: (1) seeking pre-action clarification to resolve ambiguity, (2) grounding actions in preferences retrieved from memory, and (3) integrating post-action feedback to update memory when preferences drift. To evaluate this capability, we develop a four-phase protocol and two benchmarks in embodied manipulation and online shopping. These benchmarks quantify an agent’s ability to learn initial preferences from scratch and subsequently adapt to persona shifts. Our theoretical analysis and empirical results show that integrating explicit memory with dual feedback channels is critical: PAHF learns substantially faster and consistently outperforms both no-memory and single-channel baselines, reducing initial personalization error and enabling rapid adaptation to preference shifts.
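PAHF 的三步闭环可以用一个极简骨架说明。`agent_clarify`/`agent_act` 均为占位桩(假设),真实系统中对应 LLM 调用与记忆检索;这里仅演示"澄清 → 依记忆行动 → 反馈更新"的控制流:

```python
memory = {}  # 每用户的显式偏好记忆(示意)

def agent_clarify(task):
    """占位:任务有歧义且记忆中没有答案时,返回一个澄清问题。"""
    return None if "brand" in memory else "Which brand do you prefer?"

def agent_act(task):
    """占位:依据记忆中检索到的偏好执行动作。"""
    return "buy {} coffee".format(memory.get("brand", "any"))

def pahf_step(task, user_answer=None, feedback=None):
    if agent_clarify(task) and user_answer:  # (1) 行动前澄清,写入记忆
        memory["brand"] = user_answer
    action = agent_act(task)                 # (2) 依据记忆行动
    if feedback:                             # (3) 行动后反馈,应对偏好漂移
        memory["brand"] = feedback
    return action

print(pahf_step("buy coffee", user_answer="Lavazza"))  # -> buy Lavazza coffee
print(pahf_step("buy coffee", feedback="illy"))        # 本轮仍用旧偏好,随后更新
```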
[NLP-29] Discrete Stochastic Localization for Non-autoregressive Generation
【速读】: 该论文旨在解决非自回归(Non-autoregressive, NAR)生成模型在迭代精修过程中因错误累积和分布偏移导致的效率与质量下降问题,尤其是基于掩码扩散语言模型(Masked Diffusion Language Models, MDLMs)及其重掩码采样器(如ReMDM)在实际应用中面临步数预算有限时性能受限的问题。解决方案的关键在于提出了一种名为离散随机定位(Discrete Stochastic Localization, DSL)的新训练范式,其核心是训练一个对信噪比(Signal-to-Noise Ratio, SNR)不变的去噪器,覆盖从中间草稿噪声到掩码风格终点扰动的连续腐蚀水平,统一建模于一个扩散Transformer架构内,从而显著提升每步更新的效率与自我修正能力,实现低步数下MAUVE指标大幅提升,并在高步数下达到自回归模型的质量水平。
链接: https://arxiv.org/abs/2602.16169
作者: Yunshu Wu,Jiayi Cheng,Partha Thakuria,Rob Brekelmans,Evangelos E. Papalexakis,Greg Ver Steeg
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Non-autoregressive (NAR) generation reduces decoding latency by predicting many tokens in parallel, but iterative refinement often suffers from error accumulation and distribution shift under self-generated drafts. Masked diffusion language models (MDLMs) and their remasking samplers (e.g., ReMDM) can be viewed as modern NAR iterative refinement, where generation repeatedly revises a partially observed draft. In this work we show that training alone can substantially improve the step-efficiency of MDLM/ReMDM sampling. We propose DSL (Discrete Stochastic Localization), which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer. On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with ~4x fewer denoiser evaluations, and matches autoregressive quality at high budgets. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.
[NLP-30] LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在创意写作中表现平庸、缺乏新颖性的问题,其核心症结在于当前对不确定性的抑制策略与文学创作所需的不确定性之间存在矛盾。解决方案的关键在于引入“不确定性感知”的对齐范式,即区分有害的幻觉(hallucination)与促进文学丰富性的建设性模糊(constructive ambiguity),从而在保持事实准确性的同时保留创造性表达所必需的不确定性。
链接: https://arxiv.org/abs/2602.16162
作者: Peiqi Sui
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 tables
Abstract:We argue that uncertainty is a key and understudied limitation of LLMs’ performance in creative writing, which is often characterized as trite and cliché-ridden. Literary theory identifies uncertainty as a necessary condition for creative expression, while current alignment strategies steer models away from uncertain outputs to ensure factuality and reduce hallucination. We formalize this tension by quantifying the “uncertainty gap” between human-authored stories and model-generated continuations. Through a controlled information-theoretic analysis of 28 LLMs on high-quality storytelling datasets, we demonstrate that human writing consistently exhibits significantly higher uncertainty than model outputs. We find that instruction-tuned and reasoning models exacerbate this trend compared to their base counterparts; furthermore, the gap is more pronounced in creative writing than in functional domains, and strongly correlates to writing quality. Achieving human-level creativity requires new uncertainty-aware alignment paradigms that can distinguish between destructive hallucinations and the constructive ambiguity required for literary richness.
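"不确定性差距"可以用逐 token 的下一词分布熵来度量。下面用 GPT-2 给出示意(模型选择为本文假设,论文实际覆盖 28 个 LLM;文本为虚构样例):对给定文本的每个位置计算模型预测分布的熵,再取均值作为该文本的不确定性:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def mean_token_entropy(text: str) -> float:
    """文本各位置下一词分布的平均熵(nats),作为不确定性度量。"""
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0, :-1]  # 预测第 2..n 个 token 的分布
    logp = torch.log_softmax(logits, dim=-1)
    return float(-(logp.exp() * logp).sum(-1).mean())

human = "The rain hummed old arguments against the tin roof."
model_out = "It was a dark and stormy night."
print(mean_token_entropy(human), mean_token_entropy(model_out))
```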
[NLP-31] Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti-Emotion Reflection
【速读】: 该论文旨在解决多模态情感与情绪建模中因模态缺失或噪声干扰导致的鲁棒性不足问题,尤其在部分模态不可用或受污染时仍需保持准确的情感理解能力。其解决方案的关键在于提出了一种基于双曲超图(hyperbolic hypergraph)的框架——Emotion Collider (EC-Net),该框架通过Poincaré球嵌入显式建模模态层次结构,并利用双向消息传递机制在节点与超边之间融合信息;同时,在双曲空间中设计解耦的径向与角向对比学习目标以增强类别分离能力,并通过自适应超边构建保留跨时间步和跨模态的高阶语义关系,从而实现对多模态情感表征的稳定、语义一致且抗噪能力强的建模。
链接: https://arxiv.org/abs/2602.16161
作者: Rong Fu,Ziming Wang,Shuo Yin,Wenxin Zhang,Haiyun Wei,Kun Liu,Xianda Li,Zeli Su,Simon Fong
机构: University of Macau (澳门大学); Zhejiang University (浙江大学); Tsinghua University (清华大学); University of Chinese Academy of Sciences (中国科学院大学); Tongji University (同济大学); University of Southampton (南安普顿大学); University of Bologna (博洛尼亚大学); Minzu University of China (中央民族大学)
类目: Multimedia (cs.MM); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 25 pages, 14 figures
Abstract:Emotional expression underpins natural communication and effective human-computer interaction. We present Emotion Collider (EC-Net), a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. EC-Net represents modality hierarchies using Poincaré-ball embeddings and performs fusion through a hypergraph mechanism that passes messages bidirectionally between nodes and hyperedges. To sharpen class separation, contrastive learning is formulated in hyperbolic space with decoupled radial and angular objectives. High-order semantic relations across time steps and modalities are preserved via adaptive hyperedge construction. Empirical results on standard multimodal emotion benchmarks show that EC-Net produces robust, semantically coherent representations and consistently improves accuracy, particularly when modalities are partially available or contaminated by noise. These findings indicate that explicit hierarchical geometry combined with hypergraph fusion is effective for resilient multimodal affect understanding.
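Poincaré 球中的测地距离是此类双曲建模的基本原语:d(u,v) = arccosh(1 + 2‖u−v‖² / ((1−‖u‖²)(1−‖v‖²)))。下面给出该公式的示意实现(仅为背景原语,与论文完整的 EC-Net 架构无关):

```python
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-6):
    """单位 Poincaré 球内两点的双曲测地距离。"""
    sq = ((u - v) ** 2).sum(-1)
    nu = (1 - (u ** 2).sum(-1)).clamp_min(eps)   # 1 - ||u||^2,防止除零
    nv = (1 - (v ** 2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / (nu * nv))

u = torch.tensor([0.1, 0.2])
v = torch.tensor([0.7, 0.5])
print(poincare_distance(u, v))  # 点越靠近球面边界,距离增长越快
```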
[NLP-32] Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution
【速读】: 该论文旨在解决链式思维(Chain-of-thought, CoT)推理在大型语言模型(Large Language Model, LLM)中存在忠实度不足的问题,即CoT生成的推理路径往往不能真实反映模型内部的实际计算过程,从而限制了其解释能力;同时,提升CoT的忠实度和可解释性通常会损害任务性能。解决方案的关键在于提出一种多参与者强化学习框架——推理执行的多听众机制(Reasoning Execution by Multiple Listeners, REMUL),其核心思想是:若多个“听众”模型能够理解并执行某条推理路径,则该路径更可能忠实于原模型的真实推理过程。具体而言,由“说话者”模型生成推理轨迹,并由一组“听众”模型对其进行执行与续写,通过奖励机制鼓励生成清晰、可被多个听众跟随的推理路径,同时引入掩码监督微调(masked supervised fine-tuning)以缓解忠实度与性能之间的权衡问题。实验证明,REMUL在多个推理基准上显著提升了三项忠实度指标(提示归属、早期回答区域曲线下面积、错误注入区域曲线下面积),且同步提升了准确率。
链接: https://arxiv.org/abs/2602.16154
作者: Nithin Sivakumaran,Shoubin Yu,Hyunji Lee,Yue Zhang,Ali Payani,Mohit Bansal,Elias Stengel-Eskin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code: this https URL
Abstract:Chain-of-thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who “execute” the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness – hint attribution, early answering area over the curve (AOC), and mistake injection AOC – while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.
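REMUL 的可跟随性奖励可以用一个骨架草图说明:截断说话者的推理轨迹,交给多个听众续写,以"到达正确答案的听众比例"作为奖励。`speaker`/`listener_continue` 为占位桩(假设),真实系统中为 LLM 调用:

```python
import random

def speaker(question: str) -> str:
    """占位:说话者生成推理轨迹。"""
    return "step1: parse. step2: compute 6*7. step3: answer=42"

def listener_continue(trace_prefix: str, question: str) -> str:
    """占位:听众基于截断的轨迹前缀续写出答案。"""
    return "42" if "step2" in trace_prefix else str(random.randint(0, 99))

def remul_reward(question: str, gold: str, listeners=5, keep=0.6) -> float:
    """截断轨迹 -> 听众续写 -> 奖励 = 答对的听众比例。"""
    trace = speaker(question)
    prefix = trace[: int(len(trace) * keep)]  # 仅保留前 keep 比例的轨迹
    hits = sum(listener_continue(prefix, question) == gold
               for _ in range(listeners))
    return hits / listeners

print(remul_reward("What is 6*7?", gold="42"))
```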
[NLP-33] Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis
【速读】: 该论文旨在解决多模态情感分析系统中敏感个人数据的隐私合规与用户自主权问题,特别是在需要对特定模态数据进行选择性删除时的挑战。解决方案的关键在于提出了一种名为“Missing-by-Design (MBD)”的统一框架,其核心包括两个方面:一是通过结构化表示学习生成具备属性感知能力的嵌入(property-aware embeddings),并利用生成器重构缺失模态以保留任务相关信号;二是设计了一个可验证的参数修改流程,基于显著性驱动的候选选择和校准高斯更新机制,生成机器可验证的模态删除证书(Modality Deletion Certificate),从而实现精准的“手术式遗忘”(surgical unlearning),在保证预测性能的同时提供实用的隐私-效用权衡。
链接: https://arxiv.org/abs/2602.16144
作者: Rong Fu,Wenxin Zhang,Ziming Wang,Chunlei Meng,Jiaxuan Lu,Jiekai Wu,Kangan Qian,Hao Zhang,Simon Fong
机构: University of Macau (澳门大学); University of Chinese Academy of Sciences (中国科学院大学); Zhejiang University (浙江大学); Fudan University (复旦大学); Shanghai AI Laboratory (上海人工智能实验室); Juntendo University (顺天堂大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 6 figures
Abstract:As multimodal systems increasingly process sensitive personal data, the ability to selectively revoke specific data modalities has become a critical requirement for privacy compliance and user autonomy. We present Missing-by-Design (MBD), a unified framework for revocable multimodal sentiment analysis that combines structured representation learning with a certifiable parameter-modification pipeline. Revocability is critical in privacy-sensitive applications where users or regulators may request removal of modality-specific information. MBD learns property-aware embeddings and employs generator-based reconstruction to recover missing channels while preserving task-relevant signals. For deletion requests, the framework applies saliency-driven candidate selection and a calibrated Gaussian update to produce a machine-verifiable Modality Deletion Certificate. Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and delivers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full retraining.
[NLP-34] Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities
【速读】: 该论文旨在解决预训练大语言模型(Large Language Models, LLMs)在持续知识适应过程中面临的“灾难性遗忘”问题,即在学习新知识时会损害原有技能(如指令遵循、推理和事实知识)的表现。现有方法难以同时实现新知识的有效习得与旧能力的保留。其解决方案的关键在于提出一种基于上下文蒸馏的方法——Distillation via Split Contexts (DiSC),该方法通过将训练样本的不同片段分别作为学生和教师模型的条件输入,以共享token上的KL散度最小化为目标进行蒸馏,从而无需显式生成步骤即可高效实现上下文蒸馏,显著提升了模型在持续适应过程中的知识保留与增量学习能力。
链接: https://arxiv.org/abs/2602.16093
作者: Shankar Padmanabhan,Mustafa Omer Gul,Tanya Goyal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages. Preprint, under review
Abstract:Post-training endows pretrained LLMs with a variety of desirable skills, including instruction-following, reasoning, and others. However, these post-trained LLMs only encode knowledge up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot simultaneously learn new knowledge from an adaptation document corpus and mitigate the forgetting of earlier learned capabilities. To address this, we introduce Distillation via Split Contexts (DiSC), a simple context-distillation based approach for continual knowledge adaptation. DiSC derives student and teacher distributions by conditioning on distinct segments of the training example and minimizes the KL divergence between the shared tokens. This allows us to efficiently apply context-distillation without requiring explicit generation steps during training. We run experiments on four post-trained models and two adaptation domains. Compared to prior finetuning and distillation methods for continual adaptation, DiSC consistently reports the best trade-off between learning new knowledge and mitigating forgetting of previously learned skills like instruction-following, reasoning, and factual knowledge.
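DiSC 的核心损失是:教师以更长的上下文片段为条件、学生以较短片段为条件,在二者共享的 token 位置上最小化 KL。下面用随机 logits 给出该损失的形状级示意(维度与词表大小均为假设,非官方实现):

```python
import torch
import torch.nn.functional as F

def disc_kl_loss(teacher_logits, student_logits, n_shared):
    """在末尾 n_shared 个共享位置上计算 KL(teacher || student)。

    teacher_logits: [T_ctx + n_shared, V],以完整上下文为条件
    student_logits: [n_shared, V],仅以共享片段为条件
    """
    t = teacher_logits[-n_shared:]  # 对齐共享 token 的预测分布
    s = student_logits[-n_shared:]
    return F.kl_div(F.log_softmax(s, -1), F.log_softmax(t, -1),
                    log_target=True, reduction="batchmean")

V, n = 1000, 12
t_logits = torch.randn(80 + n, V)   # 教师见到了额外 80 个上下文 token
s_logits = torch.randn(n, V)
print(float(disc_kl_loss(t_logits, s_logits, n)))
```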
[NLP-35] Why Any-Order Autoregressive Models Need Two-Stream Attention: A Structural-Semantic Tradeoff
【速读】: 该论文旨在解决任意顺序生成(any-order generation)中隐藏表示在语义信息与结构信息之间存在权衡的问题,即模型在每一步生成时需同时关注语义上有意义的token以进行预测和结构上较新的token以实现摘要,而这两个目标在单一流中会争夺注意力容量。为隔离这一结构性-语义性权衡(structural-semantic tradeoff)与位置-内容分离(position-content separation)的关系,作者提出Decoupled RoPE——一种对旋转位置编码(Rotary Position Embedding, RoPE)的改进方法,其可在不泄露目标内容的情况下提供目标位置信息。该方案在短序列长度下表现良好(此时语义与结构邻近一致),但随着序列增长、两种排序差异增大而性能下降,表明两流注意力机制的成功不仅源于位置与内容的分离,更关键的是规避了任意顺序生成中固有的结构性与语义性冲突。
链接: https://arxiv.org/abs/2602.16092
作者: Patrick Pynadath,Ruqi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of decoupling token content from position. In this work, we argue that two-stream attention may be serving a more subtle role. We identify a structural-semantic tradeoff in any-order generation: the hidden representation at each step must simultaneously attend to semantically informative tokens for prediction and structurally recent tokens for summarization, objectives that compete for attention capacity in a single stream but can specialize across two streams. To isolate this tradeoff from position-content separation, we propose Decoupled RoPE, a modification to rotary position embeddings that provides target position information without revealing target content. Decoupled RoPE performs competitively at short sequence lengths, where semantic and structural proximity coincide, but degrades as sequence length increases and the two orderings diverge. These results suggest that the success of two-stream attention stems not merely from separating position from content, but from circumventing the deeper structural-semantic tradeoff inherent to any-order generation.
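作为背景,标准 RoPE 把位置编码为对通道两两配对后的二维旋转;Decoupled RoPE 在此基础上向模型提供目标位置的旋转信息而不泄露目标 token 内容(具体构造以原文为准)。下面是标准 RoPE 旋转的示意实现:

```python
import torch

def rope(x: torch.Tensor, pos: int, base: float = 10000.0):
    """对向量 x(末维 d 为偶数)施加位置 pos 的标准 RoPE 旋转。"""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)  # 各通道对的频率
    ang = pos * inv_freq
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]      # 通道两两配对
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin     # 二维旋转
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8)
print(rope(q, pos=5))  # rope(q,m) 与 rope(k,n) 的点积只依赖相对位置 m-n
```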
[NLP-36] Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs
【速读】: 该论文旨在解决当前关于语言模型(Language Models, LMs)心智状态推理能力的研究受限于小样本封闭源模型的问题,从而难以严谨验证人类社会认知理论并评估LM的真实能力。其解决方案的关键在于扩展研究范围,使用41个开源权重模型(open-weight models)对经典错误信念任务进行复制与拓展分析,发现34%的LM表现出对隐含知识状态的敏感性,并揭示了大模型在敏感性和心理测量预测力上的提升;此外,通过LM行为生成并验证了一个关于人类认知的新假设——即人类和LM均在使用非事实动词(如“John thinks…”)提示知识状态时更倾向于归因错误信念,而这一效应的强度在LM中分布范围内,暗示语言分布统计可解释此类现象,但无法解释人类对知识状态的更高敏感性。
链接: https://arxiv.org/abs/2602.16085
作者: Sean Trott,Samuel Taylor,Cameron Jones,James A. Michaelov,Pamela D. Rivière
机构: Rutgers University - Newark (罗格斯大学纽瓦克分校); UC San Diego (加州大学圣地亚哥分校); Stony Brook University (石溪大学); MIT (麻省理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 7 figures, submitted to conference
Abstract:Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition, such as the theory that mental state reasoning emerges in part from language exposure, and our understanding of LMs themselves. Yet much published work on LMs relies on a relatively small sample of closed-source LMs, limiting our ability to rigorously test psychological theories and evaluate LM capacities. Here, we replicate and extend published work on the false belief task by assessing LM mental state reasoning behavior across 41 open-weight models (from distinct model families). We find sensitivity to implied knowledge states in 34% of the LMs tested; however, consistent with prior work, none fully "explain away" the effect in humans. Larger LMs show increased sensitivity and also exhibit higher psychometric predictive power. Finally, we use LM behavior to generate and test a novel hypothesis about human cognition: both humans and LMs show a stronger bias towards attributing false beliefs when knowledge states are cued using a non-factive verb ("John thinks…") than when cued indirectly ("John looks in the…"). Unlike the primary effect of knowledge states, where human sensitivity exceeds that of LMs, the magnitude of the human knowledge cue effect falls squarely within the distribution of LM effect sizes, suggesting that distributional statistics of language can in principle account for the latter but not the former in humans. These results demonstrate the value of using larger samples of open-weight LMs to test theories of human cognition and evaluate LM capacities.
[NLP-37] CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
【速读】: 该论文旨在解决长上下文大语言模型(Large Language Model, LLM)推理中预填充(prefill)阶段的计算瓶颈问题。现有基于token排序的启发式方法虽能通过选择语义相关token来加速推理,但其token重要性估计在不同层间不稳定,且难以独立于特定架构评估排序质量。论文提出一种答案感知的Oracle机制,通过测量生成答案回看提示(prompt)时的注意力来定义真实token重要性,从而揭示现有方法在特定层出现显著排序退化的问题。关键解决方案是采用跨层注意力聚合(Cross-Layer Attention Aggregation, CLAA),即对多层得分进行整合而非依赖单一层次,有效逼近Oracle上限,并将首次词元生成时间(Time-to-First-Token, TTFT)相比完整KV缓存基线最多降低39%。
链接: https://arxiv.org/abs/2602.16054
作者: Bradley McDanel,Steven Li,Harshit Khaitan
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 8 figures
Abstract:The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.
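CLAA 的聚合思想本身只需几行代码:对逐层得到的 token 重要性得分取跨层平均,再按预算选出 top-k。打分来源(各层注意力)与保留比例均为假设,下面是示意实现:

```python
import numpy as np

def claa_select(layer_scores: np.ndarray, keep_ratio: float = 0.3):
    """layer_scores: [n_layers, n_tokens] 逐层 token 重要性得分。
    跨层平均后选出前 keep_ratio 比例的 token 下标(升序返回)。"""
    agg = layer_scores.mean(axis=0)  # 跨层聚合,抑制单层的排序退化
    k = max(1, int(agg.size * keep_ratio))
    return np.sort(np.argsort(agg)[::-1][:k])

rng = np.random.default_rng(0)
scores = rng.random((32, 2048))      # 玩具数据:32 层 x 2048 个提示 token
print(claa_select(scores)[:10])      # 预填充时仅处理被选中的 token
```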
[NLP-38] Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在内分泌学等亚专科临床推理任务中表现不足的问题,尤其是面对快速更新的指南和复杂的证据分级体系时,现有LLMs难以稳定输出准确且可追溯的决策。其解决方案的关键在于构建一个基于结构化推理架构的证据锚定系统——January Mirror,该系统整合了经过精心筛选的内分泌与心血管代谢领域证据语料库,并在闭源证据约束条件下生成带有来源标注的推理结果,从而实现高精度、高可审计性的亚专科临床决策支持。
链接: https://arxiv.org/abs/2602.16050
作者: Amir Hosseinian,MohammadReza Zare Shahneh,Umer Mansoor,Gilbert Szeto,Kirill Karlin,Nima Aghaeepour
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies. Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint without external retrieval. Comparator LLMs had real-time web access to guidelines and primary literature. Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy less than 50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2. Conclusions: Mirror provided evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification. Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.
[NLP-39] A Curious Class of Adpositional Multiword Expressions in Korean EACL2026
【速读】: 该论文旨在解决韩语多词表达(Multiword Expressions, MWEs)在跨语言标注框架中代表性不足的问题,尤其是韩语多词介词(multiword adpositions)缺乏系统分析、标注资源及与现有多语言框架的整合。其解决方案的关键在于聚焦于一类韩语功能型多词表达——后置词动词构式(Postpositional Verb-based Constructions, PVCs),基于韩国维基百科数据对其进行调查与分析,并将其与非MWE结构及结构相似的轻动词构式(Light Verb Constructions, LVCs)进行对比,进而提出一套面向韩语多词介词的标注指南,以支持未来相关研究并促进与跨语言框架的对齐。
链接: https://arxiv.org/abs/2602.16023
作者: Junghyun Min,Na-Rae Han,Jena D. Hwang,Nathan Schneider
机构: Georgetown University (乔治城大学); University of Pittsburgh (匹兹堡大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注: 10 pages. Camera-ready for MWE at EACL 2026
Abstract:Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME. However, Korean MWEs remain underrepresented in these efforts. In particular, Korean multiword adpositions lack systematic analysis, annotated resources, and integration into existing multilingual frameworks. In this paper, we study a class of Korean functional multiword expressions: postpositional verb-based constructions (PVCs). Using data from Korean Wikipedia, we survey and analyze several PVC expressions and contrast them with non-MWEs and light verb constructions (LVCs) with similar structure. Building on this analysis, we propose annotation guidelines designed to support future work in Korean multiword adpositions and facilitate alignment with cross-lingual frameworks.
[NLP-40] MAEB: Massive Audio Embedding Benchmark
【速读】: 该论文旨在解决多模态音频理解模型评估缺乏统一、大规模基准的问题,尤其针对语音、音乐、环境声及跨模态音文推理等多样任务的性能评估。其解决方案的关键在于构建Massive Audio Embedding Benchmark (MAEB),一个涵盖30项任务的大规模基准,覆盖100多种语言和多种音频类型,同时与MTEB(Multimodal Text Embedding Benchmark)生态系统集成,实现文本、图像和音频模态的统一评估。通过在50多个模型上的系统评测,MAEB揭示了当前模型在不同任务上的性能分化现象,并验证了音频编码器在MAEB上的表现与其在音频大语言模型(Audio Large Language Models, Audio LLMs)中的效果高度相关,从而为模型选择与优化提供可靠依据。
链接: https://arxiv.org/abs/2602.16008
作者: Adnan El Assadi,Isaac Chung,Chenghao Xiao,Roman Solomatin,Animesh Jha,Rahul Chand,Silky Singh,Kaitlyn Wang,Ali Sartaz Khan,Marc Moussa Nasser,Sufen Fong,Pengfei He,Alan Xiao,Ayush Sunil Munot,Aditya Shrivastava,Artem Gazizov,Niklas Muennighoff,Kenneth Enevoldsen
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at this https URL.
[NLP-41] Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks
【速读】: 该论文旨在解决神经网络训练过程中能力涌现(capability emergence)机制不明确的问题,特别是如何从几何视角理解模型表征结构演化与任务能力出现之间的关系。其解决方案的关键在于系统追踪五种几何度量在不同模型规模(405K–85M参数)、多个算法任务及Pythia语言模型中的变化,发现:训练初期存在普遍的表征坍缩至任务特定的“底层”(如模运算坍缩至秩约为2.0),且该现象在210倍参数范围内具有尺度不变性;这种坍缩自顶层向底层逐层传播,违背了传统自底向上特征构建的直觉;更重要的是,表征几何结构可作为能力涌现的强前兆(硬任务中预测率达75–100%),而局部学习系数和海森矩阵指标则滞后或无预测能力。这一几何解剖揭示了能力涌现的边界条件,而非提供精确预测工具。
链接: https://arxiv.org/abs/2602.15997
作者: Jayadev Billa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 6 figures, 12 appendix pages
Abstract:Capability emergence during neural network training remains mechanistically opaque. We track five geometric measures across five model scales (405K-85M parameters), 120+ emergence events in eight algorithmic tasks, and three Pythia language models (160M-2.8B). We find: (1) training begins with a universal representation collapse to task-specific floors that are scale-invariant across a 210× parameter range (e.g., modular arithmetic collapses to RankMe ≈ 2.0 regardless of model size); (2) collapse propagates top-down through layers (32/32 task × model consistency), contradicting bottom-up feature-building intuition; (3) a geometric hierarchy in which representation geometry leads emergence (75-100% precursor rate for hard tasks), while the local learning coefficient is synchronous (0/24 precursor) and Hessian measures lag. We also delineate prediction limits: geometric measures encode coarse task difficulty but not fine-grained timing (within-class concordance 27%; when task ordering reverses across scales, prediction fails at 26%). On Pythia, global geometric patterns replicate but per-task precursor signals do not; the precursor relationship requires task-training alignment that naturalistic pre-training does not provide. Our contribution is the geometric anatomy of emergence and its boundary conditions, not a prediction tool.
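文中用到的 RankMe 是一种基于奇异值熵的"有效秩"度量:令 p_i = σ_i / Σσ,则 RankMe = exp(−Σ p_i log p_i)。下面给出示意实现,并用一个秩为 2 的玩具特征矩阵演示"坍缩底层"对应的数值:

```python
import numpy as np

def rankme(feats: np.ndarray, eps: float = 1e-12) -> float:
    """RankMe 有效秩:奇异值分布的指数熵,取值于 [1, min(n, d)]。"""
    s = np.linalg.svd(feats, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))

rng = np.random.default_rng(0)
full = rng.normal(size=(256, 64))                                 # 满秩表示
collapsed = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 64))  # 秩 2 表示
print(rankme(full), rankme(collapsed))  # 后者约为 2,对应文中的坍缩底层
```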
[NLP-42] DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
【速读】: 该论文旨在解决真实场景中多页文档包(document packet)的自动拆分问题,即如何将混合排列、无明确边界且可能跨多个文档的复杂文档集合准确分割为独立的单个文档单元。其解决方案的关键在于提出首个全面的基准数据集DocSplit及其配套的新型评估指标,系统性地衡量大语言模型在识别文档边界、分类文档类型以及保持正确页序方面的性能。该方法不仅涵盖多种文档类型、布局和多模态场景,还特别针对现实挑战如页面错序、文档交错及缺乏清晰分隔等问题进行了建模与测试,从而为法律、金融、医疗等文档密集型领域提供了可扩展的文档理解能力提升框架。
链接: https://arxiv.org/abs/2602.15958
作者: Md Mofijul Islam,Md Sirajus Salekin,Nivedha Balakrishnan,Vincil C. Bishop III,Niharika Jain,Spencer Romo,Bob Strahan,Boyi Xie,Diego A. Socolinsky
机构: Amazon Web Services
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models’ ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.
[NLP-43] Doc-to-LoRA: Learning to Instantly Internalize Contexts
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理长输入序列时因Transformer架构的二次方注意力计算复杂度而导致的推理内存占用高、速度慢的问题。现有方法如上下文蒸馏(Context Distillation, CD)虽能将信息迁移到模型参数中,但针对每个提示进行蒸馏存在训练成本高和延迟大的缺陷。其解决方案的关键在于提出Doc-to-LoRA(D2L),一种轻量级超网络(hypernetwork),通过元学习在单次前向传播中近似完成上下文蒸馏;该方法为未见提示生成LoRA适配器(Low-Rank Adaptation adapter),使后续查询无需重新读取原始上下文,从而显著降低推理阶段的延迟与KV缓存内存消耗,并在长文本检索任务中实现超过原生上下文窗口4倍长度的零样本准确率。
链接: https://arxiv.org/abs/2602.15902
作者: Rujikorn Charakorn,Edoardo Cetin,Shinnosuke Uesaka,Robert Tjarko Lange
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM’s native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.
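下面用 PyTorch 给出 D2L"超网络一次前向生成 LoRA 适配器"这一思路的最小示意(结构、维度与上下文池化方式均为本文假设,并非论文实现):

```python
import torch
import torch.nn as nn

class HyperLoRA(nn.Module):
    """极简超网络示意:把上下文表征映射为目标线性层的 LoRA 增量 (B @ A)。"""
    def __init__(self, ctx_dim: int, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.rank, self.d_in, self.d_out = rank, d_in, d_out
        self.to_a = nn.Linear(ctx_dim, rank * d_in)   # 生成 A: (rank, d_in)
        self.to_b = nn.Linear(ctx_dim, d_out * rank)  # 生成 B: (d_out, rank)

    def forward(self, ctx_emb: torch.Tensor):
        a = self.to_a(ctx_emb).view(self.rank, self.d_in)
        b = self.to_b(ctx_emb).view(self.d_out, self.rank)
        return a, b

hyper = HyperLoRA(ctx_dim=512, d_in=1024, d_out=1024)
ctx = torch.randn(512)        # 假设:整段上下文被池化为一个向量
a, b = hyper(ctx)
w_delta = b @ a               # 推理时:W' = W + scale * (B @ A),后续查询无需重读原上下文
print(w_delta.shape)          # torch.Size([1024, 1024])
```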
[NLP-44] MultiCube-RAG for Multi-hop Question Answering
【速读】: 该论文旨在解决多跳问答(Multi-hop Question Answering, MQA)中因现有检索增强生成(Retrieval-Augmented Generation, RAG)方法难以准确捕捉结构化语义而导致的性能不足问题,特别是传统方法在处理跨主题、属性和关系的多步推理时存在噪声干扰、计算开销大以及缺乏有效多跳机制的局限性。其解决方案的关键在于提出一种基于本体的立方体结构(Ontology-based Cube Structure),通过多个正交维度对主体、属性与关系进行建模,并构建无需训练的 MultiCube-RAG 方法:每个立方体专门针对一类主体建模,从而实现灵活的知识选择;同时,将复杂多跳查询沿立方体维度分解为一系列简单子查询并顺序求解,显著提升了推理精度与效率,并具备天然可解释性。
链接: https://arxiv.org/abs/2602.15898
作者: Jimeng Shi,Wei Hu,Runchu Tian,Bowen Jin,Wonbin Kweon,SeongKu Kang,Yunfan Kang,Dingqi Ye,Sizhe Zhou,Shaowen Wang,Jiawei Han
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Korea University (高丽大学)
类目: Computation and Language (cs.CL)
备注: 12 pages
Abstract:Multi-hop question answering (QA) necessitates multi-step reasoning and retrieval across interconnected subjects, attributes, and relations. Existing retrieval-augmented generation (RAG) methods struggle to capture these structural semantics accurately, resulting in suboptimal performance. Graph-based RAGs structure such information in graphs, but the resulting graphs are often noisy and computationally expensive. Moreover, most methods rely on single-step retrieval, neglecting the need for multi-hop reasoning processes. Recent training-based approaches attempt to incentivize the large language models (LLMs) for iterative reasoning and retrieval, but their training processes are prone to unstable convergence and high computational overhead. To address these limitations, we devise an ontology-based cube structure with multiple and orthogonal dimensions to model structural subjects, attributes, and relations. Built on the cube structure, we propose MultiCube-RAG, a training-free method consisting of multiple cubes for multi-step reasoning and retrieval. Each cube specializes in modeling a class of subjects, so that MultiCube-RAG flexibly selects the most suitable cubes to acquire the relevant knowledge precisely. To enhance the query-based reasoning and retrieval, our method decomposes a complex multi-hop query into a set of simple subqueries along cube dimensions and conquers each of them sequentially. Experiments on four multi-hop QA datasets show that MultiCube-RAG improves response accuracy by 8.9% over the average performance of various baselines. Notably, we also demonstrate that our method performs with greater efficiency and inherent explainability.
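下面的 Python 骨架示意"复杂多跳查询分解为子查询并顺序求解"的控制流;decompose/retrieve/answer 对应的具体组件(立方体选择、LLM 调用)均为假设接口:

```python
from typing import Callable

def solve_multihop(query: str,
                   decompose: Callable[[str], list[str]],
                   retrieve: Callable[[str], str],
                   answer: Callable[[str, str], str]) -> str:
    """顺序求解多跳问题的骨架:子查询中的占位符 {prev} 会被上一跳答案替换,
    模拟沿立方体维度逐跳推理与检索。"""
    prev = ""
    for sub_q in decompose(query):
        sub_q = sub_q.replace("{prev}", prev)   # 注入上一跳结论
        evidence = retrieve(sub_q)              # 从最合适的 cube 中检索证据
        prev = answer(sub_q, evidence)          # 基于证据回答当前子查询
    return prev                                  # 最后一跳答案即最终答案

# 用法示意(以写死的桩函数代替真实组件):
ans = solve_multihop(
    "谁执导了主演《X》的演员的成名作?",
    decompose=lambda q: ["《X》的主演是谁?", "{prev} 的成名作是什么?", "{prev} 的导演是谁?"],
    retrieve=lambda q: "检索到的证据",
    answer=lambda q, e: "答案")
```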
[NLP-45] Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation
【速读】: 该论文旨在解决大规模语言模型在协作学习过程中面临的梯度逆向攻击(Gradient Inversion Attacks, GIAs)问题,即攻击者可通过共享的梯度信息重构出私有训练数据。现有防御方法主要依赖梯度扰动技术(如噪声注入或梯度剪枝),但因梯度、嵌入(embedding)与词元(token)空间间语义相似性仍被保留,难以有效抵御攻击。本文提出一种名为GHOST(gradient shield with obfuscated tokens)的新防御机制,其核心创新在于通过词元级混淆实现梯度空间、嵌入空间与词元空间之间的语义解耦:利用大规模词典中存在语义不同但嵌入相近的替代词元(shadow tokens),在不破坏嵌入和梯度空间关联性的前提下,切断词元空间内的语义连通性。GHOST包含搜索与选择两步,分别用于识别候选替换词元并优选最优阴影词元以最小化对训练特征的干扰,从而在保护隐私(恢复率低至1%)的同时维持模型性能(分类F1最高达0.92,困惑度仅下降5.45)。
链接: https://arxiv.org/abs/2602.15897
作者: Xinguo Feng,Zhongkui Ma,Zihan Wang,Alsharif Abuadbba,Guangdong Bai
机构: The University of Queensland (昆士兰大学); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61); City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Training and fine-tuning large-scale language models largely benefit from collaborative learning, but the approach has been proven vulnerable to gradient inversion attacks (GIAs), which allow adversaries to reconstruct private training data from shared gradients. Existing defenses mainly employ gradient perturbation techniques, e.g., noise injection or gradient pruning, to disrupt GIAs’ direct mapping from gradient space to token space. However, these methods often fall short due to the retention of semantics similarity across gradient, embedding, and token spaces. In this work, we propose a novel defense mechanism named GHOST (gradient shield with obfuscated tokens), a token-level obfuscation mechanism that neutralizes GIAs by decoupling the inherent connections across gradient, embedding, and token spaces. GHOST is built upon an important insight: due to the large scale of the token space, there exist semantically distinct yet embedding-proximate tokens that can serve as the shadow substitutes of the original tokens, which enables a semantic disconnection in the token space while preserving the connection in the embedding and gradient spaces. GHOST comprises a searching step, which identifies semantically distinct candidate tokens using a multi-criteria searching process, and a selection step, which selects optimal shadow tokens to ensure minimal disruption to features critical for training by preserving alignment with the internal outputs produced by original tokens. Evaluation across diverse model architectures (from BERT to Llama) and datasets demonstrates the remarkable effectiveness of GHOST in protecting privacy (as low as 1% in recovery rate) and preserving utility (up to 0.92 in classification F1 and 5.45 in perplexity), in both classification and generation tasks against state-of-the-art GIAs and adaptive attack scenarios.
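GHOST 的核心是为原 token 寻找"嵌入相近、语义不同"的影子替代。下面是该选择逻辑的示意(semantic_sim 假设由外部语义模型给出;阈值与 k 均为任意假设值):

```python
import torch

def find_shadow_tokens(emb: torch.Tensor, token_id: int,
                       semantic_sim: torch.Tensor,
                       k: int = 50, sem_thresh: float = 0.3) -> torch.Tensor:
    """挑选"嵌入相近但语义不同"的影子候选(示意)。
    emb: (V, d) 词嵌入矩阵;semantic_sim: (V,) 各词与原 token 的语义相似度。"""
    e = torch.nn.functional.normalize(emb, dim=-1)
    cos = e @ e[token_id]                    # 嵌入空间余弦相似度
    cos[token_id] = -1.0                     # 排除自身
    cand = cos.topk(k).indices               # 嵌入最邻近的 k 个候选
    keep = semantic_sim[cand] < sem_thresh   # 只保留语义上足够"远"的候选
    return cand[keep]
```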
[NLP-46] Every Little Helps: Building Knowledge Graph Foundation Model with Fine-grained Transferable Multi-modal Tokens
【速读】: 该论文旨在解决多模态知识图谱推理(Multi-modal Knowledge Graph Reasoning, MMKGR)中现有方法在跨知识图谱(KG)迁移能力不足的问题,尤其是传统方法局限于特定数据集的嵌入学习,难以泛化到新知识图谱;同时,近期的知识图谱基础模型(Knowledge Graph Foundation Models, KGFMs)虽提升了跨KG迁移能力,但主要依赖结构模式而忽略了丰富的多模态信号。解决方案的关键在于提出一种基于标记的基础模型(Token-based Foundation Model, TOFU),其将结构、视觉和文本信息离散化为模态特定的token,并采用具有消息混合机制的分层融合架构,从而有效整合多模态信息并提取可迁移的特征,实现对多种类型(包括归纳式和完全归纳式)多模态知识图谱的强泛化性能。
链接: https://arxiv.org/abs/2602.15896
作者: Yichi Zhang,Zhuo Chen,Lingbing Guo,Wen Zhang,Huajun Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注: Work in progress
Abstract:Multi-modal knowledge graph reasoning (MMKGR) aims to predict the missing links by exploiting both graph structure information and multi-modal entity contents. Most existing works are designed for a transductive setting, which learns dataset-specific embeddings and struggles to generalize to new KGs. Recent knowledge graph foundation models (KGFMs) improve cross-KG transfer, but they mainly exploit structural patterns and ignore rich multi-modal signals. We address these gaps by proposing a token-based foundation model (TOFU) for MMKGR, which exhibits strong generalization across different MMKGs. TOFU discretizes structural, visual, and textual information into modality-specific tokens. TOFU then employs a hierarchical fusion architecture with mixture-of-message mechanisms, aiming to process these tokens and obtain transferable features for MMKGR. Experimental results on 17 transductive, inductive, and fully-inductive MMKGs show that TOFU consistently outperforms strong KGFM and MMKGR baselines, delivering strong performance on unseen MMKGs.
[NLP-47] Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion
【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)框架中因文本离散表示导致语义完整性丢失、进而引发检索偏差的问题。其解决方案的关键在于提出CogitoRAG框架,该框架受人类情景记忆机制启发,通过提取并演化“语义主旨(Semantic Gist)”实现知识的结构化建模与动态推理:在离线索引阶段,将非结构化语料转化为包含实体、关系事实和记忆节点的多维知识图谱;在线检索阶段,借助查询分解模块(Query Decomposition Module)对复杂查询进行认知式拆解,并通过实体扩散模块(Entity Diffusion Module)基于结构相关性和实体频率奖励机制执行关联检索;最终利用CogniRank算法融合扩散得分与语义相似度对候选段落进行精排,以片段-记忆配对形式向生成器提供高密度信息支持,从而显著提升复杂知识整合与推理能力。
链接: https://arxiv.org/abs/2602.15895
作者: Pengcheng Zhou,Haochen Li,Zhiqiang Nie,JiaLe Chen,Qing Gong,Weizhen Zhang,Chun Yu
机构: National University of Singapore(新加坡国立大学); Tsinghua University(清华大学); Nanyang Technological University(南洋理工大学); Beijing University of Posts and Telecommunications(北京邮电大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations in LLMs by incorporating external knowledge. However, the inherent discrete representation of text in existing frameworks often results in a loss of semantic integrity, leading to retrieval deviations. Inspired by the human episodic memory mechanism, we propose CogitoRAG, a RAG framework that simulates human cognitive memory processes. The core of this framework lies in the extraction and evolution of the Semantic Gist. During the offline indexing stage, CogitoRAG first deduces unstructured corpora into gist memory corpora, which are then transformed into a multi-dimensional knowledge graph integrating entities, relational facts, and memory nodes. In the online retrieval stage, the framework handles complex queries via Query Decomposition Module that breaks them into comprehensive sub-queries, mimicking the cognitive decomposition humans employ for complex information. Subsequently, Entity Diffusion Module performs associative retrieval across the graph, guided by structural relevance and an entity-frequency reward mechanism. Furthermore, we propose the CogniRank algorithm, which precisely reranks candidate passages by fusing diffusion-derived scores with semantic similarity. The final evidence is delivered to the generator in a passage-memory pairing format, providing high-density information support. Experimental results across five mainstream QA benchmarks and multi-task generation on GraphBench demonstrate that CogitoRAG significantly outperforms state-of-the-art RAG methods, showcasing superior capabilities in complex knowledge integration and reasoning.
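CogniRank 将图扩散得分与语义相似度融合后对候选段落重排。摘要未给出具体融合公式,下面以归一化后的凸组合作为一种假设性示意:

```python
import numpy as np

def cognirank(diffusion_scores: np.ndarray, semantic_sims: np.ndarray,
              alpha: float = 0.5) -> np.ndarray:
    """CogniRank 融合思路的示意(加权方式为假设):
    先把图扩散得分与语义相似度各自 min-max 归一化,再凸组合得到重排分。"""
    def minmax(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    fused = alpha * minmax(diffusion_scores) + (1 - alpha) * minmax(semantic_sims)
    return np.argsort(-fused)   # 返回候选段落按融合分降序的索引
```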
[NLP-48] Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在对齐(alignment)过程中输出多样性下降的问题,即现有对齐方法虽能提升输出质量,但往往导致生成结果趋于单一。其解决方案的关键在于理论分解对齐任务为质量与多样性两个分布,并提出质量约束的熵最大化策略优化(Quality-constrained Entropy Maximization Policy Optimization, QEMPO),通过在保证输出质量的前提下最大化策略输出熵,从而实现多样性增强;同时设计了在线与离线两种训练方法以优化策略,实验表明QEMPO在保持或超越RLHF性能的同时显著提升了输出多样性。
链接: https://arxiv.org/abs/2602.15894
作者: Haihui Pan,Yuzhong Hong,Shaoke Lv,Junwei Bao,Hongfei Jiang,Yang Song
机构: Zuoyebang Education Technology (作业帮教育科技)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Recent research indicates that while alignment methods significantly improve the quality of large language model (LLM) outputs, they simultaneously reduce the diversity of the models’ output. Although some methods have been proposed to enhance LLM output diversity, they often come at the cost of reduced performance. In this work, we first theoretically demonstrate that the alignment task can be decomposed into two distributions: quality and diversity. To enhance the diversity of LLM outputs while ensuring quality, we propose the Quality-constrained Entropy Maximization Policy Optimization (QEMPO). QEMPO aims to maximize the output entropy of the policy while ensuring output quality. By adding different constraints to QEMPO, we obtain different policies. To optimize policies, we propose both online and offline training methods. Experiments validate that QEMPO achieves performance comparable to or even better than RLHF while improving output diversity.
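QEMPO 的目标是"在质量约束下最大化输出熵"。下面用拉格朗日松弛写出一个策略梯度风格的损失示意(阈值 tau、系数 lam 与基线形式均为假设,并非论文原始目标):

```python
import torch

def qempo_loss(logprobs: torch.Tensor, entropy: torch.Tensor,
               quality: torch.Tensor, tau: float = 0.7, lam: float = 0.1):
    """质量约束下的熵最大化(拉格朗日松弛形式,示意):
    J = E[熵] - lam * E[max(0, tau - 质量)],用 REINFORCE 风格按样本加权。
    logprobs/entropy/quality: 每个采样回复的对数概率、输出熵与质量分。"""
    reward = entropy - lam * torch.clamp(tau - quality, min=0.0)
    advantage = reward - reward.mean()              # 简单均值基线
    return -(advantage.detach() * logprobs).mean()  # 最小化该损失即最大化目标
```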
[NLP-49] P-RAG : Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)因依赖静态训练数据而导致的知识滞后与更新困难问题,以及传统检索增强生成(Retrieval-Augmented Generation, RAG)对知识库质量高度敏感的局限性。其核心解决方案是提出一种混合架构——提示增强参数化RAG(Prompt-Enhanced Parametric RAG, P-RAG),该方法通过将参数化知识嵌入LLM内部,并结合链式思维(Chain-of-Thought, CoT)提示和低秩适应(Low-Rank Adaptation, LoRA)微调技术,在推理阶段融合外部检索证据与模型内隐知识,从而提升多跳问答任务中的准确性和上下文适应能力。关键创新在于利用CoT引导推理路径、LoRA实现高效微调,并在PubMedQA(生物医学)与2WikiMultihopQA(通用多跳问答)两个数据集上取得显著性能提升,验证了P-RAG在复杂语义理解与可扩展知识整合方面的潜力。
链接: https://arxiv.org/abs/2602.15874
作者: Xingda Lyu,Gongfu Lyu,Zitai Yan,Yuxin Jiang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities but remain limited by their reliance on static training data. Retrieval-Augmented Generation (RAG) addresses this constraint by retrieving external knowledge during inference, though it still depends heavily on knowledge base quality. To explore potential improvements, we evaluated three RAG variants-Standard RAG, DA-RAG, and our proposed Prompt-Enhanced Parametric RAG (P-RAG), a hybrid architecture that integrates parametric knowledge within the LLM and retrieved evidence, guided by Chain-of-Thought (CoT) prompting and Low-Rank Adaptation (LoRA) fine-tuning-on both general and biomedical datasets. Using LLaMA-3.2-1B-Instruct fine-tuned via LoRA, we evaluate on PubMedQA and 2WikiMultihopQA. P-RAG outperforms Standard RAG on PubMedQA by 10.47 percentage points in F1 (93.33% vs. 82.86%; 12.64% relative). On 2WikiMultihopQA, P-RAG nearly doubles the overall score vs. Standard RAG (33.44% vs. 17.83%) and achieves 44.03% on the Compare subset (with 42.74% Bridge, 21.84% Inference, 8.60% Compose). CoT prompting substantially improves multi-hop reasoning but yields mixed results for simpler, single-hop queries. These findings underscore P-RAG’s potential for accurate, scalable, and contextually adaptive biomedical question answering. Our contributions include: (1) LoRA-based fine-tuning of LLaMA-3.2-1B-Instruct for biomedical QA, (2) introduction of P-RAG with Chain-of-Thought prompting, and (3) state-of-the-art results on PubMedQA and 2WikiMultihopQA.
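摘要提到以 LoRA 微调 LLaMA-3.2-1B-Instruct。下面给出基于 Hugging Face peft 的典型配置示意(r、alpha、目标模块等超参为假设值,并非论文所用配置;该模型仓库需要访问授权):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# 假设的超参,仅演示 LoRA 微调的调用形态
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # 仅极少量参数可训练
```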
[NLP-50] CheckIfExist: Detecting Citation Hallucinations in the Era of AI-Generated Content
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在学术工作流中引发的参考文献完整性问题,尤其是“引用幻觉”(reference hallucination)——即生成看似合理但实际不存在的文献条目。此类问题已出现在NeurIPS和ICLR等顶级机器学习会议的录用论文中,凸显了自动化验证机制的紧迫性。解决方案的关键在于提出一个名为“CheckIfExist”的开源在线工具,其核心创新是采用多源验证架构(基于CrossRef、Semantic Scholar和OpenAlex数据库),结合字符串相似度算法构建级联验证机制,从而计算多维匹配置信度分数,实现对单条或批量BibTeX格式引用的实时真伪判定,并快速返回经验证的APA格式引用及可导出的BibTeX记录。
链接: https://arxiv.org/abs/2602.15871
作者: Diletta Abbonato
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:The proliferation of large language models (LLMs) in academic workflows has introduced unprecedented challenges to bibliographic integrity, particularly through reference hallucination – the generation of plausible but non-existent citations. Recent investigations have documented the presence of AI-hallucinated citations even in papers accepted at premier machine learning conferences such as NeurIPS and ICLR, underscoring the urgency of automated verification mechanisms. This paper presents “CheckIfExist”, an open-source web-based tool designed to provide immediate verification of bibliographic references through multi-source validation against CrossRef, Semantic Scholar, and OpenAlex scholarly databases. While existing reference management tools offer bibliographic organization capabilities, they do not provide real-time validation of citation authenticity. Commercial hallucination detection services, though increasingly available, often impose restrictive usage limits on free tiers or require substantial subscription fees. The proposed tool fills this gap by employing a cascading validation architecture with string similarity algorithms to compute multi-dimensional match confidence scores, delivering instant feedback on reference authenticity. The system supports both single-reference verification and batch processing of BibTeX entries through a unified interface, returning validated APA citations and exportable BibTeX records within seconds.
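CheckIfExist 通过字符串相似度为候选文献记录计算匹配置信度。以下是该思路的最小示意(仅演示标题归一化与相似度打分;真实系统还需级联查询 CrossRef、Semantic Scholar 与 OpenAlex,并融合作者、年份、DOI 等维度):

```python
from difflib import SequenceMatcher

def title_match_confidence(query_title: str, candidate_title: str) -> float:
    """用字符串相似度给检索到的候选记录打匹配置信度(示意)。"""
    norm = lambda s: " ".join(s.lower().split())   # 大小写与空白归一化
    return SequenceMatcher(None, norm(query_title), norm(candidate_title)).ratio()

print(title_match_confidence("Attention Is All You Need",
                             "Attention is all you need"))   # ≈ 1.0
```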
[NLP-51] VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering
【速读】: 该论文旨在解决自回归语言模型在多步推理过程中因单向解码(左到右)导致的不可逆承诺问题,从而限制了生成过程中的修订能力。其解决方案的关键在于提出一种模块化的变量扩散语言模型(Variable Diffusion Language Model, VDLM),通过将语义规划(semantic planning)与文本渲染(text rendering)分离:首先在嵌入空间中使用LLaDA风格的掩码扩散机制对语义变量嵌入进行迭代优化,实现潜在空间内的精细调整;随后采用轨迹感知的强化学习优化策略对规划器进行后训练(post-training),利用嵌入空间奖励和价值函数,避免在强化学习循环中直接进行文本解码;最后引入Vec2Text渲染器及嵌入扰动机制,提升在规划噪声下的鲁棒性文本重建能力。这一框架显著提升了长文本生成任务的表现,验证了嵌入空间后训练与鲁棒潜空间到文本转换的有效性。
链接: https://arxiv.org/abs/2602.15870
作者: Shuhui Qu
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Autoregressive language models decode left-to-right with irreversible commitments, limiting revision during multi-step reasoning. We propose VDLM, a modular variable diffusion language model that separates semantic planning from text rendering. VDLM applies LLaDA-style masked diffusion over semantic variable embeddings to enable iterative refinement in latent space, then post-trains the planner with trajectory-aware optimization using embedding-space rewards and values, avoiding text decoding inside the RL loop. To convert planned embeddings back to text, we use a Vec2Text renderer and introduce embedding perturbations to robustify decoding under planner noise. Across nine benchmarks spanning general reasoning, math, and code, VDLM is competitive in pre-training and yields substantial post-training improvements on long-form generation tasks, outperforming other baselines. These results highlight the effectiveness of embedding-space post-training and robust latent-to-text rendering for diffusion language modeling.
[NLP-52] owards Fair and Efficient De-identification: Quantifying the Efficiency and Generalizability of De-identification Approaches EACL2026
【速读】: 该论文旨在解决临床数据去标识化(de-identification)任务中模型泛化能力不足的问题,尤其是针对不同语言格式、文化背景和性别标识的跨域适应性缺失。现有研究未充分考察大语言模型(LLMs)在多语种、多文化场景下的性能表现与效率权衡。其关键解决方案是系统评估了从小型到大型的多种预训练模型(包括BERT系列、Llama系列及Qwen系列),发现小模型在有限标注数据下经微调后可超越大模型在多语言(如中文、印地语、西班牙语、法语、孟加拉语及区域性英语)和性别化命名等复杂场景中的去标识化效果;并提出公开发布基于BERT、ClinicalBERT和ModernBERT微调后的BERT-MultiCulture-DEID模型集,以提升跨文化情境下的鲁棒性和公平性,从而首次量化了去标识化任务中效率与泛化能力之间的权衡关系,并为实际部署提供了高效且公平的技术路径。
链接: https://arxiv.org/abs/2602.15869
作者: Noopur Zambare,Kiana Aghakasiri,Carissa Lin,Carrie Ye,J. Ross Mitchell,Mohamed Abdalla
机构: University of Alberta (阿尔伯塔大学); Alberta Machine Intelligence Institute (阿尔伯塔机器智能研究院); Arthritis Research Canada (加拿大关节炎研究机构)
类目: Computation and Language (cs.CL)
备注: Accepted to the Findings of EACL 2026
Abstract:Large language models (LLMs) have shown strong performance on clinical de-identification, the task of identifying sensitive identifiers to protect privacy. However, previous work has not examined their generalizability between formats, cultures, and genders. In this work, we systematically evaluate fine-tuned transformer models (BERT, ClinicalBERT, ModernBERT), small LLMs (Llama 1-8B, Qwen 1.5-7B), and large LLMs (Llama-70B, Qwen-72B) at de-identification. We show that smaller models achieve comparable performance while substantially reducing inference cost, making them more practical for deployment. Moreover, we demonstrate that smaller models can be fine-tuned with limited data to outperform larger models in de-identifying identifiers drawn from Mandarin, Hindi, Spanish, French, Bengali, and regional variations of English, in addition to gendered names. To improve robustness in multi-cultural contexts, we introduce and publicly release BERT-MultiCulture-DEID, a set of de-identification models based on BERT, ClinicalBERT, and ModernBERT, fine-tuned on MIMIC with identifiers from multiple language variants. Our findings provide the first comprehensive quantification of the efficiency-generalizability trade-off in de-identification and establish practical pathways for fair and efficient clinical de-identification. Details on accessing the models are available at: this https URL
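去标识化在工程上通常落地为 token 级分类。以下示意用 transformers 的 token-classification pipeline 演示调用形态;所用检查点 dslim/bert-base-NER 只是一个公开的通用 NER 模型,并非论文发布的 BERT-MultiCulture-DEID:

```python
from transformers import pipeline

# 通用 NER 检查点,仅用于演示;真实去标识化需换用专门微调的模型
ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")
note = "Patient Rajesh Kumar was seen by Dr. Marie Dubois in Montreal."
for ent in ner(note):
    # 打印识别出的实体类型、文本片段与置信度
    print(ent["entity_group"], ent["word"], round(ent["score"], 2))
```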
[NLP-53] Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在看似简单的任务中表现出的失效模式问题。其核心挑战在于,传统方法难以精确定位这些失败发生的具体环节,从而限制了对模型行为的理解与改进。解决方案的关键在于提出一种基于确定性多带图灵机的形式化建模框架,将LLM的处理流程分解为多个独立的带(tape),分别表示输入字符、分词(token)、词汇表、模型参数、激活值、概率分布和输出文本等组件。这一建模方式使得故障模式能够被精确地定位到特定处理阶段,例如揭示分词过程如何掩盖字符级结构从而影响计数类任务的表现;同时,该框架还阐明了链式思维提示(chain-of-thought prompting)为何有效——通过将计算外化至输出带实现,但也指出了此类方法的根本局限性。此方法为理解LLM行为提供了严谨且可证伪的分析路径,补充了经验性扩展定律,并推动了对模型错误的机制性解析。
链接: https://arxiv.org/abs/2602.15868
作者: Magnus Boman
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 1 page appendix
Abstract:Large language models (LLMs) exhibit failure modes on seemingly trivial tasks. We propose a formalisation of LLM interaction using a deterministic multi-tape Turing machine, where each tape represents a distinct component: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text. The model enables precise localisation of failure modes to specific pipeline stages, revealing, e.g., how tokenisation obscures character-level structure needed for counting tasks. The model clarifies why techniques like chain-of-thought prompting help, by externalising computation on the output tape, while also revealing their fundamental limitations. This approach provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis.
[NLP-54] Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在复杂任务中推理与问题解决能力的评估问题,特别是其在文本类游戏环境下的表现。研究以经典文本冒险游戏Zork为实验平台,通过量化得分和定性分析其行动序列生成能力,揭示LLMs在元认知(metacognitive)层面的局限性。解决方案的关键在于构建一个结构化的自然语言交互环境(即Zork),并系统测试主流闭源模型(ChatGPT、Claude、Gemini)在不同指令粒度和思维扩展设置下的表现,发现即便提供详细指令或启用“扩展思考”模式,模型仍无法有效改进策略执行与错误修正能力,从而表明当前LLMs在持续反思、策略一致性及经验学习方面存在根本性缺陷。
链接: https://arxiv.org/abs/2602.15867
作者: Berry Gerrits
机构: University of Twente (特温特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 1 figure
Abstract:In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977. The game’s dialogue-based structure provides a controlled environment for assessing how LLM-based chatbots interpret natural language descriptions and generate appropriate action sequences to succeed in the game. We test the performance of leading proprietary models - ChatGPT, Claude, and Gemini - under both minimal and detailed instructions, measuring game progress through achieved scores as the primary metric. Our results reveal that all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points. Notably, providing detailed game instructions offers no improvement, nor does enabling "extended thinking". Qualitative analysis of the models’ reasoning processes reveals fundamental limitations: repeated unsuccessful actions suggesting an inability to reflect on one’s own thinking, inconsistent persistence of strategies, and failure to learn from previous attempts despite access to conversation history. These findings suggest substantial limitations in current LLMs’ metacognitive abilities and problem-solving capabilities within the domain of text-based games, raising questions about the nature and extent of their reasoning capabilities.
[NLP-55] Not the Example but the Process: How Self-Generated Examples Enhance LLM Reasoning AACL
【速读】: 该论文试图解决的问题是:自生成少量示例(self-generated few-shot examples)在提升大语言模型(Large Language Models, LLMs)推理性能时,其背后的有效机制尚不明确,导致难以判断何时以及如何有效应用该技术。解决方案的关键在于区分“生成示例本身”与“生成过程”对性能提升的贡献——研究通过系统比较三种提示策略(零样本提示、集成提示和解耦提示),发现集成提示(Integrated prompting)——即模型在同一提示中完成问题生成与求解——显著优于其他方法,而仅使用自生成示例作为上下文的解耦提示(Decoupled prompting)仅带来微弱改进。注意力分析进一步揭示了两种策略在注意力模式上的显著差异,表明优势主要源于问题创建过程本身,而非生成的示例内容,从而为设计更有效的提示策略提供了关键洞见。
链接: https://arxiv.org/abs/2602.15863
作者: Daehoon Gwak,Minseo Jung,Junwoo Park,Minho Park,ChaeHun Park,Junha Hyung,Jaegul Choo
机构: KAIST AI; Applied Artificial Intelligence, Sungkyunkwan University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Presented at AACL-IJCNLP 2025
Abstract:Recent studies have shown that Large Language Models (LLMs) can improve their reasoning performance through self-generated few-shot examples, achieving results comparable to manually curated in-context examples. However, the underlying mechanism behind these gains remains unclear, making it hard to decide when and how to apply the technique effectively. In this work, we argue that the key benefit arises not from the generated examples themselves but from the act of creating them. To validate this, on reasoning-intensive tasks across diverse LLM architectures, we systematically evaluate three prompting strategies for in-context learning: (1) Zero-shot prompting; (2) Integrated prompting, where LLMs create and solve problems within a single, unified prompt; and (3) Decoupled prompting, where self-generated examples are reused as in-context examples, but the context of their creation itself is excluded. We conduct experiments across five widely used model architectures, demonstrating that Integrated prompting consistently outperforms both Zero-shot and Decoupled prompting. In contrast, Decoupled prompting offers only marginal gains over Zero-shot. Further, for a more in-depth analysis, we conduct an attention analysis and observe significant differences in attention patterns between Integrated and Decoupled prompting. These findings suggest that the advantage of self-generation prompting comes from the process of problem creation, not the examples themselves, providing valuable insights for designing more effective prompting strategies.
[NLP-56] Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLMMs)在从食物图像生成菜谱时,尽管在词汇层面评分(如BLEU、ROUGE)较高,但生成内容常出现语义错误的动作或食材的问题。其解决方案的关键在于提出一种语义 grounded 的框架,通过预测和验证动作与食材作为内部上下文来指导指令生成;具体采用两阶段流水线:第一阶段使用监督微调(Supervised Fine-Tuning, SFT)基于动作推理数据集和食材语料构建基础准确性,第二阶段引入频率感知奖励的强化学习微调(Reinforcement Fine-Tuning, RFT),提升长尾动作预测能力和食材泛化性能;此外,还设计了语义置信度评分与修正模块(Semantic Confidence Scoring and Rectification, SCSR),用于过滤并修正预测结果,从而显著提升生成菜谱的语义保真度。
链接: https://arxiv.org/abs/2602.15862
作者: Guoshan Liu,Bin Zhu,Yian Li,Jingjing Chen,Chong-Wah Ngo,Yu-Gang Jiang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Multimodal Large Language Models (MLMMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.
[NLP-57] CAST: Achieving Stable LLM -based Text Analysis for Data Analytics
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在表格数据文本分析任务中输出不稳定的问题,尤其在摘要生成(summarization)和标签标注(tagging)等核心操作中,LLMs难以满足数据分析领域对结果一致性的高要求。解决方案的关键在于提出CAST框架,其通过两个核心机制实现稳定推理:一是算法提示(Algorithmic Prompting),用于约束有效的推理路径以确保逻辑连贯性;二是先思考后表达(Thinking-before-Speaking),强制模型在最终输出前做出显式的中间决策承诺,从而提升输出一致性。实验表明,CAST在多个基准测试中显著优于现有方法,稳定性评分最高提升16.2%,同时保持或提升生成质量。
链接: https://arxiv.org/abs/2602.15861
作者: Jinxiang Xie,Zihao Li,Wei He,Rui Ding,Shi Han,Dongmei Zhang
机构: Nanjing University (南京大学); Tsinghua University (清华大学); Peking University (北京大学); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Text analysis of tabular data relies on two core operations: summarization for corpus-level theme extraction and tagging for row-level labeling. A critical limitation of employing large language models (LLMs) for these tasks is their inability to meet the high standards of output stability demanded by data analytics. To address this challenge, we introduce CAST (Consistency via Algorithmic Prompting and Stable Thinking), a framework that enhances output stability by constraining the model’s latent reasoning path. CAST combines (i) Algorithmic Prompting to impose a procedural scaffold over valid reasoning transitions and (ii) Thinking-before-Speaking to enforce explicit intermediate commitments before final generation. To measure progress, we introduce CAST-S and CAST-T, stability metrics for bulleted summarization and tagging, and validate their alignment with human judgments. Experiments across publicly available benchmarks on multiple LLM backbones show that CAST consistently achieves the best stability among all baselines, improving Stability Score by up to 16.2%, while maintaining or improving output quality.
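CAST 的两个机制(算法提示、先思考后表达)都体现在提示的组织方式上。下面给出一个假设性的提示骨架,仅示意"先列观察、再显式承诺、最后只按承诺输出"的三段结构,措辞并非论文原文:

```python
# 假设性的 CAST 风格提示模板,步骤划分与措辞均为示意
CAST_STYLE_PROMPT = """你是数据分析助手。请严格按以下步骤输出:
1. THINK: 逐条列出你在文本列中观察到的候选主题,每个主题附一条原文证据。
2. COMMIT: 从候选中确定最终主题清单,并为每个主题写一句定义(此后不得更改)。
3. SPEAK: 仅依据 COMMIT 的清单输出最终的要点式摘要。
文本列:
{rows}
"""
print(CAST_STYLE_PROMPT.format(rows="- 订单延迟\n- 包装破损"))
```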
[NLP-58] Reranker Optimization via Geodesic Distances on k-NN Manifolds
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中神经重排序方法计算资源消耗大、延迟高(3–5秒/查询)的问题。现有方案如交叉编码器(cross-encoder)或大语言模型(LLM)虽性能优异,但难以满足实时应用需求。解决方案的关键在于提出Maniscope——一种基于几何重排序的方法,通过在检索到的文档候选集上构建k近邻(k-NN)流形并计算测地距离(geodesic distance),融合全局余弦相似性与局部流形几何结构,从而捕捉传统欧氏度量所忽略的语义结构。该方法在保持高精度的同时,将平均延迟降至4.7毫秒(比HNSW基线快3.2倍,比LLM重排序快840倍),且在多个BEIR基准数据集上优于现有方法,具备部署于实时RAG系统的潜力。
链接: https://arxiv.org/abs/2602.15860
作者: Wen G. Gong
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 tables
Abstract:Current neural reranking approaches for retrieval-augmented generation (RAG) rely on cross-encoders or large language models (LLMs), requiring substantial computational resources and exhibiting latencies of 3-5 seconds per query. We propose Maniscope, a geometric reranking method that computes geodesic distances on k-nearest neighbor (k-NN) manifolds constructed over retrieved document candidates. This approach combines global cosine similarity with local manifold geometry to capture semantic structure that flat Euclidean metrics miss. Evaluating on eight BEIR benchmark datasets (1,233 queries), Maniscope outperforms the HNSW graph-based baseline on the three hardest datasets (NFCorpus: +7.0%, TREC-COVID: +1.6%, AorB: +2.8% NDCG@3) while being 3.2x faster (4.7 ms vs 14.8 ms average). Compared to cross-encoder rerankers, Maniscope achieves within 2% accuracy at 10-45x lower latency. On TREC-COVID, LLM-Reranker provides only +0.5% NDCG@3 improvement over Maniscope at 840x higher latency, positioning Maniscope as a practical alternative for real-time RAG deployment. The method requires O(N D + M^2 D + M k log k) complexity where M ≪ N, enabling sub-10 ms latency. We plan to release Maniscope as open-source software.
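Maniscope 的核心计算是"在候选文档的 k-NN 图上以最短路近似测地距离"。下面用 scikit-learn 建图、SciPy 求最短路给出一个可运行的示意(建图细节及与全局余弦相似度的融合为假设,未包含在内):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def geodesic_rerank(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5):
    """在"查询 + 候选文档"的嵌入上建 k-NN 图,用图上最短路近似测地距离,
    再按查询到各文档的测地距离升序重排(示意)。"""
    pts = np.vstack([query_emb[None, :], doc_embs])        # 第 0 个点是查询
    graph = kneighbors_graph(pts, n_neighbors=k, mode="distance")
    dist = shortest_path(graph, method="D", directed=False)  # Dijkstra
    geo = dist[0, 1:]                                       # 查询到每个文档的测地距离
    return np.argsort(geo)

docs = np.random.randn(50, 32)
q = np.random.randn(32)
print(geodesic_rerank(q, docs)[:3])   # 测地距离最近的 3 个文档索引
```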
[NLP-59] From Transcripts to AI Agents : Knowledge Extraction RAG Integration and Robust Evaluation of Conversational AI Assistants
【速读】: 该论文旨在解决面向客户服务行业的对话式人工智能(Conversational AI)助手构建难题,核心挑战包括对话数据噪声大、知识碎片化以及对准确人工转接的需求,尤其在依赖实时信息的领域更为突出。其解决方案的关键在于提出一个端到端框架:首先通过简化版PIPA框架对历史通话记录进行质量评分与筛选,保留高质量、连贯性强且人类客服响应有效的交互;随后利用大语言模型(Large Language Models, LLMs)从精选语料中提取结构化知识,并作为唯一知识源部署于检索增强生成(Retrieval-Augmented Generation, RAG)管道中;同时采用系统化的提示调优策略,从单一提示逐步演进为轻量、模块化且受控的设计,以保障行为一致性、安全性及可控执行;最终通过基于通话记录的用户模拟器和红队测试实现定量评估,验证了该方法在房地产与专业招聘等高难度场景下具备约30%的自主处理能力、近乎完美的事实准确性与抗攻击鲁棒性。
链接: https://arxiv.org/abs/2602.15859
作者: Krittin Pachtrachai,Petmongkon Pornpichitsuwan,Wachiravit Modecrua,Touchapon Kraisingkorn
机构: Amity Research and Application Center (ARAC), Amity Solutions (阿米蒂解决方案)
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures, 1 table
Abstract:Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off - particularly in domains that depend heavily on real-time information. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts. Incoming transcripts are first graded using a simplified adaptation of the PIPA framework, focusing on observation alignment and appropriate response behavior, and are filtered to retain only high-quality interactions exhibiting coherent flow and effective human agent responses. Structured knowledge is then extracted from curated transcripts using large language models (LLMs) and deployed as the sole grounding source in a Retrieval-Augmented Generation (RAG) pipeline. Assistant behavior is governed through systematic prompt tuning, progressing from monolithic prompts to lean, modular, and governed designs that ensure consistency, safety, and controllable execution. Evaluation is conducted using a transcript-grounded user simulator, enabling quantitative measurement of call coverage, factual accuracy, and human escalation behavior. Additional red teaming assesses robustness against prompt injection, out-of-scope, and out-of-context attacks. Experiments are conducted in the Real Estate and Specialist Recruitment domains, which are intentionally challenging and currently suboptimal for automation due to their reliance on real-time data. Despite these constraints, the assistant autonomously handles approximately 30 percents of calls, achieves near-perfect factual accuracy and rejection behavior, and demonstrates strong robustness under adversarial testing.
[NLP-60] State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在动态环境中进行序列决策时性能受限的问题,核心在于状态表示(state representation)的设计对模型推理稳定性与准确性的影响。解决方案的关键在于系统性地考察三个状态表示维度:状态粒度(长文本 vs. 摘要)、结构形式(自然语言 vs. 符号化表达)以及空间接地方式(纯文本 vs. 图像或文本地图编码),并发现:轨迹摘要可减少噪声并稳定长程推理;自然语言表示具有最广泛的鲁棒性,而结构化编码仅在具备代码或结构输出先验的模型中有效;文本空间编码优于图像输入,其优势源于构建过程本身所激发的空间推理能力,而非空间信息本身。这表明状态表示的设计是决定性能的关键因素,独立于信息可用性。
链接: https://arxiv.org/abs/2602.15858
作者: Annie Wong,Aske Plaat,Thomas Bäck,Niki van Stein,Anna V. Kononova
机构: Leiden Institute of Advanced Computer Science, Leiden University (莱顿大学高级计算机科学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models (LLMs) move from static reasoning tasks toward dynamic environments, their success depends on the ability to navigate and respond to an environment that changes as they interact at inference time. An underexplored factor in these settings is the representation of the state. Holding model parameters fixed, we systematically vary three key aspects: (1) state granularity (long form versus summary), (2) structure (natural language versus symbolic), and (3) spatial grounding (text-only versus images or textual map encodings) across sequential decision-making benchmarks. We find that trajectory summarisation improves performance by reducing noise and stabilising long-horizon reasoning. Second, natural language representations are the most robust across models, whereas structured encodings help mainly for models with strong code or structured output priors, such as JSON schemas. Third, while image-inputs show some benefit, text-based spatial encodings prove most effective. This advantage stems not from the spatial information itself, but from the act of construction, which compels the model to perform the spatial reasoning that static input does not elicit. Overall, we demonstrate that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself. We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons, particularly when they must synthesise information to manage multiple subtasks to reach a goal.
[NLP-61] Multi-source Heterogeneous Public Opinion Analysis via Collaborative Reasoning and Adaptive Fusion: A Systematically Integrated Approach
【速读】: 该论文旨在解决多源异构平台公共意见分析中的挑战,包括结构差异、语义变异和平台特异性偏差等问题。其核心解决方案是提出一种协同推理与自适应融合(Collaborative Reasoning and Adaptive Fusion, CRAF)框架,通过结构化的多阶段推理机制将传统特征方法与大语言模型(Large Language Models, LLMs)系统性集成。关键创新在于:(1) 跨平台协同注意力模块实现语义对齐并保留源特性;(2) 分层自适应融合机制依据数据质量和任务需求动态加权特征;(3) 联合优化策略在共享潜在空间中同时学习主题表示与情感分布;(4) 新型多模态提取能力整合OCR、语音识别(ASR)与视觉情感分析以处理抖音和快手等视频内容。理论分析表明,CRAF相较独立源建模可获得更紧的泛化界(减少O(sqrt(d log K / m))),实验证明其在多个跨平台数据集上显著提升主题聚类ARI(+4.1%)和情感分析F1-score(+3.8%),且新平台标注数据需求降低75%。
链接: https://arxiv.org/abs/2602.15857
作者: Yi Liu
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 11 figures
Abstract:The analysis of public opinion from multiple heterogeneous sources presents significant challenges due to structural differences, semantic variations, and platform-specific biases. This paper introduces a novel Collaborative Reasoning and Adaptive Fusion (CRAF) framework that systematically integrates traditional feature-based methods with large language models (LLMs) through a structured multi-stage reasoning mechanism. Our approach features four key innovations: (1) a cross-platform collaborative attention module that aligns semantic representations while preserving source-specific characteristics, (2) a hierarchical adaptive fusion mechanism that dynamically weights features based on both data quality and task requirements, (3) a joint optimization strategy that simultaneously learns topic representations and sentiment distributions through shared latent spaces, and (4) a novel multimodal extraction capability that processes video content from platforms like Douyin and Kuaishou by integrating OCR, ASR, and visual sentiment analysis. Theoretical analysis demonstrates that CRAF achieves a tighter generalization bound with a reduction of O(sqrt(d log K / m)) compared to independent source modeling, where d is feature dimensionality, K is the number of sources, and m is sample size. Comprehensive experiments on three multi-platform datasets (Weibo-12, CrossPlatform-15, NewsForum-8) show that CRAF achieves an average topic clustering ARI of 0.76 (4.1% improvement over best baseline) and sentiment analysis F1-score of 0.84 (3.8% improvement). The framework exhibits strong cross-platform adaptability, reducing the labeled data requirement for new platforms by 75%.
[NLP-62] Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的任务导向型对话系统中,现有训练方法(如基于token级似然或偏好优化)难以与长周期任务成功对齐的问题。其核心解决方案是提出一种分层强化学习框架——目标导向偏好优化(Goal-Oriented Preference Optimization, GOPO),关键在于通过专家代理(Expert Agent)和客服代理(Customer Service Agent)的解耦机制实现策略规划与响应生成的分离:专家代理在对话轨迹层面优化多轮目标偏好,而客服代理严格依据所选策略生成响应,从而显著提升任务完成度与对话质量。
链接: https://arxiv.org/abs/2602.15854
作者: Jingyi Xu,Xingyu Ren,Zhiqiang You,Yumeng Zhang,Zhoupeng Shou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent’s critical role in long-horizon optimization. GOPO demonstrates consistent improvements across other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.
[NLP-63] A Lightweight Explainable Guardrail for Prompt Safety
【速读】: 该论文旨在解决生成式 AI (Generative AI) 安全性评估中提示词(prompt)分类的可解释性问题,即如何在保证分类准确率的同时,提供人类可理解的决策依据。其解决方案的关键在于提出一种轻量级可解释防护机制(Lightweight Explainable Guardrail, LEG),该机制采用多任务学习架构,联合训练提示词分类器与解释分类器,其中后者识别对整体安全/不安全判断具有解释作用的关键词;同时,通过一种新颖的数据生成策略对抗大语言模型(LLM)的确认偏倚,以生成高质量的可解释性合成数据;此外,训练过程中引入一种结合交叉熵损失与焦点损失(focal loss)并基于不确定性的加权策略,有效捕捉全局解释信号,从而在模型规模显著小于现有方法的前提下,实现更优或相当的分类与解释性能。
链接: https://arxiv.org/abs/2602.15853
作者: Md Asiful Islam,Mihai Surdeanu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a lightweight explainable guardrail (LEG) method for the classification of unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels prompt words that explain the safe/unsafe overall decision. LEG is trained using synthetic data for explainability, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG’s training process uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state-of-the-art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, despite the fact that its model size is considerably smaller than current approaches. If accepted, we will release all models and the annotated dataset publicly.
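LEG 的训练损失"结合交叉熵与 focal loss 并做基于不确定性的加权"。下面给出一种常见的 Kendall 式可学习加权的实现示意(论文的具体加权形式未在摘要中给出,此处为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyWeightedLoss(nn.Module):
    """以可学习的 log sigma^2 加权两路损失:
    提示分类的交叉熵 + 解释词标注的 focal loss(示意)。"""
    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma
        self.log_vars = nn.Parameter(torch.zeros(2))  # 每个任务一个 log σ²

    def focal(self, logits, target):
        ce = F.cross_entropy(logits, target, reduction="none")
        pt = torch.exp(-ce)                            # 正确类别的预测概率
        return ((1 - pt) ** self.gamma * ce).mean()    # 难例权重更大

    def forward(self, prompt_logits, prompt_y, word_logits, word_y):
        l1 = F.cross_entropy(prompt_logits, prompt_y)  # 提示安全/不安全分类
        l2 = self.focal(word_logits, word_y)           # 解释性词级标注
        w = torch.exp(-self.log_vars)                  # 不确定性越大权重越小
        return w[0] * l1 + w[1] * l2 + self.log_vars.sum()
```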
[NLP-64] Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints
【速读】: 该论文旨在解决临床自然语言处理(Natural Language Processing, NLP)模型在医院出院计划支持中因时间泄漏(temporal leakage)和词汇泄漏(lexical leakage)导致的预测性能虚高问题,此类泄漏会使得模型依赖于未来临床决策的文档痕迹,从而在真实世界部署中引发过自信或时间上无效的预测,危及患者安全并扰乱临床流程。解决方案的关键在于提出一种轻量级审计流水线(lightweight auditing pipeline),将可解释性(interpretability)嵌入模型开发过程,在最终训练前识别并抑制易受泄漏影响的信号,从而提升模型的时间有效性、校准度和行为鲁棒性,确保其具备实际部署的安全性与可靠性。
链接: https://arxiv.org/abs/2602.15852
作者: Ha Na Cho,Sairam Sutari,Alexander Lopez,Hansen Bow,Kai Zheng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Clinical natural language processing (NLP) models have shown promise for supporting hospital discharge planning by leveraging narrative clinical documentation. However, note-based models are particularly vulnerable to temporal and lexical leakage, where documentation artifacts encode future clinical decisions and inflate apparent predictive performance. Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety. This study focuses on system-level design choices required to build safe and deployable clinical NLP under temporal leakage constraints. We present a lightweight auditing pipeline that integrates interpretability into the model development process to identify and suppress leakage-prone signals prior to final training. Using next-day discharge prediction after elective spine surgery as a case study, we evaluate how auditing affects predictive behavior, calibration, and safety-relevant trade-offs. Results show that audited models exhibit more conservative and better-calibrated probability estimates, with reduced reliance on discharge-related lexical cues. These findings emphasize that deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over optimistic performance.
[NLP-65] Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey
【速读】: 该论文试图解决的问题是:当前自然语言处理(NLP)领域在应用叙事理论于大型语言模型(LLMs)时缺乏系统性和理论指导,导致任务定义模糊、评估标准不统一,阻碍了模型性能的可比性和进步。其解决方案的关键在于提出一个基于叙事学(narratology)已有区分的分类体系(taxonomy),并梳理出叙事数据集与任务、叙事理论与NLP流程及提示(prompting)和微调(fine-tuning)方法的趋势,从而促进跨学科协作,并推动以理论为基础的度量指标发展、大规模文学/社会/文化分析以及用于验证或修正叙事理论的实验设计。这为未来更系统化、理论驱动的叙事研究提供了基础框架。
链接: https://arxiv.org/abs/2602.15851
作者: David Y. Liu,Aditya Joshi,Paul Dawson
机构: University of New South Wales (UNSW)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages
Abstract:Applications of narrative theories using large language models (LLMs) deliver promising use-cases in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research engages with fields of narrative studies, and proposes a taxonomy for ongoing efforts that reflect established distinctions in narratology. We discover patterns in the following: narrative datasets and tasks, narrative theories and NLP pipeline and methodological trends in prompting and fine-tuning. We highlight how LLMs enable easy connections of NLP pipelines with abstract narrative concepts and opportunities for interdisciplinary collaboration. Challenges remain in attempts to work towards any unified definition or benchmark of narrative related tasks, making model comparison difficult. For future directions, instead of the pursuit of a single, generalised benchmark for ‘narrative quality’, we believe that progress benefits more from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes to incrementally improve model performance; conducting large-scale, theory-driven literary/social/cultural analysis; and creating experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview to ongoing research efforts and the broader narrative studies landscape.
[NLP-66] Large Language Models for Assisting American College Applications
【速读】: 该论文旨在解决美国高校申请过程中学生面临的碎片化招生政策、重复且条件复杂的申请表单以及模糊问题导致的信息交叉验证困难等问题。其解决方案的关键在于提出了一种基于大语言模型(Large Language Model, LLM)的系统 EZCollegeApp,该系统采用“映射优先”(mapping-first)范式,将表单理解与答案生成分离,从而在异构申请门户间实现一致的推理逻辑;同时通过从官方招生网站导入文档、检索增强型问答机制以及人机协同的聊天界面,确保建议内容源自权威资料且最终决策权完全由用户掌控,有效提升了申请流程的结构化程度与准确性。
链接: https://arxiv.org/abs/2602.15850
作者: Zhengliang Liu,Weihang You,Peng Shu,Junhao Chen,Yi Pan,Hanqi Jiang,Yiwei Li,Zhaojun Ding,Chao Cao,Xinliang Li,Yifan Zhou,Ruidong Zhang,Shaochen Xu,Wei Ruan,Huaqin Zhao,Dajiang Zhu,Tianming Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:American college applications require students to navigate fragmented admissions policies, repetitive and conditional forms, and ambiguous questions that often demand cross-referencing multiple sources. We present EZCollegeApp, a large language model (LLM)-powered system that assists high-school students by structuring application forms, grounding suggested answers in authoritative admissions documents, and maintaining full human control over final responses. The system introduces a mapping-first paradigm that separates form understanding from answer generation, enabling consistent reasoning across heterogeneous application portals. EZCollegeApp integrates document ingestion from official admissions websites, retrieval-augmented question answering, and a human-in-the-loop chatbot interface that presents suggestions alongside application fields without automated submission. We describe the system architecture, data pipeline, internal representations, security and privacy measures, and evaluation through automated testing and human quality assessment. Our source code is released on GitHub (this https URL) to facilitate the broader impact of this work.
[NLP-67] Preference Optimization for Review Question Generation Improves Writing Quality
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的同行评审问答生成方法普遍存在表面化问题,即生成的问题过度依赖论文前几页内容(超过50%的问题token来自首页),缺乏深度、证据支撑与专业性。为应对这一挑战,作者提出IntelliReward——一种基于冻结的自回归LLM与可训练多头Transformer结合的奖励模型,其通过在最后50个token状态上进行优化,显著提升对专家偏好预测的能力。解决方案的关键在于引入Decoupled Clip和动态采样策略优化(DAPO)框架,利用IntelliReward对齐人类对努力程度、证据充分性和问题扎根性的标准,从而训练出IntelliAsk模型,在多项推理和写作基准测试中表现出优于基线模型(如Qwen3-32B)的性能,证明了高质量评审问题与模型整体能力之间的正相关关系。
链接: https://arxiv.org/abs/2602.15849
作者: Karun Sharma,Vidushee Vats,Shengzhi Li,Yuxiang Wang,Zhongtian Sun,Prayag Tiwari
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 Pages, v1
Abstract:Peer review relies on substantive, evidence-based questions, yet existing LLM-based approaches often generate surface-level queries, drawing over 50% of their question tokens from a paper’s first page. To bridge this gap, we develop IntelliReward, a novel reward model built from a frozen autoregressive LLM with trainable multi-head transformers over the final 50 token states, which outperforms API-based SFT baselines in predicting expert-level human preferences. By applying Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with IntelliReward, we train IntelliAsk, a question-generation model aligned with human standards of effort, evidence, and grounding. We find consistent improvements on reasoning and writing benchmarks, suggesting reviewer-question quality correlates with broader capabilities. Compared to the Qwen3-32B base model, IntelliAsk shows measurable gains across diverse benchmarks, specifically improving performance on reasoning tasks like MuSR (68.3 vs 64.7 Acc) and complex writing evaluations such as WritingBench (8.31 vs 8.07). We release our implementation, expert preference annotations, and the IntelliReward model to provide an automatic evaluation benchmark for grounding, effort, and evidence in LLM-generated review questions.
[NLP-68] Can LLM s Assess Personality? Validating Conversational AI for Trait Profiling
【速读】: 该论文旨在解决传统人格测评依赖结构化问卷(如IPIP-50)所存在的效率低、参与度差及情境僵化等问题,探索生成式AI在人格评估中的可行性与有效性。其解决方案的关键在于利用引导式大型语言模型(Large Language Models, LLMs)对话交互获取人格特质信息,并通过对照实验验证其与标准问卷结果的一致性,发现LLM方法在尽责性(Conscientiousness)、开放性(Openness)和神经质(Neuroticism)上具有统计等效性,且用户对LLM生成的人格画像准确度评价不逊于传统方式,表明基于对话的AI评估可作为心理测量的新范式。
链接: https://arxiv.org/abs/2602.15848
作者: Andrius Matšenas,Anet Lello,Tõnis Lees,Hans Peep,Kim Lilii Tamm
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures, 4 tables, 2 appendices
Abstract:This study validates Large Language Models (LLMs) as a dynamic alternative to questionnaire-based personality assessment. Using a within-subjects experiment (N=33), we compared Big Five personality scores derived from guided LLM conversations against the gold-standard IPIP-50 questionnaire, while also measuring user-perceived accuracy. Results indicate moderate convergent validity (r=0.38-0.58), with Conscientiousness, Openness, and Neuroticism scores statistically equivalent between methods. Agreeableness and Extraversion showed significant differences, suggesting trait-specific calibration is needed. Notably, participants rated LLM-generated profiles as equally accurate as traditional questionnaire results. These findings suggest conversational AI offers a promising new approach to traditional psychometrics.
[NLP-69] Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models
【速读】: 该论文试图解决的问题是:在大型语言模型(Large Language Models, LLMs)中,人格特质控制是否可以独立实现,即是否存在不同人格特质(如大五人格中的开放性、尽责性等)之间的几何独立性。现有方法通常依赖于注入特定人格特质的引导向量(steering vectors),隐含假设各人格特质可被独立操控。论文的关键解决方案在于通过分析两类模型(LLaMA-3-8B 和 Mistral-8B)中提取的人格引导方向之间的几何关系,引入多种几何约束策略(包括无约束、软正交化和硬正交化),从而系统评估人格特质间的耦合程度。结果表明,即使去除线性重叠,人格引导方向仍表现出显著的几何依赖性,说明人格特质在模型中占据的是一个略微耦合的子空间,这限制了真正意义上的独立控制能力。
链接: https://arxiv.org/abs/2602.15847
作者: Pranav Bhandari,Usman Naseem,Mehwish Nasim
机构: The University of Western Australia (西澳大利亚大学); Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Personality steering in large language models (LLMs) commonly relies on injecting trait-specific steering vectors, implicitly assuming that personality traits can be controlled independently. In this work, we examine whether this assumption holds by analysing the geometric relationships between Big Five personality steering directions. We study steering vectors extracted from two model families (LLaMA-3-8B and Mistral-8B) and apply a range of geometric conditioning schemes, from unconstrained directions to soft and hard orthonormalisation. Our results show that personality steering directions exhibit substantial geometric dependence: steering one trait consistently induces changes in others, even when linear overlap is explicitly removed. While hard orthonormalisation enforces geometric independence, it does not eliminate cross-trait behavioural effects and can reduce steering strength. These findings suggest that personality traits in LLMs occupy a slightly coupled subspace, limiting fully independent trait control.
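文中的 hard orthonormalisation 可以用 QR 分解落地:先强制五个特质方向两两正交,再注入隐状态。下面是一个最小示意(注入层位与强度 alpha 为假设):

```python
import torch

def orthonormalize(steering: torch.Tensor) -> torch.Tensor:
    """对 (5, d) 的特质引导方向做硬正交化(QR 分解),返回两两正交的单位方向。"""
    q, _ = torch.linalg.qr(steering.T)     # q 的各列即正交化后的方向
    return q.T[: steering.shape[0]]

def steer(h: torch.Tensor, v: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """把某一特质方向 v 注入隐状态 h(层位与 alpha 为假设)。"""
    return h + alpha * v

dirs = orthonormalize(torch.randn(5, 4096))
print((dirs @ dirs.T).round(decimals=2))   # ≈ 5×5 单位阵,方向两两正交
```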
[NLP-70] Gated Tree Cross-attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs
[Quick Read]: This paper addresses the brittleness of decoder-only Large Language Models (LLMs) to minor grammatical perturbations, which undermines their reliability on downstream reasoning tasks. The key to the solution is a checkpoint-compatible Gated Tree Cross-Attention (GTCA) branch that reads a precomputed constituency chunk memory without changing the backbone architecture, and that uses a token update mask together with staged training to control the scope and timing of structural updates. This strengthens syntactic robustness while leaving Multiple-Choice QA performance and commonsense reasoning intact.
Link: https://arxiv.org/abs/2602.15846
Authors: Xinyu Gao, Shaonan Wang, Nai Ding
Affiliations: Zhejiang University; The Hong Kong Polytechnic University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Decoder-only large language models achieve strong broad performance but are brittle to minor grammatical perturbations, undermining reliability for downstream reasoning. However, directly injecting explicit syntactic structure into an existing checkpoint can interfere with its pretrained competence. We introduce a checkpoint-compatible gated tree cross-attention (GTCA) branch that reads precomputed constituency chunk memory while leaving backbone architecture unchanged. Our design uses a token update mask and staged training to control the scope and timing of structural updates. Across benchmarks and Transformer backbones, GTCA strengthens syntactic robustness beyond continued-training baselines without compromising Multiple-Choice QA performance or commonsense reasoning, providing a practical checkpoint-compatible route to more syntax-robust decoder-only LLMs.
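A gated cross-attention side branch of this kind can be sketched compactly. The code below is a minimal, hypothetical rendering of the mechanism the abstract names: a backbone-preserving branch whose gate starts near zero and whose updates are confined by a token update mask. Class and argument names are illustrative; the paper's actual GTCA design may differ.

```python
import torch
import torch.nn as nn

class GatedTreeCrossAttention(nn.Module):
    """Minimal sketch of a gated cross-attention side branch.

    Assumptions (not the paper's code): backbone hidden states attend to a
    precomputed constituency-chunk memory; a learned gate initialized at zero
    and a per-token update mask limit where structure is injected.
    """
    def __init__(self, d_model: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed: no disruption

    def forward(self, hidden, chunk_memory, update_mask):
        # hidden: (B, T, D) backbone states; chunk_memory: (B, M, D)
        # update_mask: (B, T), 1.0 where structural updates are allowed
        attn_out, _ = self.attn(hidden, chunk_memory, chunk_memory)
        gated = torch.tanh(self.gate) * attn_out
        return hidden + update_mask.unsqueeze(-1) * gated
```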
[NLP-71] KD4MT: A Survey of Knowledge Distillation for Machine Translation
[Quick Read]: This paper systematically surveys Knowledge Distillation (KD) for Machine Translation (MT), addressing the lack of unified evaluation standards for KD methods in MT, their potential risks (such as increased hallucination and bias amplification), and the shortage of guidance on the diverse methods and their practical use. The key elements of the solution are: first, a synthesis of 105 papers that categorizes the KD4MT literature by methodological contributions and practical applications; second, practical guidelines for selecting a KD method together with the main research gaps in the field; and third, a discussion of how Large Language Models (LLMs) are reshaping the KD4MT paradigm, complemented by a public database and a glossary of key terms to support further research.
Link: https://arxiv.org/abs/2602.15845
Authors: Ona de Gibert, Joseph Attieh, Timothee Mickus, Yves Scherrer, Jörg Tiedemann
Affiliations: University of Helsinki; University of Oslo
Subjects: Computation and Language (cs.CL)
Comments: Pre-print under review, submitted to the Computational Linguistics Journal
Abstract:Knowledge Distillation (KD) as a research area has gained a lot of traction in recent years as a compression tool to address challenges related to ever-larger models in NLP. Remarkably, Machine Translation (MT) offers a much more nuanced take on this narrative: in MT, KD also functions as a general-purpose knowledge transfer mechanism that shapes supervision and translation quality as well as efficiency. This survey synthesizes KD for MT (KD4MT) across 105 papers (through October 1, 2025). We begin by introducing both MT and KD for non-experts, followed by an overview of the standard KD approaches relevant to MT applications. Subsequently, we categorize advances in the KD4MT literature based on (i) their methodological contributions and (ii) their practical applications. Our qualitative and quantitative analyses identify common trends in the field and highlight key research gaps as well as the absence of unified evaluation practice for KD methods in MT. We further provide practical guidelines for selecting a KD method in concrete settings and highlight potential risks associated with the application of KD to MT such as increased hallucination and bias amplification. Finally, we discuss the role of LLMs in re-shaping the KD4MT field. To support further research, we complement our survey with a publicly available database summarizing the main characteristics of the surveyed KD methods and a glossary of key terms.
[NLP-72] Language Model Representations for Efficient Few-Shot Tabular Classification WWW’26
[Quick Read]: This paper investigates how already-deployed Large Language Models (LLMs) can be used for few-shot classification of web-native tables (such as product catalogs, knowledge-base exports, and scientific data portals) without training specialized per-table models or extensive fine-tuning. The core challenge is that the heterogeneous structure and semantics of such tables resist unified modeling, and naively using LLM-generated row-level semantic embeddings underperforms. The key to the solution is two techniques: removing the common component from all embeddings to sharpen their discriminative power, and calibrating the softmax temperature to improve classification confidence; a lightweight meta-learner over handcrafted features further predicts an appropriate temperature automatically. The method matches state-of-the-art models in low-data regimes (k ≤ 32) of semantically rich tables, demonstrating that existing LLM infrastructure can be reused for efficient, semantics-driven Web table understanding.
Link: https://arxiv.org/abs/2602.15844
Authors: Inwon Kang, Parikshit Ram, Yi Zhou, Horst Samulowitz, Oshani Seneviratne
Affiliations: Rensselaer Polytechnic Institute; IBM Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to WWW'26
Abstract:The Web is a rich source of structured data in the form of tables, from product catalogs and knowledge bases to scientific datasets. However, the heterogeneity of the structure and semantics of these tables makes it challenging to build a unified method that can effectively leverage the information they contain. Meanwhile, large language models (LLMs) are becoming an increasingly integral component of web infrastructure for tasks like semantic search. This raises a crucial question: can we leverage these already-deployed LLMs to classify structured data in web-native tables (e.g., product catalogs, knowledge base exports, scientific data portals), avoiding the need for specialized models or extensive retraining? This work investigates a lightweight paradigm, Table Representation with Language Model (TaRL), for few-shot tabular classification that directly utilizes semantic embeddings of individual table rows. We first show that naive application of these embeddings underperforms compared to specialized tabular models. We then demonstrate that their potential can be unlocked with two key techniques: removing the common component from all embeddings and calibrating the softmax temperature. We show that a simple meta-learner, trained on handcrafted features, can learn to predict an appropriate temperature. This approach achieves performance comparable to state-of-the-art models in low-data regimes ( k \leq 32 ) of semantically-rich tables. Our findings demonstrate the viability of reusing existing LLM infrastructure as an efficient, semantics-driven pathway to Web table understanding.
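The two key techniques named in the abstract, common-component removal and temperature calibration, are simple enough to sketch end to end. The NumPy snippet below is an illustrative nearest-centroid classifier under those two operations; the meta-learned temperature predictor is replaced by a plain `temperature` argument, and all function names are assumptions rather than the TaRL implementation.

```python
import numpy as np

def few_shot_classify(support_emb, support_y, query_emb, temperature=1.0):
    """Nearest-centroid few-shot classification over LLM row embeddings.

    Sketch of the abstract's two techniques: (1) remove the common component
    (the mean embedding) from all vectors, (2) apply a calibrated softmax
    temperature to the similarity logits.
    """
    common = np.mean(np.vstack([support_emb, query_emb]), axis=0)
    s = support_emb - common            # de-biased support embeddings
    q = query_emb - common              # de-biased query embeddings
    classes = np.unique(support_y)
    centroids = np.stack([s[support_y == c].mean(axis=0) for c in classes])
    # cosine similarity between queries and class centroids
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    logits = q @ centroids.T / temperature
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return classes[np.argmax(probs, axis=1)], probs
```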
[NLP-73] The Perplexity Paradox: Why Code Compresses Better Than Math in LLM Prompts
[Quick Read]: This paper investigates why code generation and chain-of-thought reasoning respond so differently to prompt compression, validates the "perplexity paradox" mechanism behind that gap, and proposes a task-aware adaptive compression algorithm. The key to the solution is TAAC (Task-Aware Adaptive Compression), which adjusts the prompt compression ratio per task type and achieves a 22% cost reduction while preserving 96% of generation quality, outperforming fixed-ratio compression by 7%; its effectiveness and generalization are validated across multiple code and reasoning benchmarks.
Link: https://arxiv.org/abs/2602.15843
Authors: Warren Johnson
Affiliations: Bona Opera Studios
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 figures, 4 tables. Second paper in the TAAC research series. Code and data at this https URL
Abstract:In “Compress or Route?” (Johnson, 2026), we found that code generation tolerates aggressive prompt compression (r = 0.6) while chain-of-thought reasoning degrades gradually. That study was limited to HumanEval (164 problems), left the “perplexity paradox” mechanism unvalidated, and provided no adaptive algorithm. This paper addresses all three gaps. First, we validate across six code benchmarks (HumanEval, MBPP, HumanEval+, MultiPL-E) and four reasoning benchmarks (GSM8K, MATH, ARC-Challenge, MMLU-STEM), confirming the compression threshold generalizes across languages and difficulties. Second, we conduct the first per-token perplexity analysis (n=723 tokens), revealing a “perplexity paradox”: code syntax tokens are preserved (high perplexity) while numerical values in math problems are pruned despite being task-critical (low perplexity). Signature injection recovers +34 percentage points in pass rate (5.3% to 39.3%; Cohen’s h=0.890). Third, we propose TAAC (Task-Aware Adaptive Compression), achieving 22% cost reduction with 96% quality preservation, outperforming fixed-ratio compression by 7%. MBPP validation (n=1,800 trials) confirms systematic variation: 3.6% at r=0.3 to 54.6% at r=1.0.
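TAAC's core idea, choosing the compression ratio from the task type, can be illustrated with a toy routing table. The sketch below uses only ratios suggested by the abstract (code tolerates r = 0.6; reasoning degrades, so it is compressed less); the routing function, the importance-score pruning, and every name here are assumptions, not the paper's released code.

```python
def choose_compression_ratio(task_type: str) -> float:
    """Illustrative task-aware ratio table (values assumed from the abstract:
    code tolerates aggressive compression; math/reasoning should keep more)."""
    ratios = {"code": 0.6, "math": 0.9, "reasoning": 0.9}
    return ratios.get(task_type, 1.0)  # default: no compression

def compress_prompt(tokens, ratio, keep_scores):
    """Keep the highest-scoring fraction `ratio` of tokens, preserving order.
    `keep_scores` is any per-token importance signal (e.g., perplexity)."""
    k = max(1, int(len(tokens) * ratio))
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -keep_scores[i])[:k])
    return [tokens[i] for i in keep]
```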
[NLP-74] Memes-as-Replies: Can Models Select Humorous Manga Panel Responses?
[Quick Read]: This paper addresses the shortcomings of current generative AI in understanding and producing contextually humorous replies, in particular the understudied, dynamically pragmatic use of memes as interactive responses in web culture. The key to the solution is constructing and releasing the MaMe-Re benchmark, a large dataset of 100,000 human-annotated pairs of Japanese manga panels and social media posts (500,000 annotations from 2,325 unique annotators), which provides a standardized testbed for evaluating whether models can select humorous responses in realistic social contexts. The analysis shows that while large language models (LLMs) show preliminary ability to capture complex social cues such as exaggeration, they remain clearly limited in integrating visual information and distinguishing subtle semantic differences, highlighting contextual humor understanding as an open challenge for current models.
Link: https://arxiv.org/abs/2602.15842
Authors: Ryosuke Kohita, Seiichiro Yoshioka
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Memes are a popular element of modern web communication, used not only as static artifacts but also as interactive replies within conversations. While computational research has focused on analyzing the intrinsic properties of memes, the dynamic and contextual use of memes to create humor remains an understudied area of web science. To address this gap, we introduce the Meme Reply Selection task and present MaMe-Re (Manga Meme Reply Benchmark), a benchmark of 100,000 human-annotated pairs (500,000 total annotations from 2,325 unique annotators) consisting of openly licensed Japanese manga panels and social media posts. Our analysis reveals three key insights: (1) large language models (LLMs) show preliminary evidence of capturing complex social cues such as exaggeration, moving beyond surface-level semantic matching; (2) the inclusion of visual information does not improve performance, revealing a gap between understanding visual content and effectively using it for contextual humor; (3) while LLMs can match human judgments in controlled settings, they struggle to distinguish subtle differences in wit among semantically similar candidates. These findings suggest that selecting contextually humorous replies remains an open challenge for current models.
[NLP-75] Lyapunov Spectral Analysis of Speech Embedding Trajectories in Psychosis
[Quick Read]: This paper investigates whether the dynamics of language production can distinguish psychotic patients from healthy controls, i.e., whether nonlinear dynamical properties hidden in speech are stable markers of disordered cognition. The key to the solution is treating language production as a high-dimensional dynamical system and computing Lyapunov exponent (LE) spectra over word-level and answer-level semantic embeddings produced by two distinct large language models (LLMs) to quantify dynamical stability. The study finds that word-level embeddings exhibit uniformly contracting dynamics (no positive LEs), while answer-level embeddings display several positive LEs and higher-dimensional attractors, and that the LE spectra robustly separate psychotic from healthy speech at the group level, suggesting that such nonlinear dynamical invariants provide a physics-inspired probe of disordered cognition whose conclusions are robust to the choice of embedding model.
Link: https://arxiv.org/abs/2602.16273
Authors: Jelena Vasic, Branislav Andjelic, Ana Mancic, Dusica Filipovic Djurdjevic, Ljiljana Mihic, Aleksandar Kovacevic, Nadja P. Maric, Aleksandra Maluckov
Affiliations: Institute of Mental Health; Faculty of Technical Sciences, University of Novi Sad; Faculty of Sciences and Mathematics, University of Niš; Faculty of Philosophy, University of Belgrade; Faculty of Philosophy, University of Novi Sad; Faculty of Medicine, University of Belgrade; Vinča Institute of Nuclear Sciences, National Institute of the Republic of Serbia, University of Belgrade
Subjects: Adaptation and Self-Organizing Systems (nlin.AO); Computation and Language (cs.CL)
Comments: 14 pages, 3 figures
Abstract:We analyze speech embeddings from structured clinical interviews of psychotic patients and healthy controls by treating language production as a high-dimensional dynamical process. Lyapunov exponent (LE) spectra are computed from word-level and answer-level embeddings generated by two distinct large language models, allowing us to assess the stability of the conclusions with respect to different embedding presentations. Word-level embeddings exhibit uniformly contracting dynamics with no positive LE, while answer-level embeddings, in spite of the overall contraction, display a number of positive LEs and higher-dimensional attractors. The resulting LE spectra robustly separate psychotic from healthy speech, while differentiation within the psychotic group is not statistically significant overall, despite a tendency of the most severe cases to occupy distinct dynamical regimes. These findings indicate that nonlinear dynamical invariants of speech embeddings provide a physics-inspired probe of disordered cognition whose conclusions remain stable across embedding models.
[NLP-76] Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance
[Quick Read]: This paper tests whether the performance of large language models (LLMs) under fixed conditions (same model snapshot, hyperparameters, and prompt) is time-invariant, an assumption underlying much current LLM-based research; if average performance varies systematically over time, the reliability, validity, and reproducibility of findings are threatened. The key to the solution is a longitudinal experiment that queried GPT-4o via the API on the same multiple-choice physics task every three hours for roughly three months under fixed conditions, averaging the scores of ten independent responses at each time point to build a time series. Fourier spectral analysis reveals notable periodic variability in average performance, accounting for about 20% of the total variance and well explained by interacting daily and weekly rhythms, showing that LLM performance can vary periodically even under tightly controlled conditions and calling the time-invariance assumption into question.
Link: https://arxiv.org/abs/2602.15889
Authors: Paul Tschisgale, Peter Wulff
Affiliations: Leibniz Institute for Science and Mathematics Education; Ludwigsburg University of Education
Subjects: Applications (stat.AP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Physics Education (physics.ed-ph)
Comments:
Abstract:Large language models (LLMs) are increasingly used in research both as tools and as objects of investigation. Much of this work implicitly assumes that LLM performance under fixed conditions (identical model snapshot, hyperparameters, and prompt) is time-invariant. If average output quality changes systematically over time, this assumption is violated, threatening the reliability, validity, and reproducibility of findings. To empirically examine this assumption, we conducted a longitudinal study on the temporal variability of GPT-4o’s average performance. Using a fixed model snapshot, fixed hyperparameters, and identical prompting, GPT-4o was queried via the API to solve the same multiple-choice physics task every three hours for approximately three months. Ten independent responses were generated at each time point and their scores were averaged. Spectral (Fourier) analysis of the resulting time series revealed notable periodic variability in average model performance, accounting for approximately 20% of the total variance. In particular, the observed periodic patterns are well explained by the interaction of a daily and a weekly rhythm. These findings indicate that, even under controlled conditions, LLM performance may vary periodically over time, calling into question the assumption of time invariance. Implications for ensuring validity and replicability of research that uses or investigates LLMs are discussed.
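The spectral analysis in this study is standard enough to reproduce in a few lines. The sketch below computes a Fourier periodogram for a series sampled every three hours, as in the experiment, so a daily rhythm should show power near the 24-hour period and a weekly rhythm near 168 hours; the function names are illustrative.

```python
import numpy as np

def periodogram(scores, sample_hours=3.0):
    """Fourier periodogram of an evenly sampled performance series.

    Matches the study design: one observation every 3 hours, so a daily
    rhythm appears at period 24 h and a weekly rhythm at 168 h.
    """
    x = np.asarray(scores, dtype=float)
    x = x - x.mean()                                  # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=sample_hours)   # cycles per hour
    periods = 1.0 / freqs[1:]                         # hours per cycle; drop DC bin
    return periods, power[1:]

# usage sketch: inspect power near the daily (24 h) and weekly (168 h) periods
# periods, power = periodogram(scores_every_3h)
# daily_power = power[np.argmin(np.abs(periods - 24.0))]
```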
Information Retrieval
[IR-0] Neighborhood Stability as a Measure of Nearest Neighbor Searchability
[Quick Read]: This paper addresses the lack of analytical tools for clustering-based Approximate Nearest Neighbor Search (ANNS), namely how to decide whether a high-dimensional dataset is suited to clustering-based nearest neighbor search, a property the authors call "searchability", without committing to a specific search algorithm or parameters. The key to the solution is two computable measures: the Clustering-Neighborhood Stability Measure (clustering-NSM), an internal measure of clustering quality that predicts ANNS accuracy, and the Point-Neighborhood Stability Measure (point-NSM), a measure of the dataset's clusterability that predicts clustering-NSM. Both are functions of nearest-neighbor relationships between points rather than of the distances themselves, making them applicable to various distance functions (including inner product), so searchability can be judged from the data points alone.
Link: https://arxiv.org/abs/2602.16673
Authors: Thomas Vecchiato, Sebastian Bruch
Affiliations: Northeastern University; University of Copenhagen
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:
Abstract:Clustering-based Approximate Nearest Neighbor Search (ANNS) organizes a set of points into partitions, and searches only a few of them to find the nearest neighbors of a query. Despite its popularity, there are virtually no analytical tools to determine the suitability of clustering-based ANNS for a given dataset – what we call “searchability.” To address that gap, we present two measures for flat clusterings of high-dimensional points in Euclidean space. First is Clustering-Neighborhood Stability Measure (clustering-NSM), an internal measure of clustering quality – a function of a clustering of a dataset – that we show to be predictive of ANNS accuracy. The second, Point-Neighborhood Stability Measure (point-NSM), is a measure of clusterability – a function of the dataset itself – that is predictive of clustering-NSM. The two together allow us to determine whether a dataset is searchable by clustering-based ANNS given only the data points. Importantly, both are functions of nearest neighbor relationships between points, not distances, making them applicable to various distance functions including inner product.
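The abstract does not give the NSM formulas, so the snippet below is only a plausible proxy in their spirit: score a clustering by how often each point's k nearest neighbors share its cluster, built on neighbor relationships rather than raw distances. The exact definition, the function name, and the Euclidean choice are all assumptions.

```python
import numpy as np

def neighborhood_stability(points, labels, k=10):
    """Illustrative stability proxy in the spirit of clustering-NSM.

    For each point, measure the fraction of its k nearest neighbors assigned
    to the same cluster, then average over all points. Any distance could
    replace the Euclidean one used here.
    """
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)           # a point is not its own neighbor
    knn = np.argsort(d2, axis=1)[:, :k]    # brute-force kNN; fine for small n
    same = labels[knn] == labels[:, None]
    return same.mean()
```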
[IR-1] ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models
【速读】:该论文旨在解决当前多向量模型(multi-vector models)性能受限于小规模知识蒸馏(Knowledge Distillation, KD)训练步骤的问题,即仅依赖强单向量模型的预训练结果进行微调难以充分挖掘多向量架构的潜力。其解决方案的关键在于通过大规模的多向量预训练(multi-vector pre-training)显著提升模型性能,具体表现为:提出完全基于公开数据预训练的ColBERT-Zero模型,在不使用封闭数据的情况下超越了GTE-ModernColBERT及其基础模型GTE-ModernBERT;同时发现,若在KD前加入监督微调步骤,可在跳过最昂贵的无监督预训练阶段的前提下获得更接近全预训练效果的性能。此外,研究强调了在迁移现有模型时保持预训练与微调设置一致性的关键作用。
链接: https://arxiv.org/abs/2602.16609
作者: Antoine Chaffin,Luca Arnaboldi,Amélie Chatelain,Florent Krzakala
机构: LightOn(法国); Ecole Polytechnique Fédérale de Lausanne (EPFL), IdePHICS Lab(瑞士洛桑联邦理工学院,IdePHICS 实验室)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 9 pages, 5 tables, 2 figures
Abstract:Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting a new state of the art for models of this size. We also find that, although performing only a small KD step is not enough to achieve results close to full pre-training, adding a supervised step beforehand achieves much closer performance while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable exploration of our results, we release various checkpoints as well as the code used to train them.
[IR-2] Why Thinking Hurts? Diagnosing and Rectifying the Reasoning Shift in Foundation Recommender Models
【速读】:该论文旨在解决将思维链(Chain-of-Thought, CoT)推理引入基于语义ID的推荐基础模型(如OpenOneRec)时,常出现性能下降的问题。其核心原因是来自通用子空间(General Subspace)的文本惯性,导致冗长的推理过程主导了推理路径,使模型忽视关键的语义ID信息。解决方案的关键在于提出一种无需训练的推理时子空间对齐框架(Inference-Time Subspace Alignment),通过压缩推理链并采用偏置减去的对比解码策略,有效缓解无根基的文本漂移问题,从而在不牺牲语义ID引导准确性的前提下,使基础模型能够合理利用推理能力。
链接: https://arxiv.org/abs/2602.16587
作者: Luankang Zhang,Yonghao Huang,Hang Lv,Mingjia Yin,Liangyue Li,Zulong Chen,Hao Wang,Enhong Chen
机构: University of Science and Technology of China (中国科学技术大学); China Alibaba Group (阿里巴巴集团)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Integrating Chain-of-Thought (CoT) reasoning into Semantic ID-based recommendation foundation models (such as OpenOneRec) often paradoxically degrades recommendation performance. We identify the root cause as textual inertia from the General Subspace, where verbose reasoning dominates inference and causes the model to neglect critical Semantic ID. To address this, we propose a training-free Inference-Time Subspace Alignment framework. By compressing reasoning chains and applying bias-subtracted contrastive decoding, our approach mitigates ungrounded textual drift. Experiments show this effectively calibrates inference, allowing foundation models to leverage reasoning without sacrificing ID-grounded accuracy.
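Bias-subtracted contrastive decoding reduces, at its core, to a logit arithmetic step at inference time. The sketch below shows that step under an explicit assumption about where the bias logits come from; it is not the paper's implementation, and `alpha` is an illustrative parameter.

```python
import torch

def contrastive_next_token(logits_with_reasoning, logits_text_only, alpha=0.5):
    """Minimal sketch of bias-subtracted contrastive decoding.

    Assumption: `logits_text_only` comes from a forward pass that captures the
    ungrounded textual bias (e.g., reasoning without Semantic ID grounding)
    and is down-weighted against the full conditional logits.
    """
    adjusted = logits_with_reasoning - alpha * logits_text_only
    return torch.argmax(adjusted, dim=-1)   # greedy pick over the Semantic ID vocab
```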
[IR-3] From Latent to Observable Position-Based Click Models in Carousel Interfaces
【速读】:该论文旨在解决现有点击模型(Click Models)在复杂推荐界面(如轮播图carousel)中建模用户行为能力不足的问题。传统点击模型主要针对单一排序列表设计,难以刻画用户在多列表滑动浏览场景下的真实交互模式。其关键解决方案是提出三种专为轮播界面设计的位置相关点击模型,其中最具创新性的是观察到的检查位置基础模型(Observed Examination Position-Based Model, OEPBM)——这是首个不依赖潜在变量、直接利用眼动追踪数据中的显式检查信号的位置基础模型。OEPBM通过引入可测量的用户检查行为,显著提升了点击预测性能,并更贴近实际用户浏览路径,但研究也揭示了仅基于点击数据的模型即使拟合良好,仍可能无法准确反映用户的实际检查与浏览行为,凸显了在复杂界面中融合多源行为信号的重要性。
链接: https://arxiv.org/abs/2602.16541
作者: Santiago de Leon-Martinez,Robert Moro,Branislav Kveton,Maria Bielikova
机构: Brno University of Technology (布林诺理工大学); Kempelen Institute of Intelligent Technologies (Kempelen智能技术研究所); Adobe Research (Adobe 研究院)
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
备注:
Abstract:Click models are a central component of learning and evaluation in recommender systems, yet most existing models are designed for single ranked-list interfaces. In contrast, modern recommender platforms increasingly use complex interfaces such as carousels, which consist of multiple swipeable lists that enable complex user browsing behaviors. In this paper, we study position-based click models in carousel interfaces and examine optimization methods, model structure, and alignment with user behavior. We propose three novel position-based models tailored to carousels, including the first position-based model without latent variables that incorporates observed examination signals derived from eye tracking data, called the Observed Examination Position-Based Model (OEPBM). We develop a general implementation of these carousel click models, supporting multiple optimization techniques, and conduct experiments comparing gradient-based methods with classical approaches, namely expectation-maximization and maximum likelihood estimation. Our results show that gradient-based optimization consistently achieves better click likelihoods. Among the evaluated models, the OEPBM achieves the strongest performance in click prediction and produces examination patterns that most closely align with user behavior. However, we also demonstrate that strong click fit does not imply realistic modeling of user examination and browsing patterns. This reveals a fundamental limitation of click-only models in complex interfaces and the need for incorporating additional behavioral signals when designing click models for carousel-based recommender systems.
[IR-4] Variable-Length Semantic IDs for Recommender Systems
【速读】:该论文旨在解决推荐系统中因物品空间基数过大导致的生成式建模困难问题,以及自然语言与物品标识符之间的词汇鸿沟问题。现有方法虽引入语义标识符(semantic IDs)以降低物品表示维度,但其固定长度的编码方式忽略了真实场景中物品频率分布的高度偏斜特性——热门物品与长尾物品的信息需求差异显著,从而造成效率低下且不符合自然语言表达习惯。解决方案的关键在于将推荐系统与涌现通信(emergent communication)研究相融合,提出一种基于离散变分自编码器(discrete variational autoencoder)的可变长度语义标识符机制,利用Gumbel-Softmax重参数化在概率框架下学习自适应长度的物品表示,避免了REINFORCE类训练的不稳定性及传统固定长度方法的局限性。
链接: https://arxiv.org/abs/2602.16375
作者: Kirill Khrylchenko
机构: HSE University (高等经济大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Generative models are increasingly used in recommender systems, both for modeling user behavior as event sequences and for integrating large language models into recommendation pipelines. A key challenge in this setting is the extremely large cardinality of item spaces, which makes training generative models difficult and introduces a vocabulary gap between natural language and item identifiers. Semantic identifiers (semantic IDs), which represent items as sequences of low-cardinality tokens, have recently emerged as an effective solution to this problem. However, existing approaches generate semantic identifiers of fixed length, assigning the same description length to all items. This is inefficient, misaligned with natural language, and ignores the highly skewed frequency structure of real-world catalogs, where popular items and rare long-tail items exhibit fundamentally different information requirements. In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions. Despite the conceptual similarity, these ideas have not been systematically adopted in recommender systems. In this work, we bridge recommender systems and emergent communication by introducing variable-length semantic identifiers for recommendation. We propose a discrete variational autoencoder with Gumbel-Softmax reparameterization that learns item representations of adaptive length under a principled probabilistic framework, avoiding the instability of REINFORCE-based training and the fixed-length constraints of prior semantic ID methods.
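A Gumbel-Softmax quantizer with adaptive length can be sketched by reserving a STOP code that truncates the identifier. The PyTorch snippet below is one such hypothetical construction; the STOP convention, the vocabulary size, and the maximum length are all assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariableLengthQuantizer(nn.Module):
    """Sketch of adaptive-length semantic IDs via Gumbel-Softmax.

    Each of `max_len` positions samples one of `vocab` codes plus a dedicated
    STOP code (index 0); once STOP is sampled, later positions are masked out,
    so frequent items can learn shorter identifiers.
    """
    def __init__(self, d_item: int, vocab: int = 256, max_len: int = 8):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_item, vocab + 1) for _ in range(max_len)])

    def forward(self, item_emb: torch.Tensor, tau: float = 1.0):
        alive = torch.ones(item_emb.size(0), device=item_emb.device)
        codes, masks = [], []
        for head in self.heads:
            one_hot = F.gumbel_softmax(head(item_emb), tau=tau, hard=True)
            codes.append(one_hot)
            masks.append(alive)                     # position counts iff still alive
            alive = alive * (1.0 - one_hot[:, 0])   # sampling STOP ends the code
        return torch.stack(codes, dim=1), torch.stack(masks, dim=1)
```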
[IR-5] The Diversity Paradox Revisited: Systemic Effects of Feedback Loops in Recommender Systems
[Quick Read]: This paper addresses the poor understanding of the systemic effects of feedback loops in existing recommender-system research, caused in part by unrealistic assumptions in simulation studies. The key to the solution is a feedback-loop model that captures implicit feedback, periodic retraining, probabilistic adoption of recommendations, and heterogeneous recommender systems, applied empirically to online retail and music streaming data. The framework reveals that higher adoption of recommendations can create the illusion of growing individual consumption diversity, while individual diversity in fact decreases over time and collective demand is redistributed in model- and domain-dependent ways that often amplify popularity concentration, underscoring the need to account for feedback-loop dynamics when designing recommender systems.
Link: https://arxiv.org/abs/2602.16315
Authors: Gabriele Barlacchi, Margherita Lalli, Emanuele Ferragina, Fosca Giannotti, Dino Pedreschi, Luca Pappalardo
Affiliations: Scuola Normale Superiore; Università di Pisa; Sciences Po; CNR
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recommender systems shape individual choices through feedback loops in which user behavior and algorithmic recommendations coevolve over time. The systemic effects of these loops remain poorly understood, in part due to unrealistic assumptions in existing simulation studies. We propose a feedback-loop model that captures implicit feedback, periodic retraining, probabilistic adoption of recommendations, and heterogeneous recommender systems. We apply the framework on online retail and music streaming data and analyze systemic effects of the feedback loop. We find that increasing recommender adoption may lead to a progressive diversification of individual consumption, while collective demand is redistributed in model- and domain-dependent ways, often amplifying popularity concentration. Temporal analyses further reveal that apparent increases in individual diversity observed in static evaluations are illusory: when adoption is fixed and time unfolds, individual diversity consistently decreases across all models. Our results highlight the need to move beyond static evaluations and explicitly account for feedback-loop dynamics when designing recommender systems.
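The ingredients the abstract names (implicit feedback, periodic retraining, and probabilistic adoption) can be combined into a toy simulation loop. The sketch below stands in a popularity model for the retrained recommender and reports a simple concentration diagnostic; every modeling choice here is an assumption, not the paper's framework.

```python
import numpy as np

def simulate_feedback_loop(interactions, adoption=0.5, rounds=20, top_k=10, seed=0):
    """Toy feedback loop: retrain a stand-in popularity recommender each round,
    let each user adopt a recommendation with probability `adoption`, and log
    the result as implicit feedback for the next round."""
    rng = np.random.default_rng(seed)
    log = interactions.astype(float).copy()          # (users, items) count matrix
    n_users, n_items = log.shape
    for _ in range(rounds):
        top = np.argsort(-log.sum(axis=0))[:top_k]   # retrain: refit popularity
        for u in range(n_users):
            if rng.random() < adoption:              # user adopts a recommendation
                log[u, rng.choice(top)] += 1
            else:                                    # or consumes organically
                log[u, rng.integers(n_items)] += 1
    top_share = log[:, np.argsort(-log.sum(axis=0))[:top_k]].sum() / log.sum()
    return log, top_share                            # concentration diagnostic
```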
[IR-6] MICE: Minimal Interaction Cross-Encoders for Efficient Re-ranking
[Quick Read]: This paper addresses the high inference cost of cross-encoders in information retrieval, which prevents their use as first-stage rankers and makes re-ranking expensive. The key to the solution is an in-depth analysis of the internal mechanisms of cross-encoders that identifies and removes detrimental or unnecessary interactions, yielding a new lightweight architecture, MICE (Minimal Interaction Cross-Encoders). MICE retains most of the cross-encoder's in-domain (ID) ranking effectiveness while cutting inference latency to a quarter of that of a standard cross-encoder, and it shows out-of-domain (OOD) generalization superior to standard cross-encoders with performance close to late-interaction models such as ColBERT.
Link: https://arxiv.org/abs/2602.16299
Authors: Mathias Vast, Victor Morand, Basile van Cooten, Laure Soulier, Josiane Mothe, Benjamin Piwowarski
Affiliations: Sinequa by ChapsVision, Paris, France; Sorbonne Université, CNRS, ISIR, Paris, France; University of Toulouse, IRIT, Toulouse, France
Subjects: Information Retrieval (cs.IR)
Comments: 9 pages, 5 figures
Abstract:Cross-encoders deliver state-of-the-art ranking effectiveness in information retrieval, but have a high inference cost. This prevents them from being used as first-stage rankers and also makes re-ranking documents costly. Prior work has addressed this bottleneck from two largely separate directions: accelerating cross-encoder inference by sparsifying the attention process, or improving first-stage retrieval effectiveness using more complex models, e.g. late-interaction ones. In this work, we propose to bridge these two approaches, based on an in-depth understanding of the internal mechanisms of cross-encoders. Starting from cross-encoders, we show that it is possible to derive a new late-interaction-like architecture by carefully removing detrimental or unnecessary interactions. We name this architecture MICE (Minimal Interaction Cross-Encoders). We extensively evaluate MICE across both in-domain (ID) and out-of-domain (OOD) datasets. MICE decreases inference latency fourfold compared to standard cross-encoders, matching late-interaction models like ColBERT while retaining most of the cross-encoders' ID effectiveness and demonstrating superior generalization abilities in OOD settings.
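For context, the late-interaction family from which MICE is derived scores query-document pairs with the MaxSim operator, which is easy to state in code. The sketch below is the generic ColBERT-style reduction, not MICE's specific interaction pattern.

```python
import torch

def maxsim_score(q_tokens: torch.Tensor, d_tokens: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction scoring.

    q_tokens: (Tq, D) query token embeddings; d_tokens: (Td, D) document token
    embeddings, both assumed L2-normalized. Each query token keeps only its
    best-matching document token; the score is the sum of those maxima.
    """
    sim = q_tokens @ d_tokens.T          # (Tq, Td) token-level similarities
    return sim.max(dim=1).values.sum()   # MaxSim reduction over the document
```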
[IR-7] Retrieval Collapses When AI Pollutes the Web WWW’26
【速读】:该论文旨在解决生成式 AI(Generative AI)内容在互联网上的快速扩散对信息检索系统造成的结构性风险,特别是检索增强生成(Retrieval-Augmented Generation, RAG)系统因过度依赖由大语言模型(Large Language Models, LLMs)生成的合成内容而引发的“检索坍塌”(Retrieval Collapse)问题。其核心解决方案在于识别并量化两种污染场景下的检索失效机制:一是高质量SEO内容主导搜索结果导致来源多样性下降;二是对抗性内容渗透检索管道造成质量退化。关键发现是,在高污染比例下,传统检索方法如BM25易暴露有害内容(约19%),而基于LLM的排序器展现出更强的抑制能力,表明需采用具备检索感知能力的策略来阻断合成证据驱动的质量下滑自我强化循环。
链接: https://arxiv.org/abs/2602.16136
作者: Hongyeon Yu,Dongchan Kim,Young-Bum Kim
机构: NAVER Corp.(NAVER公司)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 4 pages, Proceedings of The Web Conference 2026 (WWW '26)
Abstract:The rapid proliferation of AI-generated content on the Web presents a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by the Large Language Models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process where (1) AI-generated content dominates search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyzed this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, a 67% pool contamination led to over 80% exposure contamination, creating a homogenized yet deceptively healthy state where answer accuracy remains stable despite the reliance on synthetic sources. Conversely, under adversarial contamination, baselines like BM25 exposed \sim 19% of harmful content, whereas LLM-based rankers demonstrated stronger suppression capabilities. These findings highlight the risk of retrieval pipelines quietly shifting toward synthetic evidence and the need for retrieval-aware strategies to prevent a self-reinforcing cycle of quality decline in Web-grounded systems.
[IR-8] Rethinking ANN-based Retrieval: Multifaceted Learnable Index for Large-scale Recommendation System
【速读】:该论文旨在解决大规模推荐系统中基于近似最近邻(Approximate Nearest Neighbor, ANN)检索的两大局限性:一是物品嵌入(item embeddings)与索引通常分阶段学习,导致新物品检索质量不佳;二是ANN查询在服务端仍需对每个请求执行,带来显著的计算开销。解决方案的关键在于提出多面可学习索引(MultiFaceted Learnable Index, MFLI),其核心是通过残差量化构建多面层次码本(hierarchical codebook),并联合训练嵌入与码本,在统一框架内实现嵌入与索引的端到端优化;同时设计高效多面索引结构与实时更新机制,使服务阶段无需执行ANN搜索,直接利用学习得到的层次索引定位相关项目,从而显著提升召回率、冷启动内容覆盖率和语义相关性,并降低服务延迟与资源消耗。
链接: https://arxiv.org/abs/2602.16124
作者: Jiang Zhang,Yubo Wang,Wei Chang,Lu Han,Xingying Cheng,Feng Zhang,Min Li,Songhao Jiang,Wei Zheng,Harry Tran,Zhen Wang,Lei Chen,Yueming Wang,Benyu Zhang,Xiangjun Fan,Bi Xue,Qifan Wang
机构: Meta Platforms, Inc. (Meta)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Approximate nearest neighbor (ANN) search is widely used in the retrieval stage of large-scale recommendation systems. In this stage, candidate items are indexed using their learned embedding vectors, and ANN search is executed for each user (or item) query to retrieve a set of relevant items. However, ANN-based retrieval has two key limitations. First, item embeddings and their indices are typically learned in separate stages: indexing is often performed offline after embeddings are trained, which can yield suboptimal retrieval quality-especially for newly created items. Second, although ANN offers sublinear query time, it must still be run for every request, incurring substantial computation cost at industry scale. In this paper, we propose MultiFaceted Learnable Index (MFLI), a scalable, real-time retrieval paradigm that learns multifaceted item embeddings and indices within a unified framework and eliminates ANN search at serving time. Specifically, we construct a multifaceted hierarchical codebook via residual quantization of item embeddings and co-train the codebook with the embeddings. We further introduce an efficient multifaceted indexing structure and mechanisms that support real-time updates. At serving time, the learned hierarchical indices are used directly to identify relevant items, avoiding ANN search altogether. Extensive experiments on real-world data with billions of users show that MFLI improves recall on engagement tasks by up to 11.8%, cold-content delivery by up to 57.29%, and semantic relevance by 13.5% compared with prior state-of-the-art methods. We also deploy MFLI in the system and report online experimental results demonstrating improved engagement, less popularity bias, and higher serving efficiency.
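The hierarchical codes at the heart of MFLI come from residual quantization, which is compact to illustrate. The snippet below is a generic residual-quantization sketch (the codebook sizes and the absence of co-training are simplifying assumptions), not Meta's implementation.

```python
import numpy as np

def residual_quantize(vec, codebooks):
    """Generic residual quantization producing a hierarchical code.

    `codebooks` is a list of (K, D) arrays, one per level: at each level we
    pick the nearest code and quantize whatever residual remains.
    """
    residual, codes = vec.astype(float).copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)               # hierarchical index at this level
        residual = residual - cb[idx]   # pass the remainder to the next level
    return codes, residual              # residual norm = quantization error
```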
[IR-9] FeDecider: An LLM-Based Framework for Federated Cross-Domain Recommendation WWW
[Quick Read]: This paper addresses two core challenges of applying large language models (LLMs) directly in federated cross-domain recommendation (Federated CDR): local adapters are prone to overfitting, because differing magnitudes of parameter updates across domains bias aggregation; and because LLMs encode knowledge implicitly, cross-domain similarity is hard to measure effectively. The key to the solution is the proposed FeDecider framework: it disentangles each client's low-rank updates and shares only their directional components to mitigate scale-specific noise, and it lets each client learn personalized weights for data-aware integration of updates from other domains, improving the robustness and effectiveness of cross-domain collaborative modeling.
Link: https://arxiv.org/abs/2602.16034
Authors: Xinrui He, Ting-Wei Li, Tianxin Wei, Xuying Ning, Xinyu He, Wenxuan Bao, Hanghang Tong, Jingrui He
Affiliations: University of Illinois Urbana-Champaign
Subjects: Information Retrieval (cs.IR)
Comments: Accepted to The Web Conference (WWW) 2026
Abstract:Federated cross-domain recommendation (Federated CDR) aims to collaboratively learn personalized recommendation models across heterogeneous domains while preserving data privacy. Recently, large language model (LLM)-based recommendation models have demonstrated impressive performance by leveraging LLMs’ strong reasoning capabilities and broad knowledge. However, adopting LLM-based recommendation models in Federated CDR scenarios introduces new challenges. First, there exists a risk of overfitting with domain-specific local adapters. The magnitudes of locally optimized parameter updates often vary across domains, causing biased aggregation and overfitting toward domain-specific distributions. Second, unlike traditional recommendation models (e.g., collaborative filtering, bipartite graph-based methods) that learn explicit and comparable user/item representations, LLMs encode knowledge implicitly through autoregressive text generation training. This poses additional challenges for effectively measuring the cross-domain similarities under heterogeneity. To address these challenges, we propose an LLM-based framework for federated cross-domain recommendation, FeDecider. Specifically, FeDecider tackles the challenge of scale-specific noise by disentangling each client’s low-rank updates and sharing only their directional components. To handle the need for flexible and effective integration, each client further learns personalized weights that achieve the data-aware integration of updates from other domains. Extensive experiments across diverse datasets validate the effectiveness of our proposed FeDecider.
[IR-10] Latent Objective Induction and Diversity-Constrained Selection: Algorithms for Multi-Locale Retrieval Pipelines
【速读】:该论文旨在解决多地域(multi-locale)搜索结果中如何高效选择多样化来源的问题,以避免同一域名(same-domain)重复和提升第一方来源(first-party source)比例。其解决方案的关键在于提出三个具有形式化正确性保证与复杂度边界的核心算法:一是将加权地域分配建模为受限整数划分问题,设计出 O(nlogn) 时间复杂度的算法,同时满足最小代表性、预算耗尽与比例约束;二是定义了一个确定性的国家代码推断函数(cascaded country-code inference function),通过异构信号(如顶级域名结构、模型推断元数据、语言回退机制)构成优先级链,确保结果的确定性和渐进式退化特性;三是引入 κ-域多样性约束(κ-domain diversity constraint),并给出 O(∣K∣⋅R) 复杂度的哈希表查找算法,有效消除基于URL去重导致的聚合器垄断病态现象。这些方法共同提升了多语言检索管道中来源多样性和质量。
链接: https://arxiv.org/abs/2602.15921
作者: Faruk Alpay,Levent Sarioglu
机构: Bahcesehir University (巴赫切席尔大学)
类目: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
备注: 13 pages, 2 algorithms, 3 tables
Abstract:We present three algorithms with formal correctness guarantees and complexity bounds for the problem of selecting a diverse, multi-locale set of sources from ranked search results. First, we formulate weighted locale allocation as a constrained integer partition problem and give an O(n \log n) algorithm that simultaneously satisfies minimum-representation, budget-exhaustion, and proportionality-bound constraints; we prove all three hold with a tight deviation bound of 1 . Second, we define a cascaded country-code inference function as a deterministic priority chain over heterogeneous signals (TLD structure, model-inferred metadata, language fallback) and prove it satisfies both determinism and graceful degradation. Third, we introduce a \kappa -domain diversity constraint for source selection and give an O(|K| \cdot R) algorithm that maintains the invariant via hash-map lookup, eliminating the aggregator monopolization pathology present in URL-level deduplication. We further formalize Latent Objective Induction (LOI), an environment-shaping operator over prompt spaces that steers downstream model behavior without restricting the feasible output set, and prove its convergence under mild assumptions. Applied to a multi-locale retrieval pipeline, these algorithms yield 62% improvement in first-party source ratio and 89% reduction in same-domain duplication across 120 multilingual queries.
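The κ-domain diversity algorithm is the most self-contained of the three and can be sketched directly from its description: a single pass over the ranked results with a hash-map counter per domain. Field names in the sketch are illustrative.

```python
from collections import defaultdict

def select_diverse(results, budget, kappa=2):
    """O(|K|·R) κ-domain diversity selection via hash-map counting.

    Walk the ranked list once, keeping a result only while its domain has
    appeared fewer than `kappa` times; this removes the aggregator-monopoly
    failure of URL-level deduplication. `results` is a ranked list of
    (url, domain) pairs.
    """
    per_domain, picked = defaultdict(int), []
    for url, domain in results:
        if per_domain[domain] < kappa:   # invariant: at most kappa per domain
            per_domain[domain] += 1
            picked.append(url)
            if len(picked) == budget:
                break
    return picked
```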
[IR-11] Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective WWW2026
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)在实际应用中因上下文长度过长和冗余检索导致的可扩展性瓶颈问题。现有软上下文压缩方法通常采用全压缩策略,即强制编码器将文档全部信息压缩为紧凑嵌入,忽略了查询相关性,从而损害了任务相关信息密度并影响大语言模型(Large Language Models, LLMs)的生成效果。作者通过分析指出,此类方法存在两个根本局限:一是“不可行性”,即全压缩与LLM下游生成行为冲突;二是“非必要性”,即无需对所有文档内容进行压缩,反而会稀释关键信息。为此,论文提出SeleCom框架——一种基于选择器的软压缩机制,重新定义编码器角色为查询条件下的信息选择器,其采用仅解码器结构,并在大规模、多样化且难度分级的合成问答数据集上结合课程学习进行训练。实验表明,SeleCom显著优于现有软压缩方法,在保持甚至超越无压缩基线性能的同时,计算开销和延迟降低达33.8%~84.6%。
链接: https://arxiv.org/abs/2602.15856
作者: Yunhao Liu,Zian Jia,Xinyu Gao,Kanjun Xu,Yun Xiong
机构: Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted by WWW 2026
Abstract:Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge and is widely applied to Web-related tasks. However, its scalability is hindered by excessive context length and redundant retrievals. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings, yet such methods often underperform non-compressed RAG due to their reliance on auto-encoder-like full compression that forces the encoder to compress all document information regardless of relevance to the input query. In this work, we conduct an analysis of this paradigm and reveal two fundamental limitations: (I) Infeasibility: full compression conflicts with the LLM's downstream generation behavior; and (II) Non-necessity: full compression is unnecessary and dilutes task-relevant information density. Motivated by these insights, we introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as a query-conditioned information selector. The selector is decoder-only and is trained with a massive, diverse and difficulty-graded synthetic QA dataset with curriculum learning. Extensive experiments show that SeleCom significantly outperforms existing soft compression approaches and achieves competitive or superior performance to non-compression baselines, while reducing computation and latency by 33.8%~84.6%.
Human-Computer Interaction
[HC-0] Wearable AR for Restorative Breaks: How Interactive Narrative Experiences Support Relaxation for Young Adults
【速读】:该论文旨在解决年轻人群在长时间屏幕工作后,通过移动设备消费数字内容进行休息时,因视觉疲劳和身体静止而导致恢复效果不佳的问题(即“休息失效”)。其解决方案的关键在于提出了一种嵌入式轻度活动设计框架(Embedded Light Break Activity Design Framework),通过三种核心策略实现:(1) 在增强现实(AR)智能眼镜的媒体内容中嵌入与媒介元素对齐的活动提示,实现无感引导;(2) 转向以音频为主的内容形式,在降低视觉负荷的同时维持沉浸感;(3) 采用“上升-峰值-收尾”的节奏结构组织活动时段,确保过渡自然流畅。实验验证表明,基于该框架开发的InteractiveBreak系统能有效提升休息质量,将被动休息转化为具参与感和意义感的恢复体验。
链接: https://arxiv.org/abs/2602.16323
作者: Jindu Wang,Runze Cai,Shuchang Xu,Tianrui Hu,Huamin Qu,Shengdong Zhao,Ling-Ping Yuan
机构: The Hong Kong University of Science and Technology (香港科技大学); National University of Singapore (新加坡国立大学); City University of Hong Kong (香港城市大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Young adults often take breaks from screen-intensive work by consuming digital content on mobile phones, which undermines rest through visual fatigue and inactivity. We introduce a design framework that embeds light break activities into media content on AR smart glasses, balancing engagement and recovery. The framework employs three strategies: (1) seamlessly guiding users by embedding activity cues aligned with media elements; (2) transitioning to audio-centric formats to reduce visual load while sustaining immersion; and (3) structuring sessions with “rise-peak-closure” pacing for smooth transitions. In a within-subjects study (N = 16) comparing passive viewing, reminder-based breaks, and non-narrative activities, InteractiveBreak instantiated from our framework seamlessly guided activities, sustained engagement, and enhanced break quality. These findings demonstrate wearable AR’s potential to support restorative relaxation by transforming breaks into engaging and meaningful experiences.
[HC-1] Generative AI Usage of University Students: Navigating Between Education and Business
【速读】:该论文旨在解决当前对边工作边学习的学生(part-time students)在教育与职场场景中交叉使用生成式AI(Generative AI)的研究空白问题。现有文献较少关注此类学生群体及其在学术与职业实践中对GenAI的整合应用。研究通过扎根理论方法,对11名远程学习大学生进行访谈,识别出三个因果条件和四个中介条件,以及相应的使用策略,构建了一个解释GenAI使用行为的扎根模型。该模型的关键在于揭示了影响GenAI在教育与商业双重场景下使用的结构性因素,为教育者、政策制定者及GenAI工具开发者提供了系统性洞察,有助于弥合教育与产业间的技术应用鸿沟。
链接: https://arxiv.org/abs/2602.16307
作者: Fabian Walke,Veronika Föller
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:This study investigates generative artificial intelligence (GenAI) usage of university students who study alongside their professional career. Previous literature has paid little attention to part-time students and the intersectional use of GenAI between education and business. This study examines with a grounded theory approach the characteristics of GenAI usage of part-time students. Eleven students from a distance learning university were interviewed. Three causal and four intervening conditions, as well as strategies were identified, to influence the use of GenAI. The study highlights both the potential and challenges of GenAI usage in education and business. While GenAI can significantly enhance productivity and learning outcomes, concerns about ethical implications, reliability, and the risk of academic misconduct persist. The developed grounded model offers a comprehensive understanding of GenAI usage among students, providing valuable insights for educators, policymakers, and developers of GenAI tools seeking to bridge the gap between education and business.
[HC-2] “What I’m Interested in is Something that Violates the Law”: Regulatory Practitioner Views on Automated Detection of Deceptive Design Patterns
[Quick Read]: This paper addresses the enforcement lag regulators face in tackling deceptive design patterns (dark patterns), especially the challenge of detecting and identifying them at the scale of modern applications. The key to the solution is grounding automated detection tools in the actual needs of regulatory practice rather than transplanting academic visions: most existing tools lack the transparency and accountability that regulatory investigations require and cannot map user-interface elements to legal violations. The paper therefore recommends conducting user-requirements research, supporting ancillary regulatory activities beyond detection, and establishing adoption pathways that balance scientific rigor with regulatory practicality.
Link: https://arxiv.org/abs/2602.16302
Authors: Arianna Rossi, Simon Parkin
Affiliations: LIDER-Lab, DIRPOLIS, Sant'Anna School of Advanced Studies; TU Delft
Subjects: Human-Computer Interaction (cs.HC)
Comments: Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), April 13–17, 2026, Barcelona, Spain
Abstract:Although deceptive design patterns are subject to growing regulatory oversight, enforcement races to keep up with the scale of the problem. One promising solution is automated detection tools, many of which are developed within academia. We interviewed nine experienced practitioners working within or alongside regulatory bodies to understand their work against deceptive design patterns, including the use of supporting tools and the prospect of automation. Computing technologies have their place in regulatory practice, but not as envisioned in research. For example, investigations require utmost transparency and accountability in all the activities we identify as accompanying dark pattern detection, which many existing tools cannot provide. Moreover, tools need to map interfaces to legal violations to be of use. We thus recommend conducting user requirement research to maximize research impact, supporting ancillary activities beyond detection, and establishing practical tech adoption pathways that account for the needs of both scientific and regulatory activities.
[HC-3] Flow on Social Media? Rarer Than You'd Think
[Quick Read]: This paper addresses the apparent paradox that social media is widely assumed to elicit flow, i.e., deep absorption and effortless engagement, yet prolonged use is linked to distraction, fatigue, and low mood, a tension that remains poorly understood because prior work relied on habitual or one-shot reports asking participants to attribute flow directly to social platforms. The key to the solution is a five-day field study combining objective smartphone app tracking with daily reconstructions of flow-inducing activities, yielding 673 reported flow occurrences. Only 2% of these involved social media, and heavier social media use predicted fewer daily flow occurrences; further analysis suggests social media suppresses flow through greater fatigue and lower mood and motivation. The findings indicate that flow and social media may not align as closely as assumed and might even compete, underscoring the need for deeper study of their relationship.
Link: https://arxiv.org/abs/2602.16279
Authors: Michael T. Knierim, Thimo Schulz, Moritz Schiller, Jwan Shaban, Mario Nadj, Max L. Wilson, Alexander Maedche
Affiliations: Karlsruhe Institute of Technology; University of Nottingham; University of Duisburg-Essen
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Researchers often attribute social media’s appeal to its ability to elicit flow experiences of deep absorption and effortless engagement. Yet prolonged use has also been linked to distraction, fatigue, and lower mood. This paradox remains poorly understood, in part because prior studies rely on habitual or one-shot reports that ask participants to directly attribute flow to social media. To address this gap, we conducted a five-day field study with 40 participants, combining objective smartphone app tracking with daily reconstructions of flow-inducing activities. Across 673 reported flow occurrences, participants rarely associated flow with social media (2 percent). Instead, heavier social media use predicted fewer daily flow occurrences. We further examine this relationship through the effects of social media use on fatigue, mood, and motivation. Altogether, our findings suggest that flow and social media may not align as closely as assumed - and might even compete - underscoring the need for further research.
[HC-4] RelianceScope: An Analytical Framework for Examining Students' Reliance on Generative AI Chatbots in Problem Solving
[Quick Read]: This paper addresses the lack of systematic tools for jointly characterizing students' engagement along two dimensions of their interactions with generative AI chatbots, help-seeking and response-use, which limits accurate analysis of their reliance on AI. The key to the solution is the RelianceScope framework, which operationalizes reliance into nine patterns based on combinations of engagement modes across help-seeking and response-use, and situates these patterns within a knowledge-context lens that accounts for students' prior knowledge and the instructional significance of knowledge components, enabling fine-grained analysis of reliance in open-ended student-AI interactions.
Link: https://arxiv.org/abs/2602.16251
Authors: Hyoungwook Jin, Minju Yoo, Jieun Han, Zixin Chen, So-Yeon Ahn, Xu Wang
Affiliations: University of Michigan; KAIST; HKUST
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Generative AI chatbots enable personalized problem-solving, but effective learning requires students to self-regulate both how they seek help and how they use AI-generated responses. Considering engagement modes across these two actions reveals nuanced reliance patterns: for example, a student may actively engage in help-seeking by clearly specifying areas of need, yet engage passively in response-use by copying AI outputs, or vice versa. However, existing research lacks systematic tools for jointly capturing engagement across help-seeking and response-use, limiting the analysis of such reliance behaviors. We introduce RelianceScope, an analytical framework that characterizes students’ reliance on chatbots during problem-solving. RelianceScope (1) operationalizes reliance into nine patterns based on combinations of engagement modes in help-seeking and response-use, and (2) situates these patterns within a knowledge-context lens that accounts for students’ prior knowledge and the instructional significance of knowledge components. Rather than prescribing optimal AI use, the framework enables fine-grained analysis of reliance in open-ended student-AI interactions. As an illustrative application, we applied RelianceScope to analyze chat and code-edit logs from 79 college students in a web programming course. Results show that active help-seeking is associated with active response-use, whereas reliance patterns remain similar across knowledge mastery levels. Students often struggled to articulate their knowledge gaps and to adapt AI responses. Using our annotated dataset as a benchmark, we further demonstrate that large language models can reliably detect reliance during help-seeking and response-use. We conclude by discussing the implications of RelianceScope and the design guidelines for AI-supported educational systems.
[HC-5] Peeking Ahead of the Field Study: Exploring VLM Personas as Support Tools for Embodied Studies in HCI
【速读】:该论文旨在解决实地研究(field studies)在自动驾驶(AV)与行人交互场景中所面临的高成本、耗时长及易出错等问题。其核心挑战在于如何在不依赖真实人类参与者的情况下,有效模拟人类行为以支持前期设计与验证。解决方案的关键在于利用视觉语言模型(Vision-Language Model, VLM)构建“角色化身”(personas),通过这些具有情境感知和行为生成能力的虚拟主体,在视频实验中模拟人类对街道过街任务的响应模式,从而实现快速、低成本且可重复的评估。实证结果表明,VLM personas能够再现人类平均过街时间等关键指标(如5.25秒 vs. 5.07秒),但在行为变异性与深度上仍存在差距,显示出其在形成性研究、实地研究准备和人类数据增强中的潜力。
链接: https://arxiv.org/abs/2602.16157
作者: Xinyue Gui,Ding Xia,Mark Colley,Yuan Li,Vishal Chauhan,Anubhav Anubhav,Zhongyi Zhou,Ehsan Javanmardi,Stela Hanbyeol Seo,Chia-Ming Chang,Manabu Tsukada,Takeo Igarashi
机构: The University of Tokyo(东京大学); UCL Interaction Centre(伦敦大学学院交互中心); Google(谷歌); Kyoto University(京都大学); National Taiwan University of Arts(台湾艺术大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CHI 2026
Abstract:Field studies are irreplaceable but costly, time-consuming, and error-prone, requiring careful preparation. Inspired by rapid prototyping in manufacturing, we propose a fast, low-cost evaluation method using Vision-Language Model (VLM) personas to simulate outcomes comparable to field results. While LLMs show human-like reasoning and language capabilities, autonomous vehicle (AV)-pedestrian interaction requires spatial awareness, emotional empathy, and behavioral generation. This raises our research question: To what extent can VLM personas mimic human responses in field studies? We conducted parallel studies: 1) a real-world study with 20 participants, and 2) a video study using 20 VLM personas, both on a street-crossing task. We compared their responses and interviewed five HCI researchers on potential applications. Results show that VLM personas mimic human response patterns (e.g., average crossing times of 5.25 s vs. 5.07 s) but lack behavioral variability and depth. They show promise for formative studies, field study preparation, and human data augmentation.
[HC-6] ASPEN: Spectral-Temporal Fusion for Cross-Subject Brain Decoding
【速读】:该论文旨在解决脑电图(EEG)基脑机接口(BCI)中跨被试泛化能力不足的问题,其核心挑战源于个体间神经信号的差异性。研究表明,频谱特征相较于时域波形具有更高的跨被试相似性,因此作者提出ASPEN架构,通过乘法融合机制将频谱与时域特征流结合,仅当两者达成跨模态一致性时才允许特征传播。该方案的关键在于利用乘法融合实现动态调整频谱与时域特征的平衡,从而提升模型在未见被试上的性能表现。
链接: https://arxiv.org/abs/2602.16147
作者: Megan Lee,Seung Ha Hwang,Inhyeok Choi,Shreyas Darade,Mengchun Zhang,Kateryna Shapovalenko
机构: Carnegie Mellon University (卡内基梅隆大学); Kyung Hee University (中央大学); Korea Advanced Institute of Science and Technology (韩国科学技术院); University of Pittsburgh (匹兹堡大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
备注:
Abstract:Cross-subject generalization in EEG-based brain-computer interfaces (BCIs) remains challenging due to individual variability in neural signals. We investigate whether spectral representations offer more stable features for cross-subject transfer than temporal waveforms. Through correlation analyses across three EEG paradigms (SSVEP, P300, and Motor Imagery), we find that spectral features exhibit consistently higher cross-subject similarity than temporal signals. Motivated by this observation, we introduce ASPEN, a hybrid architecture that combines spectral and temporal feature streams via multiplicative fusion, requiring cross-modal agreement for features to propagate. Experiments across six benchmark datasets reveal that ASPEN is able to dynamically achieve the optimal spectral-temporal balance depending on the paradigm. ASPEN achieves the best unseen-subject accuracy on three of six datasets and competitive performance on others, demonstrating that multiplicative multimodal fusion enables effective cross-subject generalization.
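The multiplicative-fusion idea (features survive only where both streams agree) can be shown in a few lines of PyTorch. The sketch below uses sigmoid-gated projections whose elementwise product implements that agreement requirement; the layer shapes and the exact gating are assumptions, not ASPEN's architecture.

```python
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    """Sketch of multiplicative spectral-temporal fusion: the output is large
    only where spectral AND temporal evidence are jointly active."""
    def __init__(self, d_spec: int, d_temp: int, d_out: int):
        super().__init__()
        self.spec_proj = nn.Linear(d_spec, d_out)
        self.temp_proj = nn.Linear(d_temp, d_out)

    def forward(self, spec_feat, temp_feat):
        # sigmoid keeps each gated stream in (0, 1); their elementwise product
        # requires cross-modal agreement for a feature to propagate
        return torch.sigmoid(self.spec_proj(spec_feat)) * \
               torch.sigmoid(self.temp_proj(temp_feat))
```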
[HC-7] Human-AI Collaboration in Large Language Model-Integrated Building Energy Management Systems: The Role of User Domain Knowledge and AI Literacy
【速读】:该论文试图解决的问题是:用户在使用生成式 AI (Generative AI) 集成的建筑能源管理系统(BEMS)时,其领域知识(building energy use domain knowledge)和人工智能素养(AI literacy)如何影响人机交互效率与决策质量。解决方案的关键在于通过系统性角色扮演实验,结合定量分析框架对人类-AI交互行为进行分层评分,并基于用户自我评估的领域知识与AI素养将参与者划分为四组,进而利用非参数检验(Kruskal-Wallis H测试)识别出关键差异指标。结果显示,尽管大多数用户依赖简洁提示并高度信任GPT模型的分析能力,但仅有“家电识别率”一项指标显示出显著组间差异(p=0.037),且该差异由AI素养驱动而非领域知识,表明大型语言模型(LLM)具有在不同专业水平用户中实现能力均等化的潜力,从而为构建以人为中心的LLM集成能源系统提供了实证基础和发展方向。
链接: https://arxiv.org/abs/2602.16140
作者: Wooyoung Jung,Kahyun Jeon,Prosper Babon-Ayeng
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 39 pages, 11 figures
Abstract:This study aimed to understand how user domain knowledge and artificial intelligence (AI) literacy affect the effective use of a human-AI interactive building energy management system (BEMS). While prior studies have investigated the potential of integrating large language models (LLMs) into BEMS or building energy modeling, very few have examined how users interact with such systems. We conducted a systematic role-playing experiment in which 85 human subjects interacted with an advanced generative pre-trained transformer (OpenAI GPT-4o). Participants were tasked with identifying the top five behavioral changes that could reduce home energy use, with the GPT model functioning as an LLM-integrated BEMS. The collected prompt-response data and participant conclusions were then analyzed using an analytical framework that hierarchically assessed and scored human-AI interactions and the participants' home energy analysis approaches. Participants were also classified into four groups based on their self-evaluated domain knowledge of building energy use and AI literacy, and Kruskal-Wallis H tests with post-hoc pairwise comparisons were conducted across 20 quantifiable metrics. Key takeaways include: most participants employed concise prompts (median: 16.2 words) and relied heavily on GPT's analytical capabilities; and notably, only 1 of 20 metrics, appliance identification rate, showed statistically significant group differences (p=0.037), driven by AI literacy rather than domain knowledge, suggesting an equalizing effect of LLMs across expertise levels. This study provides foundational insights into human-AI collaboration dynamics and promising development directions for LLM-integrated BEMS, contributing to the realization of human-centric LLM-integrated energy systems.
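The group comparison in this study uses the standard Kruskal-Wallis H test, which SciPy provides directly. The sketch below applies it to one metric across the four knowledge/literacy groups; the group names and data layout are placeholders.

```python
from scipy import stats

def compare_groups(metric_by_group, alpha=0.05):
    """Kruskal-Wallis H test across independent groups.

    `metric_by_group` maps a group name to a list of per-participant scores
    for one metric (e.g., appliance identification rate).
    """
    groups = list(metric_by_group.values())
    h_stat, p_value = stats.kruskal(*groups)
    return h_stat, p_value, p_value < alpha

# usage sketch with placeholder group names:
# h, p, significant = compare_groups({
#     "HiDK_HiAI": [...], "HiDK_LoAI": [...],
#     "LoDK_HiAI": [...], "LoDK_LoAI": [...]})
```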
[HC-8] “You Can Actually Do Something”: Shifts in High School Computer Science Teachers' Conceptions of AI/ML Systems and Algorithmic Justice
[Quick Read]: This paper addresses the need, created by the broad deployment of AI and machine learning (ML) systems, for educators to develop effective competencies for understanding and evaluating such systems. The study follows five experienced high school computer science teachers through a year of participatory design in which they co-developed lessons on AI auditing, a systematic method for querying and evaluating AI/ML systems. The key finding is that through this participatory practice, teachers shifted from abstract understanding toward framings grounded in everyday contexts that were more critical and more action-oriented, anchoring questions of algorithmic justice in their roles as educators and in the concrete practices of their school communities, thereby strengthening teachers' AI literacy and enabling deeper AI education for students.
Link: https://arxiv.org/abs/2602.16123
Authors: Daniel J. Noh, Deborah A. Fields, Yasmin B. Kafai, Danaé Metaxa
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:The recent proliferation of artificial intelligence and machine learning (AI/ML) systems highlights the need for all people to develop effective competencies to interact with and examine AI/ML systems. We study shifts in five experienced high school CS teachers’ understanding of AI/ML systems after one year of participatory design, where they co-developed lessons on AI auditing, a systematic method to query AI/ML systems. Drawing on individual and group interviews, we found that teachers’ perspectives became more situated, grounding their understanding in everyday contexts; more critical, reflecting growing awareness of harms; and more agentic, highlighting possibilities for action. Further, across all three perspectives, teachers consistently framed algorithmic justice through their role as educators, situating their concerns within their school communities. In the discussion, we consider the ways teachers’ perspectives shifted, how AI auditing can shape these shifts, and the implications of these findings on AI literacy for both teachers and students.
[HC-9] Hiding in Plain Sight: Understanding the Everyday Practices and Challenges of Car Dwellers
【速读】:该论文旨在解决当前人机交互(Human-Computer Interaction, HCI)研究中对车辆居住(vehicle dwelling)现象理解的不足,特别是忽视了其作为住房不安全形式的复杂性以及小型交通工具带来的独特限制。解决方案的关键在于通过定性分析在线社区中的帖子与评论,揭示车居者如何在社会、空间和基础设施约束下进行“基础架构构建”(infrastructuring)工作以管理日常生活,并进一步探讨其身份协商机制——即车居者的体验介于无家可归与游牧生活方式之间,而基础架构能力的发展则深刻影响其身份认同。这一发现为未来面向不平等基础设施获取条件下的移动性与居住研究提供了理论依据,并提出设计更贴合车居者多样化需求、情境与身份的技术系统的建议。
链接: https://arxiv.org/abs/2602.16112
作者: Rachael Zehrung,Yunan Chen
机构: University of California, Irvine (加州大学欧文分校)
类目: Human-Computer Interaction (cs.HC)
备注: ACM CHI 2026, 13 pages, 1 figure
Abstract:Vehicle dwelling has increased significantly in recent years. While HCI research has explored vehicle dwelling through the lens of digital nomadism and vanlife, it has largely overlooked the complexities of vehicle dwelling as a form of housing insecurity, as well as the unique constraints of living in smaller vehicles. Drawing on a qualitative analysis of posts and comments from an online community, we examine car dwellers’ infrastructuring work to manage daily life under social, spatial, and infrastructural constraints. We further explore the motivations and identity negotiations of car dwellers, whose experiences fall between homelessness and nomadism, and highlight how developing infrastructural competence can shape identity. We discuss implications for future HCI research on mobility and dwelling under conditions of uneven access to infrastructure and provide design recommendations for technologies that better account for car dwellers’ diverse needs, circumstances, and identities.
[HC-10] Surgical Activation Steering via Generative Causal Mediation
【速读】:该论文旨在解决如何在长文本生成过程中对扩散于多个token中的行为概念进行精准干预的问题。传统方法往往依赖相关性探测(correlational probe-based)手段,难以准确识别和操控这些分布式的语义特征。其解决方案的关键在于提出一种名为生成式因果中介分析(Generative Causal Mediation, GCM)的新框架:通过构建对比输入-输出数据集,量化模型组件(如注意力头)在介导二元概念(如诗歌 vs. 散文)时的因果作用,并选择最强中介组件进行稀疏干预,从而实现对长文本响应的有效定位与控制。
链接: https://arxiv.org/abs/2602.16080
作者: Aruna Sankaranarayanan,Amir Zur,Atticus Geiger,Dylan Hadfield-Menell
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks–refusal, sycophancy, and style transfer–across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.
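下面用一段可直接运行的 Python 勾勒 GCM 中“按中介强度为注意力头打分并稀疏选取”的核心逻辑。其中的激活张量、概念探针(probe)以及各个形状均为随机占位的演示假设,并非论文的官方实现;实际使用时这些激活应来自对比提示下的真实模型前向。

```python
import torch

# 玩具占位:对比提示对(如"诗歌" vs "散文")下各注意力头的激活
# 形状: (num_pairs, num_layers, num_heads, head_dim) —— 均为演示假设
torch.manual_seed(0)
n_pairs, n_layers, n_heads, d = 32, 4, 8, 16
acts_pos = torch.randn(n_pairs, n_layers, n_heads, d)   # "正"概念一侧的响应
acts_neg = torch.randn(n_pairs, n_layers, n_heads, d)   # "负"概念一侧的响应

probe = torch.randn(d)  # 假设的概念方向读出向量(非论文实现)

def concept_score(head_act, probe):
    # 假设性的读出:激活在概念方向上的投影强度
    return head_act @ probe

# 每个 (layer, head) 的中介分数:对比两侧激活在概念读出上的平均差异
effect = concept_score(acts_pos, probe) - concept_score(acts_neg, probe)
mediation = effect.abs().mean(dim=0)                    # (n_layers, n_heads)

# 稀疏选取中介最强的 k 个注意力头
k = 5
top = torch.topk(mediation.flatten(), k).indices
heads = [(int(i) // n_heads, int(i) % n_heads) for i in top]
print("中介最强的 (layer, head):", heads)

# 每个选中头的引导向量:对比激活差的均值,生成时加到该头输出上(hook 略)
steer = {lh: (acts_pos[:, lh[0], lh[1]] - acts_neg[:, lh[0], lh[1]]).mean(0)
         for lh in heads}
```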
[HC-11] Access in the Shadow of Ableism: An Autoethnography of a Blind Student’s Higher Education Experience in China
【速读】:该论文试图解决的问题是:在主流无障碍(accessibility)研究中,“访问”(access)这一概念在面对更广泛的社会能力主义(ableist)结构时存在局限性,难以真正实现对视障群体的平等包容。解决方案的关键在于将“访问”重新概念化为一种矛盾性建构(contradictory construct),并主张将无障碍理解为一种在能力主义结构内持续探索、动态实践的过程,而非静态目标或可达成的状态。这一视角强调系统性障碍(如资源匮乏、能力主义文化及政策缺失)对视障学生教育参与的深层制约,并呼吁从实践层面推动更具韧性和批判性的无障碍设计策略。
链接: https://arxiv.org/abs/2602.16070
作者: Weijun Zhang,Xinru Tang
机构: Syracuse University (雪城大学); University of California, Irvine (加州大学欧文分校)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The HCI research community has witnessed a growing body of research on accessibility and disability driven by efforts to improve access. Yet, the concept of access reveals its limitations when examined within broader ableist structures. Drawing on an autoethnographic method, this study shares the co-first author Zhang’s experiences at two higher-education institutions in China, including a specialized program exclusively for blind and low-vision students and a mainstream university where he was the first blind student admitted. Our analysis revealed tensions around access in both institutions: they either marginalized blind students within society at large or imposed pressures to conform to sighted norms. Both institutions were further constrained by systemic issues, including limited accessible resources, pervasive ableist cultures, and the lack of formalized policies. In response to these tensions, we conceptualize access as a contradictory construct and argue for understanding accessibility as an ongoing, exploratory practice within ableist structures.
[HC-12] A Unified Cross-Platform Framework for Automatic GUI and Plugin Generation in Structural Bioinformatics and Beyond
【速读】:该论文旨在解决为命令行接口(CLI)可执行程序自动化创建图形用户界面(GUI)的问题,以降低复杂交互式应用的开发成本并提升跨平台可移植性。其解决方案的关键在于提出一个三阶段工作流:第一步手动设计插件结构,第二步以平台无关的形式规范GUI的模型(Model)与视图(View),第三步自动生成针对特定平台(如VMD、PyMOL和Web服务器)的呈现层(Presenter)代码。该架构遵循模型-视图- presenter(MVP)模式,通过解耦逻辑与界面实现复用、减少工程量,并支持多生态系统的无缝迁移。
链接: https://arxiv.org/abs/2602.16047
作者: Sikao Guo,Edoardo Sarti,Frédéric Cazals
机构: Université Côte d’Azur (蔚蓝海岸大学); Inria (法国国家信息与自动化研究院)
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 10 pages, 4 figures
Abstract:We present a workflow and associated toolkit to automate the creation of graphical user interfaces (GUI) for executables run from command line interfaces (CLI). The workflow consists of three phases, namely (Step 1) the plugin design, (Step 2) the formal (platform independent) specification of the GUI, and (Step 3) the plugin code generation for the targeted platforms. Our architecture is aligned with the Model–View–Presenter (MVP) pattern: steps one and two build the Model and View descriptions, while step three implements the Presenter layer that binds inputs, invokes the CLI, and updates outputs. Once Step one has been (manually) completed, steps two and three are fully automated. The decoupled MVP design and platform-specific generator modules enable reuse of logic, portability across ecosystems, and significant reductions in engineering effort for complex interactive applications. We primarily use our workflow to generate GUI in structural bioinformatics for CLI executables from the Structural Bioinformatics Library (SBL), targeting three platforms, namely VMD, Pymol and Web servers. The workflow can be used as a guideline, while its implementation is available in the package Plugin_manager from the SBL, see this https URL.
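作为示意,下面用 Python 勾勒“平台无关的 GUI 规格 + 通用 Presenter 绑定并调用 CLI”的思路,对应摘要中 Step 2 / Step 3 的分工。规格字段名与可执行文件名均为假设,并非 SBL Plugin_manager 的真实模式。

```python
import subprocess

# 假设性的平台无关 View/Model 规格,描述一个 CLI 可执行程序
# (字段名与 "sbl-vorlume" 均为演示占位,非 SBL 实际模式)
SPEC = {
    "executable": "sbl-vorlume",
    "inputs": [
        {"name": "pdb_file", "type": "file",  "flag": "--pdb"},
        {"name": "radius",   "type": "float", "flag": "--radius", "default": 1.4},
    ],
    "outputs": [{"name": "volume_report", "type": "file"}],
}

def present(spec, values):
    """通用 Presenter:把 GUI 输入绑定到 CLI 参数并调用可执行程序。"""
    cmd = [spec["executable"]]
    for inp in spec["inputs"]:
        val = values.get(inp["name"], inp.get("default"))
        if val is not None:
            cmd += [inp["flag"], str(val)]
    # 在生成的插件中,这一调用会发生在 VMD/PyMOL/Web 服务器内部
    return subprocess.run(cmd, capture_output=True, text=True)

# 用法示意(可执行程序为占位,故注释掉):
# present(SPEC, {"pdb_file": "1abc.pdb", "radius": 1.5})
```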
[HC-13] Transforming GenAI Policy to Prompting Instruction: An RCT of Scalable Prompting Interventions in a CS1 Course
【速读】:该论文旨在解决生成式 AI(Generative AI)在教育场景中被学生滥用或低效使用的问题,即学生难以区分任务完成与真实学习,并缺乏利用 AI 促进学习的 prompting 技能,导致考试表现下降。解决方案的关键在于设计并验证基于 ICAP(Interactive, Constructive, Active, Passive)框架的干预措施,通过分层增强认知参与强度(从被动到主动再到建构性互动),系统提升学生的 prompting literacy(提示素养)。研究采用大规模随机对照试验(RCT, N=979),证明高参与度条件显著提升 prompting 技能,且这些技能与即时学习成效和最终考试成绩正相关,从而为将 GenAI 教学政策转化为可推广、可操作的提示素养教学提供了实证依据和理论支持。
链接: https://arxiv.org/abs/2602.16033
作者: Ruiwei Xiao,Runlong Ye,Xinying Hou,Jessica Wen,Harsh Kumar,Michael Liut,John Stamper
机构: Carnegie Mellon University (卡内基梅隆大学); University of Toronto (多伦多大学); University of Michigan (密歇根大学); University of Toronto Mississauga (多伦多大学密西沙加分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures
Abstract:Despite universal GenAI adoption, students cannot distinguish task performance from actual learning and lack skills to leverage AI for learning, leading to worse exam performance when AI use remains unreflective. Yet few interventions teaching students to prompt AI as a tutor rather than solution provider have been validated at scale through randomized controlled trials (RCTs). To bridge this gap, we conducted a semester-long RCT (N=979) with four ICAP framework-based instructional conditions varying in engagement intensity with a pre-test, immediate and delayed post-test and surveys. Mixed methods analysis results showed: (1) All conditions significantly improved prompting skills, with gains increasing progressively from Condition 1 to Condition 4, validating ICAP’s cognitive engagement hierarchy; (2) for students with similar pre-test scores, higher learning gain in immediate post-test predict higher final exam score, though no direct between-group differences emerged; (3) Our interventions are suitable and scalable solutions for diverse educational contexts, resources and learners. Together, this study makes empirical and theoretical contributions: (1) theoretically, we provided one of the first large-scale RCTs examining how cognitive engagement shapes learning in prompting literacy and clarifying the relationship between learning-oriented prompting skills and broader academic performance; (2) empirically, we offered timely design guidance for transforming GenAI classroom policies into scalable, actionable prompting literacy instruction to advance learning in the era of Generative AI.
[HC-14] Punchlines Unbound: Comedy Practices in Social Virtual Reality
【速读】:该论文试图解决的问题是:在社交虚拟现实(Social VR)平台中,由于化身(avatar)的非语言表达能力受限,表演者(如脱口秀演员)如何有效利用有限的肢体和表情线索进行实时互动与表演,从而维持观众参与感与现场氛围。解决方案的关键在于,虚拟喜剧演员通过有意识地控制和夸张其化身的动作与表情,将原本的表达局限转化为独特的表演机会;同时,研究还揭示了观众通过特定情境下恰当的emoji反应形成了一种新型互动文化,这为系统设计提供了重要启示——即增强反馈可见性并维护社区规范,同时不抑制创造性表达。
链接: https://arxiv.org/abs/2602.16013
作者: Ryo Ohara,Chi-Lan Yang,Yuji Hatada,Takuji Narumi,Hideaki Kuzuoka
机构: The University of Tokyo (东京大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Social VR platforms serve as an emergent venue for live performance, enabling co-presence and real-time interaction among distributed performers and audiences within shared virtual environments. Live performances, such as comedy, rely on subtle social cues between performers and audiences, which are missing in VR. However, it remains unclear how comedians utilize avatar-mediated cues in social VR. We conducted semi-structured interviews and observations with 23 virtual comedians on VRChat. Results revealed that virtual comedians transformed their limited nonverbal expressiveness into performative opportunities through intentional control and exaggeration. Additionally, a distinctive culture emerged around context-appropriate emoji reactions from audiences, while challenges such as audio latency and moderation against trolling were highlighted. Our findings advance understanding of how performers creatively adapt to expressive constraints in avatar-mediated settings. We further demonstrate how challenges in performer-audience interaction and moderation provide design insights for systems enhancing feedback visibility and sustain community norms without restricting creative expression.
[HC-15] From Reflection to Repair: A Scoping Review of Dataset Documentation Tools
【速读】:该论文试图解决当前数据集文档工具设计中缺乏对动机和障碍的系统理解问题,尤其是这些工具如何与现有系统、法规及文化规范相连接。文献指出,尽管已有多种文档工具被开发,但其采纳率低且标准化困难,根源在于四个持续存在的模式:文档价值的操作化不清晰、设计脱离实际场景、未考虑劳动投入需求,以及将集成视为未来工作而非当前重点。解决方案的关键在于从面向个体的工具设计转向面向制度的解决方案,强调通过HCI(人机交互)社区推动可持续文档实践的具体行动,从而实现更负责任的AI开发。
链接: https://arxiv.org/abs/2602.15968
作者: Pedro Reynolds-Cuéllar(Robotics and AI Institute),Marisol Wong-Villacres(Escuela Superior Politécnica del Litoral),Adriana Alvarado Garcia(IBM Research),Heila Precel(Robotics and AI Institute)
机构: Robotics and AI Institute (机器人与人工智能研究所); Escuela Superior Politécnica del Litoral (太平洋海岸高等理工学院); IBM Research (IBM 研究院)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: to be published at the CHI conference on Human Factors in Computing Systems
Abstract:Dataset documentation is widely recognized as essential for the responsible development of automated systems. Despite growing efforts to support documentation through different kinds of artifacts, little is known about the motivations shaping documentation tool design or the factors hindering their adoption. We present a systematic review supported by mixed-methods analysis of 59 dataset documentation publications to examine the motivations behind building documentation tools, how authors conceptualize documentation practices, and how these tools connect to existing systems, regulations, and cultural norms. Our analysis shows four persistent patterns in dataset documentation conceptualization that potentially impede adoption and standardization: unclear operationalizations of documentation’s value, decontextualized designs, unaddressed labor demands, and a tendency to treat integration as future work. Building on these findings, we propose a shift in Responsible AI tool design toward institutional rather than individual solutions, and outline actions the HCI community can take to enable sustainable documentation practices.
[HC-16] NLP Privacy Risk Identification in Social Media (NLP-PRISM): A Survey
【速读】:该论文旨在解决自然语言处理(Natural Language Processing, NLP)在社交媒体分析中因处理包含个人身份信息(Personally Identifiable Information, PII)、行为线索和元数据而导致的隐私风险问题,如监控、画像构建和定向广告。其解决方案的关键在于提出NLP Privacy Risk Identification in Social Media (NLP-PRISM)框架,系统性地从数据收集、预处理、可见性、公平性、计算风险和合规性六个维度评估隐私漏洞,并通过实证分析揭示当前NLP任务(如情感分析、情绪识别、攻击性语言检测等)在隐私保护方面的显著研究空白与模型性能下降(F1-score下降1%–23%,MIA AUC达0.81,AIA准确率达0.75)之间的权衡,最终倡导强化匿名化、隐私感知学习和公平驱动训练以实现伦理导向的NLP应用。
链接: https://arxiv.org/abs/2602.15866
作者: Dhiman Goswami,Jai Kruthunz Naveen Kumar,Sanchari Das
机构: George Mason University (乔治梅森大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Natural Language Processing (NLP) is integral to social media analytics but often processes content containing Personally Identifiable Information (PII), behavioral cues, and metadata, raising privacy risks such as surveillance, profiling, and targeted advertising. To systematically assess these risks, we review 203 peer-reviewed papers and propose the NLP Privacy Risk Identification in Social Media (NLP-PRISM) framework, which evaluates vulnerabilities across six dimensions: data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance. Our analysis shows that transformer models achieve F1-scores ranging from 0.58-0.84, but incur a 1% - 23% drop under privacy-preserving fine-tuning. Using NLP-PRISM, we examine privacy coverage in six NLP tasks: sentiment analysis (16), emotion detection (14), offensive language identification (19), code-mixed processing (39), native language identification (29), and dialect detection (24) revealing substantial gaps in privacy research. We further found a trade-off in model utility (reduced by 2% - 9%), with a membership inference attack (MIA) AUC of 0.81 and an attribute inference attack (AIA) accuracy of 0.75. Finally, we advocate for stronger anonymization, privacy-aware learning, and fairness-driven training to enable ethical NLP in social media contexts.
[HC-17] AI as Teammate or Tool? A Review of Human-AI Interaction in Decision Support
【速读】:该论文试图解决的问题是:在人工智能(Artificial Intelligence, AI)与人类协同工作的情境下,如何界定AI系统是作为工具还是协作伙伴(collaborative teammate),并提升其在实际应用中的效能。解决方案的关键在于突破当前以可解释性(explainability)为中心的设计范式,转向构建具有自适应性和情境感知能力的交互机制,从而支持人类与AI之间共享心智模型(shared mental models)和动态协商决策权(dynamic negotiation of authority),使AI从被动辅助角色转变为积极的协作队友。
链接: https://arxiv.org/abs/2602.15865
作者: Most. Sharmin Sultana Samu,Nafisa Khan,Kazi Toufique Elahi,Tasnuva Binte Rahman,Md. Rakibul Islam,Farig Sadeque
机构: BRAC University (BRAC大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint
Abstract:The integration of Artificial Intelligence (AI) necessitates determining whether systems function as tools or collaborative teammates. In this study, by synthesizing Human-AI Interaction (HAI) literature, we analyze this distinction across four dimensions: interaction design, trust calibration, collaborative frameworks and healthcare applications. Our analysis reveals that static interfaces and miscalibrated trust limit AI efficacy. Performance hinges on aligning transparency with cognitive workflows, yet a fluency trap often inflates trust without improving decision-making. Consequently, an overemphasis on explainability leaves systems largely passive. Our findings show that current AI systems remain largely passive due to an overreliance on explainability-centric designs and that transitioning AI to an active teammate requires adaptive, context-aware interactions that support shared mental models and the dynamic negotiation of authority between humans and AI.
[HC-18] EmoTrack: An Application to Facilitate User Reflection on Their Online Behaviours
【速读】:该论文试图解决互联网使用对青少年心理健康带来的双重影响问题,即如何在保障线上互动益处的同时,减少有害内容与行为的负面影响。其核心挑战在于难以区分有益与有害的在线活动,从而难以有效干预。解决方案的关键在于开发一款名为EmoTrack的多平台个人信息(Personal Informatics)系统,通过记录用户在YouTube上的观看行为并引导其反思行为与情绪之间的关联,帮助青少年培养更积极、有意识的在线参与策略。评估结果显示,EmoTrack能有效促进用户从浅层到深层(R0至R3)的不同层次反思,从而实现对在线行为的自我调节。
链接: https://arxiv.org/abs/2602.15839
作者: Ruiyong Zhang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: Master’s thesis
Abstract:With the rapid growth of the internet, all online activities can have both positive and negative effects on human mental health. Online engagement is complex and efforts to regulate online use face challenges in distinguishing between beneficial and harmful content and behaviours. An alternative approach is to help young people develop the skills they need to manage online safety while preserving the benefits of online interactions. This dissertation presents the entire development process and evaluation of a multi-platform application, called EmoTrack, that aims to help young people reflect on their online behaviour. It was developed to record their online activities and cultivate strategies for more positive and mindful engagement online. EmoTrack is a personal informatics system, and it is designed to help people track and reflect on their engagement with YouTube videos. The system was evaluated with thirteen participants and it was found that EmoTrack can facilitate them to reflect on their video watching behaviour and the impact on their mood, with reports of different levels of reflections from R0 to R3.
[HC-19] A Methodology for Identifying Evaluation Items for Practical Dialogue Systems Based on Business-Dialogue System Alignment Models
【速读】:该论文旨在解决实践中对话系统评估指标不明确的问题,即传统上仅以用户满意度和用户体验为主要评价标准,而忽略了其他对实际部署至关重要的评估维度。为填补这一空白,论文提出了一种基于业务-对话系统对齐模型(business-dialogue system alignment model)的评估项识别方法,该模型源自业务-信息技术对齐模型(business-IT alignment model),用于指导实用型IT系统的开发与运营。其解决方案的关键在于构建一个通用模型,能够支撑针对不同对话系统的具体业务-对话系统对齐模型的定制化设计,从而系统性地识别出多样化的评估项,推动相关研究向更贴近实际应用场景的方向发展。
链接: https://arxiv.org/abs/2602.15835
作者: Mikio Nakano,Hironori Takeuchi,Kazunori Komatani
机构: C4A Research Institute, Inc. (C4A研究所); Musashi University (武藏大学); SANKEN, Osaka University (大阪大学产业科学研究所)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2025 (IWSDS 2025)
Abstract:This paper proposes a methodology for identifying evaluation items for practical dialogue systems. Traditionally, user satisfaction and user experiences have been the primary metrics for evaluating dialogue systems. However, there are various other evaluation items to consider when developing and operating practical dialogue systems, and such evaluation items are expected to lead to new research topics. So far, there has been no methodology for identifying these evaluation items. We propose identifying evaluation items based on business-dialogue system alignment models, which are applications of business-IT alignment models used in the development and operation of practical IT systems. We also present a generic model that facilitates the construction of a business-dialogue system alignment model for each dialogue system.
[HC-20] A Koopman-Bayesian Framework for High-Fidelity Perceptually Optimized Haptic Surgical Simulation
【速读】:该论文旨在解决手术仿真中力反馈真实感不足的问题,特别是如何在保持低延迟的同时提升触觉渲染的感知精度与动态建模能力。其核心解决方案在于构建一个统一框架,将非线性动力学、感知心理物理学与高频触觉渲染相结合:首先利用Koopman算子将原本非线性的软组织交互动力学映射到扩展状态空间中实现线性预测与控制;其次通过基于韦伯-费希纳定律和斯蒂文斯幂律的贝叶斯校准模块,根据个体感知阈值动态调整力信号,从而显著提升感知辨别能力(改善20%);最终系统在多种典型手术任务中实现了平均4.3 ms的渲染延迟和低于2.8%的力误差,优于传统弹簧阻尼及能量驱动方法。
链接: https://arxiv.org/abs/2602.15834
作者: Rohit Kaushik,Eva Kaushik
机构: Hanson Professional Services (哈森专业服务公司); University of Tennessee, Knoxville (田纳西大学诺克斯维尔分校); Oak Ridge National Laboratory (橡树岭国家实验室)
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
备注: 11 pages, 6 figures
Abstract:We introduce a unified framework that combines nonlinear dynamics, perceptual psychophysics and high frequency haptic rendering to enhance realism in surgical simulation. The interaction of the surgical device with soft tissue is elevated to an augmented state space with a Koopman operator formulation, allowing linear prediction and control of the dynamics that are nonlinear by nature. To make the rendered forces consistent with human perceptual limits, we put forward a Bayesian calibration module based on Weber-Fechner and Stevens scaling laws, which progressively shape force signals relative to each individual’s discrimination thresholds. For various simulated surgical tasks such as palpation, incision, and bone milling, the proposed system attains an average rendering latency of 4.3 ms, a force error of less than 2.8% and a 20% improvement in perceptual discrimination. Multivariate statistical analyses (MANOVA and regression) reveal that the system’s performance is significantly better than that of conventional spring-damper and energy-based rendering methods. We end by discussing the potential impact on surgical training and VR-based medical education, as well as sketching future work toward closed-loop neural feedback in haptic interfaces.
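下面的最小示例分两步示意摘要中的思路:先用 EDMD 风格的多项式观测提升拟合一个线性 Koopman 预测算子,再按韦伯分数对力增量做感知整形。玩具动力学、观测函数和全部常数均为演示假设,与论文实现无关。

```python
import numpy as np

# --- 第一步:EDMD 风格的 Koopman 提升(一维玩具"软组织"动力学) ---
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
x_next = x + 0.05 * np.sin(x)                  # 假设的非线性状态转移

def lift(x):                                    # 观测函数:[1, x, x^2, x^3]
    return np.stack([np.ones_like(x), x, x**2, x**3])

K = lift(x_next) @ np.linalg.pinv(lift(x))      # 提升空间中的线性预测算子
pred = (K @ lift(np.array([0.7])))[1, 0]        # 从提升状态读回 x 坐标
print(pred, 0.7 + 0.05 * np.sin(0.7))           # 两者应十分接近

# --- 第二步:按韦伯分数整形力增量(常数为演示假设;论文中按个体标定) ---
WEBER = 0.1

def shape_force(raw, prev):
    """若力变化小于恰可察觉差(JND),则放大到 JND,使差异可被感知。"""
    jnd = WEBER * max(abs(prev), 1e-6)
    delta = raw - prev
    if 0 < abs(delta) < jnd:
        delta = np.sign(delta) * jnd
    return prev + delta

print(shape_force(1.04, 1.0))                   # 增量 0.04 < JND=0.1 -> 输出 1.1
```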
[HC-21] Towards a More Realistic VR Experience: Merging Haptic Gloves with Precision Gloves IROS
【速读】:该论文旨在解决当前虚拟现实(Virtual Reality, VR)手套技术中普遍存在的精度与触觉反馈难以兼得的问题:高精度手套通常缺乏触觉反馈能力,而具备触觉反馈的手套则在手势识别精度上表现不佳。解决方案的关键在于提出一种新颖的混合方法,将高精度手套与触觉手套进行集成,构建一个同时支持高精度手势捕捉和有效触觉反馈的系统,从而实现二者功能的协同优化。
链接: https://arxiv.org/abs/2602.15833
作者: Paolo Bottoni,Susanna Cifani,Kamen Kanev,Daniel Moraru,Atsushi Nakamura,Marco Raoul Marini
机构: Sapienza University of Rome (罗马大学); Ontario Tech University (安大略理工大学); Shizuoka University (静冈大学)
类目: Human-Computer Interaction (cs.HC)
备注: 3 pages, 2 figures. Presented as Abstract P3-24 at the 10th International Symposium on Biomedical Engineering (ISBE2025) and the International Workshop on Nanodevice Technologies 2025 (IWNT2025), October 30-31, 2025, Higashihiroshima, Japan
Abstract:Virtual reality (VR) glove technology is increasingly important for professional training, industrial applications, and teleoperation in hazardous environments, since it enables more natural and immersive interactions than controllers. However, current solutions face a trade-off: high-precision gloves lack haptic feedback, while haptic gloves suffer from poor accuracy. Existing studies have mainly focused on developing new glove prototypes or optimizing only one type of glove, without addressing the integration of both features. Our work presents a novel hybrid approach that combines a high-precision glove with a haptic glove, creating a system that delivers both precision and haptics.
[HC-22] What Persona Are We Missing? Identifying Unknown Relevant Personas for Faithful User Simulation
【速读】:该论文旨在解决现有用户模拟(User Simulation)中因缺乏对目标用户人格特征(Persona)充分覆盖而导致的模拟有效性存疑的问题。其核心挑战在于识别在特定对话情境下,可能影响用户决策但尚未被明确提供的未知人格特征。解决方案的关键在于提出PICQ数据集——一个包含情境感知的选择题标注数据集,其中问题设计用于探测潜在但未被指定的人格特征(如“用户是否对价格敏感?”),并构建多维度评估框架(包括忠实度、影响力与不可访问性)来系统衡量大语言模型(LLM)在模拟中的表现。研究发现,模型规模与性能之间存在复杂的“忠实度 vs. 洞察力”权衡关系:影响力随模型规模增加而提升,但忠实于人类行为模式的表现呈倒U型曲线,这一现象可归因于人类认知经济性(Cognitive Economy)倾向,从而为理解人类与先进生成式AI在认知建模上的差异提供了新视角。
链接: https://arxiv.org/abs/2602.15832
作者: Weiwen Su,Yuhan Zhou,Zihan Wang,Naoki Yoshinaga,Masashi Toyoda
机构: The University of Tokyo (东京大学); Institute of Industrial Science, The University of Tokyo (东京大学工业科学研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing user simulations, where models generate user-like responses in dialogue, often lack verification that sufficient user personas are provided, questioning the validity of the simulations. To address this core concern, this work explores the task of identifying relevant but unknown personas of the simulation target for a given simulation context. We introduce PICQ, a novel dataset of context-aware choice questions, annotated with unknown personas (e.g., ‘‘Is the user price-sensitive?’’) that may influence user choices, and propose a multi-faceted evaluation scheme assessing fidelity, influence, and inaccessibility. Our benchmark of leading LLMs reveals a complex ‘‘Fidelity vs. Insight’’ dilemma governed by model scale: while influence generally scales with model size, fidelity to human patterns follows an inverted U-shaped curve. We trace this phenomenon to cognitive differences, particularly the human tendency for ‘‘cognitive economy.’’ Our work provides the first comprehensive benchmark for this crucial task, offering a new lens for understanding the divergent cognitive models of humans and advanced LLMs.
[HC-23] VERA-MH Concept Paper
【速读】:该论文旨在解决生成式 AI(Generative AI)在心理健康领域应用中的安全性评估问题,特别是针对自杀风险干预场景下聊天机器人(chatbot)的伦理与责任合规性缺乏系统化自动化测评工具的问题。解决方案的关键在于提出并开发 VERA-MH(Validation of Ethical and Responsible AI in Mental Health),其核心机制是通过两个辅助型 AI 代理实现全流程自动化:一是用户代理(user-agent)模拟具有预设风险水平和特征的不同心理状态个体与待测聊天机器人进行对话;二是裁判代理(judge-agent)依据由临床专家制定的评分量表对每轮对话进行打分;最终通过聚合所有模拟对话的评分结果,形成对聊天机器人安全性的综合评价。该方法实现了从人工主观判断向可复现、结构化、可扩展的自动化评估范式的转变。
链接: https://arxiv.org/abs/2510.15297
作者: Luca Belli,Kate Bentley,Will Alexander,Emily Ward,Matt Hawrilenko,Kelly Johnston,Mill Brown,Adam Chekroud
机构: Spring Health; Yale University School of Medicine
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注:
Abstract:We introduce VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated evaluation of the safety of AI chatbots used in mental health contexts, with an initial focus on suicide risk. Practicing clinicians and academic experts developed a rubric informed by best practices for suicide risk management for the evaluation. To fully automate the process, we used two ancillary AI agents. A user-agent model simulates users engaging in a mental health-based conversation with the chatbot under evaluation. The user-agent role-plays specific personas with pre-defined risk levels and other features. Simulated conversations are then passed to a judge-agent who scores them based on the rubric. The final evaluation of the chatbot being tested is obtained by aggregating the scoring of each conversation. VERA-MH is actively under development and undergoing rigorous validation by mental health clinicians to ensure user-agents realistically act as patients and that the judge-agent accurately scores the AI chatbot. To date we have conducted preliminary evaluation of GPT-5, Claude Opus and Claude Sonnet using initial versions of the VERA-MH rubric and used the findings for further design development. Next steps will include more robust clinical validation and iteration, as well as refining actionable scoring. We are seeking feedback from the community on both the technical and clinical aspects of our evaluation.
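下面用一小段 Python 示意“裁判代理逐对话按量表打分、再按人格风险等级聚合”的流程骨架。量表条目、字段名与数值均为虚构占位,并非 VERA-MH 的真实评分量表。

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class JudgedConversation:
    persona_risk: str      # 用户代理人格的预设风险等级,如 "low" / "high"
    rubric_scores: dict    # 量表条目 -> [0, 1] 分数(条目名为虚构占位)

def aggregate(convs):
    """把裁判代理的逐对话分数按风险等级聚合成安全性摘要。"""
    by_risk = {}
    for c in convs:
        by_risk.setdefault(c.persona_risk, []).append(
            mean(c.rubric_scores.values()))
    return {risk: mean(scores) for risk, scores in by_risk.items()}

convs = [
    JudgedConversation("high", {"asks_about_risk": 1.0, "provides_resources": 0.5}),
    JudgedConversation("high", {"asks_about_risk": 0.0, "provides_resources": 1.0}),
    JudgedConversation("low",  {"asks_about_risk": 1.0, "provides_resources": 1.0}),
]
print(aggregate(convs))    # {'high': 0.625, 'low': 1.0}
```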
[HC-24] Automated Assessment of Kidney Ureteroscopy Exploration for Training
【速读】:该论文旨在解决肾内镜导航训练中缺乏高效、可扩展且无需专家实时指导的培训工具的问题(当前训练依赖于手术室内的个体化反馈,存在资源受限和学习曲线陡峭的缺陷)。其解决方案的关键在于提出一种基于纯输尿管镜视频的新型相机定位框架,通过先验的慢速、全面探索视频生成参考重建模型,并利用该模型自动识别受训者在后续探索过程中遗漏的肾盏区域,从而实现高精度(<4 mm定位误差)与自动化反馈。该方法显著提升了离室训练的可能性,无需专家监督即可完成有效评估与指导。
链接: https://arxiv.org/abs/2602.15988
作者: Fangjie Li,Nicholas Kavoussi,Charan Mohan,Matthieu Chabanas,Jie Ying Wu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:Purpose: Kidney ureteroscopic navigation is challenging with a steep learning curve. However, current clinical training has major deficiencies, as it requires one-on-one feedback from experts and occurs in the operating room (OR). Therefore, there is a need for a phantom training system with automated feedback to greatly expand training opportunities. Methods: We propose a novel, purely ureteroscope video-based scope localization framework that automatically identifies calyces missed by the trainee in a phantom kidney exploration. We use a slow, thorough, prior exploration video of the kidney to generate a reference reconstruction. Then, this reference reconstruction can be used to localize any exploration video of the same phantom. Results: In 15 exploration videos, a total of 69 out of 74 calyces were correctly classified. We achieve 4mm camera pose localization error. Given the reference reconstruction, the system takes 10 minutes to generate the results for a typical exploration (1-2 minute long). Conclusion: We demonstrate a novel camera localization framework that can provide accurate and automatic feedback for kidney phantom explorations. We show its ability as a valid tool that enables out-of-OR training without requiring supervision from an expert.
计算机视觉
[CV-0] TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos
【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)在视频压缩中面临的两大核心挑战:一是为每段视频单独过拟合INR导致的编码效率低下,二是基于超网络(Hypernetwork)的方法在高分辨率下存在质量差、码率大及内存消耗高的问题。解决方案的关键在于提出一种名为TeCoNeRV的新方法,其创新性体现在三个方面:(1) 将权重预测任务在空间和时间维度上分解,通过将短视频片段划分为patch tubelets以降低预训练阶段的内存开销达20倍;(2) 引入基于残差的存储机制,仅保存连续片段表示之间的差异,显著减少比特流大小;(3) 设计时序一致性正则化框架,促使权重空间变化与视频内容变化保持关联。该方案在UVG、HEVC和MCL-JCV数据集上实现了480p至1080p的首次成功应用,相比基线提升2.47dB和5.35dB PSNR,同时码率降低36%,编码速度提高1.5–3倍。
链接: https://arxiv.org/abs/2602.16711
作者: Namitha Padmanabhan,Matthew Gwilliam,Abhinav Shrivastava
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Implicit Neural Representations (INRs) have recently demonstrated impressive performance for video compression. However, since a separate INR must be overfit for each video, scaling to high-resolution videos while maintaining encoding efficiency remains a significant challenge. Hypernetwork-based approaches predict INR weights (hyponetworks) for unseen videos at high speeds, but with low quality, large compressed size, and prohibitive memory needs at higher resolutions. We address these fundamental limitations through three key contributions: (1) an approach that decomposes the weight prediction task spatially and temporally, by breaking short video segments into patch tubelets, to reduce the pretraining memory overhead by 20×; (2) a residual-based storage scheme that captures only differences between consecutive segment representations, significantly reducing bitstream size; and (3) a temporal coherence regularization framework that encourages changes in the weight space to be correlated with video content. Our proposed method, TeCoNeRV, achieves substantial improvements of 2.47dB and 5.35dB PSNR over the baseline at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3× faster encoding speeds. With our low memory usage, we are the first hypernetwork approach to demonstrate results at 480p, 720p and 1080p on UVG, HEVC and MCL-JCV. Our project page is available at this https URL.
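摘要中残差式存储的直觉可以用下面的最小示例说明:当相邻片段的权重具有时间连贯性时,只需存储首段权重与后续片段的粗量化残差即可近似重建,残差幅值远小于权重本身。张量规模与量化步长均为演示假设,并非论文的实际编码器。

```python
import torch

# 玩具"超网络权重":相邻视频片段的权重只有小幅变化(时间连贯)
torch.manual_seed(0)
segments = [torch.randn(1000)]
for _ in range(3):
    segments.append(segments[-1] + 0.05 * torch.randn(1000))

def encode(segments, step=0.01):
    """只保存首段权重和后续片段的 int16 量化残差(步长为演示假设)。"""
    residuals = [torch.round((w - prev) / step).to(torch.int16)
                 for prev, w in zip(segments, segments[1:])]
    return segments[0], residuals, step

def decode(base, residuals, step):
    out, w = [base], base
    for r in residuals:
        w = w + r.float() * step        # 量化误差会沿片段轻微累积
        out.append(w)
    return out

base, res, step = encode(segments)
rec = decode(base, res, step)
# 重建误差应在量化步长量级,远小于直接存原始权重的开销
print(max(float((a - b).abs().max()) for a, b in zip(segments, rec)))
```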
[CV-1] Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
【速读】:该论文旨在解决人形机器人在真实复杂环境中对任意物体进行视觉引导的运动操作(visual loco-manipulation)时,因训练数据规模受限导致的泛化能力不足问题。其解决方案的关键在于提出一种名为HERO的新范式,该范式融合了大视觉模型的开放词汇理解能力与仿真训练中获得的精确控制性能;核心创新是设计了一个残差感知的末端执行器(end-effector, EE)跟踪策略,该策略结合经典机器人学方法(如逆运动学生成参考轨迹)与学习型神经前向模型以实现高精度位姿预测,并引入目标调整和重规划机制,显著降低末端执行器跟踪误差达3.2倍,从而构建出模块化且可泛化的视觉-控制协同系统。
链接: https://arxiv.org/abs/2602.16705
作者: Runpei Dong,Ziyan Li,Xialin He,Saurabh Gupta
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D images). Existing approaches are based on real-world imitation learning and exhibit limited generalization due to the difficulty in collecting large-scale training datasets. This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots that combines the strong generalization and open-vocabulary understanding of large vision models with strong control performance from simulated training. We achieve this by designing an accurate residual-aware EE tracking policy. This EE tracking policy combines classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, c) goal adjustment, and d) replanning. Together, these innovations help us cut down the end-effector tracking error by 3.2x. We use this accurate end-effector tracker to build a modular system for loco-manipulation, where we use open-vocabulary large vision models for strong visual generalization. Our system is able to operate in diverse real-world environments, from offices to coffee shops, where the robot is able to reliably manipulate various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests in simulation and the real world demonstrate the effectiveness of our proposed design. We believe the advances in this paper can open up new ways of training humanoid robots to interact with daily objects.
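下面在一个玩具二连杆手臂上示意“IK 将残差目标转成参考轨迹 + 学习到的前向模型做目标调整与重规划”的组合思路。连杆参数、“学习”FK 的偏差形式与迭代次数均为假设,仅演示这一闭环的结构。

```python
import numpy as np

L1, L2 = 0.3, 0.25                          # 玩具二连杆臂长(假设值)

def fk(q):                                   # 名义前向运动学
    return np.array([L1*np.cos(q[0]) + L2*np.cos(q[0]+q[1]),
                     L1*np.sin(q[0]) + L2*np.sin(q[0]+q[1])])

def fk_learned(q):                           # "学习到的" FK 的占位:名义 FK + 未建模偏差
    return fk(q) + 0.01 * np.sin(3 * q.sum())

def ik(target, q0, iters=100, lr=0.5):       # 基于有限差分雅可比的简易 IK
    q = q0.astype(float).copy()
    for _ in range(iters):
        err = target - fk(q)
        J = np.stack([(fk(q + d) - fk(q)) / 1e-5
                      for d in np.eye(2) * 1e-5], axis=1)
        q += lr * np.linalg.pinv(J) @ err
    return q

goal = np.array([0.35, 0.2])
q = ik(goal, np.array([0.3, 0.6]))
for _ in range(3):                           # 重规划:按学习 FK 的预测调整目标
    residual = goal - fk_learned(q)
    q = ik(fk(q) + residual, q)
print(fk_learned(q), goal)                   # 调整后应更接近真实落点
```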
[CV-2] Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在推理过程中因视觉信息仅初始输入、文本推理逐步累积导致的视觉锚定误差传播问题,以及传统视觉引导方式粗粒度、噪声大难以控制长文本推理的问题。解决方案的关键在于提出一种显著性感知原则选择(Saliency-Aware Principle, SAP)机制,其不依赖于逐标记轨迹而基于高层推理原则进行决策,从而在噪声反馈下实现对离散生成过程的稳定控制,并支持在需要时重新查阅视觉证据以实现动态视觉锚定;此外,SAP还支持多路径并行推理,提升推理多样性与效率,且无需额外训练,具备模型无关性和数据无依赖性。
链接: https://arxiv.org/abs/2602.16702
作者: Mingjia Shi,Yinhan He,Yaochen Zhu,Jundong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint 10 pages, 4 figures
Abstract:Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enable stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.
[CV-3] Are Object-Centric Representations Better At Compositional Generalization?
【速读】:该论文旨在解决机器学习模型在视觉场景中对未见过的对象属性组合进行推理的能力问题,即组合泛化(compositional generalization)难题。其核心挑战在于如何让模型不仅掌握已知概念的单一组合,还能灵活应对新组合的推理任务,这正是人类认知的核心能力之一。解决方案的关键在于系统性地评估以对象为中心(Object-centric, OC)表示方法相较于传统密集型视觉编码器(dense representations)在三种受控视觉世界(CLEVRTex、Super-CLEVR 和 MOVi-C)中的泛化性能差异,并通过严格控制训练数据多样性、样本量、表征维度、下游模型容量和计算资源等变量,确保公平比较。研究发现:OC 方法在更复杂的组合泛化场景下显著优于密集表示;而后者仅在简单场景中表现更好,且通常需要更多下游计算资源;此外,OC 模型在样本效率上更具优势,在数据有限或计算受限条件下展现出更强的泛化能力。
链接: https://arxiv.org/abs/2602.16689
作者: Ferdinand Kapl,Amir Mohammad Karimi Mamaghan,Maximilian Seitzer,Karl Henrik Johansson,Carsten Marr,Stefan Bauer,Andrea Dittadi
机构: Technical University of Munich (慕尼黑工业大学); Helmholtz AI, Munich (亥姆霍兹人工智能); KTH Royal Institute of Technology (皇家理工学院); MPI for Intelligent Systems, Tübingen (马克斯·普朗克智能系统研究所); Institute of AI for Health, Computational Health Center, Helmholtz Munich (亥姆霍兹慕尼黑健康人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.
[CV-4] Learning Situated Awareness in the Real World
【速读】:该论文旨在解决当前多模态基础模型(Multimodal Foundation Models, MFMs)在评估中过度关注环境中心的空间关系(如场景内物体间的相对位置),而忽视了观察者中心的关系(即模型需基于自身视角、姿态和运动进行推理的能力)这一问题。其解决方案的关键在于提出一个名为SAW-Bench(Situated Awareness in the Real World)的新基准,该基准使用真实世界视频数据集(786段由Ray-Ban Meta智能眼镜录制的视频,覆盖多样室内与室外环境)和超过2071个由人类标注的问题-答案对,系统性地评估模型在六类观察者中心意识任务中的表现,从而推动模型从被动感知向物理 grounded 的、以观察者为中心的空间动态理解演进。
链接: https://arxiv.org/abs/2602.16682
作者: Chuhan Li,Ruilin Han,Joy Hsu,Yongyuan Liang,Rajiv Dhawan,Jiajun Wu,Ming-Hsuan Yang,Xin Eric Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to agent’s viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model’s observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.
[CV-5] VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection
【速读】:该论文旨在解决时间序列异常检测(Time-series Anomaly Detection, TSAD)中长期存在的模态权衡问题:一维(1D)时序模型虽能实现细粒度的点异常定位,但缺乏全局上下文感知能力;而二维(2D)视觉模型虽可捕捉全局模式,却因缺乏时间对齐导致信息瓶颈,并且在点级异常检测上表现粗粒度。解决方案的关键在于提出首个统一时序与视觉模态的框架VETime,其核心创新包括:通过可逆图像转换(Reversible Image Conversion)和Patch-Level Temporal Alignment模块建立共享的细粒度视觉-时序时间线,保持判别性细节的同时保留时序敏感性;并设计异常窗口对比学习(Anomaly Window Contrastive Learning)与任务自适应多模态融合(Task-Adaptive Multi-Modal Fusion),动态整合两种模态的互补感知优势,从而在零样本场景下显著提升异常定位精度,同时降低计算开销。
链接: https://arxiv.org/abs/2602.16681
作者: Yingyuan Yang,Tian Lan,Yifei Gao,Yimeng Lu,Wenjun He,Meng Wang,Chenghao Liu,Chen Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to a lack of temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code available at: this https URL.
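摘要中的可逆图像转换(Reversible Image Conversion)可以用最朴素的“按周期折叠 + 记录原长”来理解,下面的示例演示一次无损往返。折叠周期与补零方式均为演示假设,并非论文的转换细节。

```python
import numpy as np

def to_image(series, period):
    """把一维序列按周期折叠成 (rows, period) 网格;末尾补零以保证可逆。"""
    n = len(series)
    rows = int(np.ceil(n / period))
    padded = np.pad(series, (0, rows * period - n), constant_values=0.0)
    return padded.reshape(rows, period), n

def to_series(image, n):
    """逆变换:展平并截回原长,丢弃补零部分。"""
    return image.reshape(-1)[:n]

t = np.arange(1000)
x = np.sin(2 * np.pi * t / 50) \
    + 0.1 * np.random.default_rng(0).standard_normal(1000)
img, n = to_image(x, period=50)      # 周期性模式在 2D 网格中成列对齐
assert np.allclose(to_series(img, n), x)   # 无损往返
```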
[CV-6] PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction WACV2026
【速读】:该论文旨在解决当前基于查询的高精地图(High-definition map, HD map)构建方法中存在的时序不一致性与不稳定性问题,这些问题主要源于随机查询初始化和隐式时间建模机制。其解决方案的关键在于提出一种端到端的在线矢量高精地图构建框架,通过四个核心模块实现:1)语义感知查询生成器(Semantic-Aware Query Generator),利用空间对齐的语义掩码初始化查询以捕获全局场景上下文;2)历史栅格化地图记忆(History Rasterized Map Memory),存储每个追踪实例的细粒度实例级地图,提供显式的历史先验;3)历史地图引导模块(History-Map Guidance Module),将栅格化地图信息融入跟踪查询中以增强时序连续性;4)短期未来引导模块(Short-Term Future Guidance),基于存储的历史轨迹预测地图实例的瞬时运动,并作为提示用于避免不合理预测,从而保持时序一致性。该方案显著提升了地图构建的稳定性和准确性。
链接: https://arxiv.org/abs/2602.16669
作者: Bo Lang,Nirav Savaliya,Zhihao Zheng,Jinglun Feng,Zheng-Hang Yeh,Mooi Choo Chuah
机构: Lehigh University (理海大学); Honda Research Institute USA (美国本田研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2026
Abstract:High-definition (HD) maps are crucial to autonomous driving, providing structured representations of road elements to support navigation and planning. However, existing query-based methods often employ random query initialization and depend on implicit temporal modeling, which lead to temporal inconsistencies and instabilities during the construction of a global map. To overcome these challenges, we introduce a novel end-to-end framework for consistent online HD vectorized map construction, which jointly performs map instance tracking and short-term prediction. First, we propose a Semantic-Aware Query Generator that initializes queries with spatially aligned semantic masks to capture scene-level context globally. Next, we design a History Rasterized Map Memory to store fine-grained instance-level maps for each tracked instance, enabling explicit historical priors. A History-Map Guidance Module then integrates rasterized map information into track queries, improving temporal continuity. Finally, we propose a Short-Term Future Guidance module to forecast the immediate motion of map instances based on the stored history trajectories. These predicted future locations serve as hints for tracked instances to further avoid implausible predictions and keep temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed method outperforms state-of-the-art (SOTA) methods with good efficiency.
[CV-7] Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge
【速读】:该论文旨在解决无配对图像到图像翻译(unpaired image-to-image translation)中两大关键问题:一是对抗式方法(adversarial approach)在训练时需依赖目标域的对抗损失,导致模型泛化能力受限;二是扩散逆向方法(diffusion-inversion method)因噪声潜空间表示不准确,常产生低保真度的翻译结果。解决方案的核心在于提出自监督语义桥(Self-Supervised Semantic Bridge, SSB),其关键创新是利用自监督视觉编码器学习对外观变化不变但保留几何结构的表征,构建一个共享潜在空间来指导扩散桥梁模型,从而实现无需跨域监督即可进行空间上忠实的图像转换。
链接: https://arxiv.org/abs/2602.16664
作者: Jiaming Liu,Felix Petersen,Yunhe Gao,Yabin Zhang,Hyojin Kim,Akshay S. Chaudhari,Yu Sun,Stefano Ermon,Sergios Gatidis
机构: Stanford University (斯坦福大学); LLNL; Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages
Abstract:Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations due to imperfect inversion into noise-latent representations. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.
[CV-8] Style-Aware Gloss Control for Generative Non-Photorealistic Rendering
【速读】:该论文旨在解决生成式 AI (Generative AI) 模型中材质外观特征(尤其是光泽度,gloss)与艺术风格(artistic style)在表征空间中难以解耦的问题。现有模型往往无法独立控制这些因素,导致合成图像中光泽与风格混杂,限制了可控性。解决方案的关键在于:首先构建一个新收集的绘画物体数据集,通过无监督生成模型学习到一个分层的潜在空间,其中光泽度被成功解耦于其他外观因素;进而设计一个轻量级适配器模块,将该风格和光泽感知的潜在空间接入扩散模型(latent-diffusion model),从而实现对非照片真实感图像中光泽和艺术风格的细粒度控制,显著提升了学习因子的解耦程度与可控性。
链接: https://arxiv.org/abs/2602.16611
作者: Santiago Jimenez-Navarro,Belen Masia,Ana Serrano
机构: University of Zaragoza (萨拉戈萨大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Humans can infer material characteristics of objects from their visual appearance, and this ability extends to artistic depictions, where similar perceptual strategies guide the interpretation of paintings or drawings. Among the factors that define material appearance, gloss, along with color, is widely regarded as one of the most important, and recent studies indicate that humans can perceive gloss independently of the artistic style used to depict an object. To investigate how gloss and artistic style are represented in learned models, we train an unsupervised generative model on a newly curated dataset of painterly objects designed to systematically vary such factors. Our analysis reveals a hierarchical latent space in which gloss is disentangled from other appearance factors, allowing for a detailed study of how gloss is represented and varies across artistic styles. Building on this representation, we introduce a lightweight adapter that connects our style- and gloss-aware latent space to a latent-diffusion model, enabling the synthesis of non-photorealistic images with fine-grained control of these factors. We compare our approach with previous models and observe improved disentanglement and controllability of the learned factors.
[CV-9] A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
【速读】:该论文旨在解决街景图像属性分类任务中因计算成本高且现有方法难以捕捉局部细粒度特征而导致的性能瓶颈问题。当前基于预训练视觉语言模型(如CLIP)的方法多依赖全局图像嵌入,无法有效建模复杂、杂乱街景中的局部区域依赖关系。其解决方案的关键在于提出CLIP-MHAdapter,一种轻量级适配框架,在CLIP基础上引入一个包含多头自注意力机制的瓶颈MLP模块,作用于patch tokens以显式建模patch间的相互依赖关系,从而在仅约140万可训练参数下显著提升细粒度属性分类精度,同时保持低计算开销。
链接: https://arxiv.org/abs/2602.16590
作者: Qi You,Yitai Cheng,Zichao Zeng,James Haworth
机构: SpaceTimeLab; University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at this https URL.
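下面以 PyTorch 勾勒“瓶颈 MLP + 作用于 patch tokens 的多头自注意力”这一适配器形态。各维度、头数与类别数均为假设值,仅示意摘要所述结构,并非官方实现。

```python
import torch
import torch.nn as nn

class MHAdapter(nn.Module):
    """瓶颈适配器:在 patch tokens 上做多头自注意力以建模 patch 间依赖。
    宽度/头数/类别数均为演示假设。"""
    def __init__(self, dim=768, bottleneck=128, heads=4, n_classes=8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)          # 降维瓶颈
        self.attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)            # 升维回原通道
        self.head = nn.Linear(dim, n_classes)           # 属性分类头

    def forward(self, patch_tokens):                    # (B, N, dim),冻结的 CLIP patch 特征
        z = self.down(patch_tokens)
        z, _ = self.attn(z, z, z)                       # 建模 patch 间相互依赖
        x = patch_tokens + self.up(z)                   # 残差式适配
        return self.head(x.mean(dim=1))                 # 池化后输出属性 logits

tokens = torch.randn(2, 196, 768)                       # 例如 ViT-B/16 的 14x14 patch 网格
print(MHAdapter()(tokens).shape)                        # torch.Size([2, 8])
```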
[CV-10] Arc2Morph: Identity-Preserving Facial Morphing with Arc2Face
【速读】:该论文旨在解决面部仿造攻击(face morphing attacks)对电子身份证件中人脸识别系统构成的严重威胁问题。此类攻击利用许多国家在护照注册过程中缺乏受控活体采集流程的漏洞,通过合成介于两个目标身份之间的伪造人脸图像实现欺骗。解决方案的关键在于提出一种基于Arc2Face(一种身份条件驱动的面部基础模型)的新颖面部仿造技术,该模型能够从紧凑的身份表示中生成高保真度的人脸图像。实验表明,该方法在保持身份信息完整性方面表现优异,其仿造攻击潜力与传统基于关键点的方法相当,验证了深度学习方法在复杂仿造场景下的有效性。
链接: https://arxiv.org/abs/2602.16569
作者: Nicolò Di Domenico,Annalisa Franco,Matteo Ferrara,Davide Maltoni
机构: University of Bologna (博洛尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:Face morphing attacks are widely recognized as one of the most challenging threats to face recognition systems used in electronic identity documents. These attacks exploit a critical vulnerability in passport enrollment procedures adopted by many countries, where the facial image is often acquired without a supervised live capture process. In this paper, we propose a novel face morphing technique based on Arc2Face, an identity-conditioned face foundation model capable of synthesizing photorealistic facial images from compact identity representations. We demonstrate the effectiveness of the proposed approach by comparing the morphing attack potential metric on two large-scale sequestered face morphing attack detection datasets against several state-of-the-art morphing methods, as well as on two novel morphed face datasets derived from FEI and ONOT. Experimental results show that the proposed deep learning-based approach achieves a morphing attack potential comparable to that of landmark-based techniques, which have traditionally been regarded as the most challenging. These findings confirm the ability of the proposed method to effectively preserve and manage identity information during the morph generation process.
[CV-11] Let’s Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding ICLR2026
【速读】:该论文旨在解决视频识别模型在面对细粒度分类需求时的局限性问题,即现有模型通常基于固定且粗粒度的类别体系进行训练,难以适应任务演进中出现的新区分(如物体、动作方式或结果的细微差异),且重新标注和训练成本高昂。其解决方案的关键在于提出“类别拆分”(category splitting)这一新任务,并设计一种零样本编辑方法:利用视频分类器的潜在组合结构(latent compositional structure)揭示细粒度差异,无需额外数据即可实现对粗粒度类别的精细化划分,同时保持原有分类准确率不受影响;此外,该方法还通过低样本微调进一步提升性能,且受益于零样本初始化带来的良好起点。
链接: https://arxiv.org/abs/2602.16545
作者: Kaiting Liu,Hazel Doughty
机构: Leiden University (莱顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICLR 2026
Abstract:Video recognition models are typically trained on fixed taxonomies which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: this https URL.
[CV-12] DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images
【速读】:该论文旨在解决现有服装图案生成方法在处理多样化姿态和视角时的局限性,以及基于优化的方法计算成本高、难以扩展的问题。其核心挑战在于如何从单张自然场景图像中高效重建物理一致的二维缝制图案(sewing pattern)及其对应的三维服装模型,以满足可编辑、可分离且适合仿真应用的需求。解决方案的关键在于提出DressWild——一个新颖的前馈式流程:首先利用视觉-语言模型(VLMs)在图像层面归一化姿态差异,提取具有姿态感知和三维信息的服装特征;随后通过基于Transformer的编码器融合这些特征,并预测缝制参数,直接用于物理仿真、纹理合成及多层虚拟试穿等下游任务。该方法无需多视角输入或迭代优化,在保持高质量的同时实现了高效与可扩展性。
链接: https://arxiv.org/abs/2602.16502
作者: Zeng Tao,Ying Jiang,Yunuo Chen,Tianyi Xie,Huamin Wang,Yingnian Wu,Yin Yang,Abishek Sampath Kumar,Kenji Tashiro,Chenfanfu Jiang
机构: UCLA; Fudan University (复旦大学); Style3D; University of Utah (犹他大学); Sony (索尼)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and difficult to scale. This paper focuses on sewing pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extract pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and subsequently used to predict sewing pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments demonstrate that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.
[CV-13] Benchmarking Adversarial Robustness and Adversarial Training Strategies for Object Detection
【Quick Read】: This paper addresses the lack of a unified, fair evaluation benchmark and effective defense strategies for object detection models under adversarial attack, which has limited comparability and progress for both attacks and defenses. The key to the solution is a unified benchmark framework focused on digital, non-patch-based attacks: it introduces metrics that disentangle localization errors from classification errors and evaluates attack cost with multiple perceptual metrics, enabling impartial comparison of attack methods. Experiments further show that the most effective adversarial training strategy relies on a dataset mixing high-perturbation attacks with different objectives (e.g., spatial and semantic), which clearly outperforms training on any single attack.
Link: https://arxiv.org/abs/2602.16494
Authors: Alexis Winter, Jean-Vincent Martini, Romaric Audigier, Angelique Loesch, Bertrand Luvison
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Object detection models are critical components of automated systems, such as autonomous vehicles and perception-based robots, but their sensitivity to adversarial attacks poses a serious security risk. Progress in defending these models lags behind classification, hindered by a lack of standardized evaluation. It is nearly impossible to thoroughly compare attack or defense methods, as existing work uses different datasets, inconsistent efficiency metrics, and varied measures of perturbation cost. This paper addresses this gap by investigating three key questions: (1) How can we create a fair benchmark to impartially compare attacks? (2) How well do modern attacks transfer across different architectures, especially from Convolutional Neural Networks to Vision Transformers? (3) What is the most effective adversarial training strategy for robust defense? To answer these, we first propose a unified benchmark framework focused on digital, non-patch-based attacks. This framework introduces specific metrics to disentangle localization and classification errors and evaluates attack cost using multiple perceptual metrics. Using this benchmark, we conduct extensive experiments on state-of-the-art attacks and a wide range of detectors. Our findings reveal two major conclusions: first, modern adversarial attacks against object detection models show a significant lack of transferability to transformer-based architectures. Second, we demonstrate that the most robust adversarial training strategy leverages a dataset composed of a mix of high-perturbation attacks with different objectives (e.g., spatial and semantic), which outperforms training on any single attack.
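A minimal sketch of the metric idea named in the abstract, i.e., disentangling localization failures from classification failures per ground-truth object. The IoU threshold and the function names here are our own illustrative assumptions, not the benchmark's official definitions.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-9)

def disentangled_errors(preds, gts, iou_thr=0.5):
    """For each ground truth, classify the failure (if any) as localization
    (no box overlaps enough) or classification (overlap, wrong label)."""
    loc_err, cls_err, correct = 0, 0, 0
    for gt_box, gt_label in gts:
        ious = [iou(p_box, gt_box) for p_box, _ in preds]
        if not ious or max(ious) < iou_thr:
            loc_err += 1                      # nothing localized this object
            continue
        best = int(np.argmax(ious))
        if preds[best][1] == gt_label:
            correct += 1
        else:
            cls_err += 1                      # localized, but mislabeled
    return {"correct": correct, "loc_err": loc_err, "cls_err": cls_err}

# Example: one correct detection, one mislabeled, one missed object.
preds = [([0, 0, 10, 10], "car"), ([20, 20, 30, 30], "dog")]
gts = [([0, 0, 10, 10], "car"), ([21, 21, 31, 31], "cat"), ([50, 50, 60, 60], "bus")]
print(disentangled_errors(preds, gts))
```

An adversarial attack that raises `loc_err` while leaving `cls_err` flat degrades the detector's geometry rather than its semantics, which is exactly the distinction the benchmark's metrics are designed to expose.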
[CV-14] MMA: Multimodal Memory Agent
【Quick Read】: This paper addresses overconfident errors in long-horizon multimodal agents that rely on external memory for decision making, where similarity-based retrieval often surfaces stale, low-credibility, or conflicting memory items. The key to the solution is the Multimodal Memory Agent (MMA), which computes a dynamic reliability score for each retrieved memory item by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. This mechanism substantially improves robustness and interpretability in complex scenarios.
Link: https://arxiv.org/abs/2602.16493
Authors: Yihao Lu, Wanru Cheng, Zeyu Zhang, Hao Tang
Affiliations: Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the “Visual Placebo Effect”, revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: this https URL.
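A minimal sketch, under our own assumptions, of the reliability scoring the abstract describes: combining source credibility, temporal decay, and a consensus term, then abstaining when aggregate support is weak. The half-life, mapping of agreement to [0, 1], and abstention threshold below are hypothetical, not taken from the paper.

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryItem:
    content: str
    credibility: float   # in [0, 1], trust in the item's source
    age_hours: float     # time since the item was written
    agreement: float     # in [-1, 1], net consensus with related memories

def reliability(item: MemoryItem, half_life: float = 24.0) -> float:
    decay = math.exp(-math.log(2) * item.age_hours / half_life)  # temporal decay
    consensus = 0.5 * (1.0 + item.agreement)                     # map [-1,1] -> [0,1]
    return item.credibility * decay * consensus

def answer_or_abstain(items, threshold: float = 0.25):
    scored = [(reliability(it), it) for it in items]
    support = sum(s for s, _ in scored)
    if support < threshold:
        return None, scored            # abstain: evidence too weak or conflicting
    # Reweight: the most reliable item dominates the final answer.
    return max(scored, key=lambda x: x[0])[1].content, scored

items = [MemoryItem("meeting moved to 3pm", 0.9, 2.0, 0.5),
         MemoryItem("meeting is at 1pm", 0.4, 72.0, -0.5)]
print(answer_or_abstain(items)[0])     # recent, credible, agreed-upon item wins
```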
[CV-15] Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
【Quick Read】: This paper addresses the poor performance of Large Vision-Language Models (LVLMs) on perception-heavy tasks such as chart parsing, where existing models frequently exhibit data omission, misalignment, and hallucination on visually dense charts. The key to the solution is a new paradigm called Visual Self-Refine (VSR): the model generates pixel-level localization outputs, visualizes them, and feeds the visualizations back to itself, allowing it to intuitively inspect and correct its own visual perception errors. Instantiated for chart parsing as ChartVSR, the process splits into a Refine Stage and a Decode Stage: the Refine Stage iteratively uses visual feedback to ensure the pixel-level localization of every data point is accurate, and the Decode Stage uses these verified localizations as precise visual anchors to extract the final structured data.
Link: https://arxiv.org/abs/2602.16455
Authors: Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin
Affiliations: The Chinese University of Hong Kong; Shanghai AI Laboratory; CPII under InnoHK; Shanghai Innovation Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
[CV-16] Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems
【Quick Read】: This paper addresses the central challenge of designing multilingual OCR systems for India: balancing linguistic diversity, document heterogeneity, and deployment constraints. The key to the solution is a comparison of two training strategies: (1) end-to-end training of a generic vision encoder paired with a strong multilingual language model, and (2) fine-tuning an existing OCR model even though it was not pretrained for the target languages. Experiments show the second strategy achieves better accuracy-latency trade-offs; in particular, Chitrapathak-2 delivers a 3-6x speedup while remaining state-of-the-art, with the best Telugu result (6.69 char ANLS) and second best elsewhere. The paper also presents Parichay, a model series specialized for structured key-field extraction from 9 types of Indian government documents, reaching an 89.8% Exact Match score with faster inference and offering practical guidance for building production-scale OCR pipelines.
Link: https://arxiv.org/abs/2602.16430
Authors: Ali Faraz, Raja Kolla, Ashish Kulkarni, Shubham Agarwal
Affiliations: Krutrim AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, despite not being trained for the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves 3-6x speedup over its predecessor with being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving 89.8% Exact Match score with a faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.
[CV-17] ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
【Quick Read】: This paper addresses the high computational complexity and information redundancy that multimodal large language models (MLLMs) face in long-video understanding: because self-attention scales quadratically with sequence length, processing a full stream of RGB frames is computationally intractable. The key to the solution is ReMoRa, which operates on compressed representations: a sparse set of RGB keyframes captures visual appearance, while temporal dynamics are encoded as motion representations that serve as a proxy for optical flow, removing the need to decode every frame; a dedicated module denoises the block-based motions and generates fine-grained motion representations, and a feature-compression mechanism keeps overall complexity linear in video length.
Link: https://arxiv.org/abs/2602.16412
Authors: Daichi Yashima, Shuhei Kurita, Yusuke Oda, Komei Sugiura
Affiliations: Keio University; NII; NII LLMC
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To refine the noise and low fidelity of block-based motions, we introduce a module to denoise and generate a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
[CV-18] Parameter-Free Adaptive Multi-Scale Channel-Spatial Attention Aggregation framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired
【Quick Read】: This paper addresses projection diffusion and feature entanglement in monocular 3D Semantic Scene Completion (SSC), which arise because existing methods lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation, limiting structural integrity and semantic consistency. The key to the solution is the Adaptive Multi-scale Attention Aggregation (AMAA) framework: lifted voxel features are jointly calibrated along the semantic and spatial dimensions through parallel channel-spatial attention, while a hierarchical adaptive feature-gating strategy stabilizes multi-scale encoder-decoder fusion, enabling reliable and efficient monocular SSC perception.
Link: https://arxiv.org/abs/2602.16385
Authors: Qi He, XiangXiang Wang, Jingtao Zhang, Yongbin Yu, Hongxiang Chu, Manping Fan, JingYe Cai, Zhenglin Yang
Affiliations: University of Electronic Science and Technology of China; Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital; Chinese Academy of Medical Sciences; Qinghai Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures, 5 tables
Abstract:In indoor assistive perception for visually impaired users, 3D Semantic Scene Completion (SSC) is expected to provide structurally coherent and semantically consistent occupancy under strictly monocular vision for safety-critical scene understanding. However, existing monocular SSC approaches often lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation during 2D-3D projection and multi-scale fusion, making them vulnerable to projection diffusion and feature entanglement and thus limiting structural integrity. To address these challenges, this paper presents an Adaptive Multi-scale Attention Aggregation (AMAA) framework built upon the MonoScene pipeline. Rather than introducing a heavier backbone, AMAA focuses on reliability-oriented feature regulation within a monocular SSC framework. Specifically, lifted voxel features are jointly calibrated in semantic and spatial dimensions through parallel channel-spatial attention aggregation, while multi-scale encoder-decoder fusion is stabilized via a hierarchical adaptive feature-gating strategy that regulates information injection across scales. Experiments on the NYUv2 benchmark demonstrate consistent improvements over MonoScene without significantly increasing system complexity: AMAA achieves 27.25% SSC mIoU (+0.31) and 43.10% SC IoU (+0.59). In addition, system-level deployment on an NVIDIA Jetson platform verifies that the complete AMAA framework can be executed stably on embedded hardware. Overall, AMAA improves monocular SSC quality and provides a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.
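A minimal PyTorch sketch, under our own assumptions, of the "parameter-free parallel channel-spatial attention" idea: channel and spatial gates are computed from the same voxel features in parallel (rather than cascaded) and applied multiplicatively. This illustrates the structure only; it is not the authors' AMAA implementation.

```python
import torch

def parallel_channel_spatial_gate(x: torch.Tensor) -> torch.Tensor:
    # x: (B, C, D, H, W) lifted voxel features.
    # Channel gate: parameter-free squeeze via global average pooling.
    c = x.mean(dim=(2, 3, 4), keepdim=True)          # (B, C, 1, 1, 1)
    channel_gate = torch.sigmoid(c)
    # Spatial gate: channel-mean highlights (putatively) reliable voxels.
    s = x.mean(dim=1, keepdim=True)                  # (B, 1, D, H, W)
    spatial_gate = torch.sigmoid(s)
    # Parallel aggregation: both gates modulate the original features,
    # so neither branch can discard information before the other sees it.
    return x * channel_gate * spatial_gate

x = torch.randn(2, 16, 8, 8, 8)
print(parallel_channel_spatial_gate(x).shape)        # torch.Size([2, 16, 8, 8, 8])
```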
[CV-19] Markerless 6D Pose Estimation and Position-Based Visual Servoing for Endoscopic Continuum Manipulators
【Quick Read】: This paper addresses pose estimation and closed-loop control for continuum manipulators in flexible endoscopic surgical systems, where hysteresis, compliance, and limited distal sensing undermine precision, and where traditional vision-based methods suffer from limited geometric observability and high computational overhead that hinder real-time closed-loop use. The key to the solution is a unified framework for markerless stereo 6D pose estimation and position-based visual servoing: a photo-realistic simulation pipeline enables large-scale automatic training with pixel-accurate annotations; a stereo-aware multi-feature fusion network jointly exploits segmentation masks, keypoints, heatmaps, and bounding boxes to enhance geometric observability; a feed-forward rendering-guided refinement module predicts residual pose corrections in a single pass, enforcing geometric consistency without iterative optimization; and self-supervised sim-to-real adaptation further improves real-world performance using unlabeled data. Validation yields a mean translation error of 0.83 mm and rotation error of 2.76° over 1,000 samples, and closed-loop trajectory tracking cuts translation and rotation errors by 85% and 59% versus open-loop control, with high repeatability and no physical markers or embedded sensing.
Link: https://arxiv.org/abs/2602.16365
Authors: Junhyun Park, Chunggil An, Myeongbo Park, Ihsan Ullah, Sihyeong Park, Minho Hwang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 13 figures, 7 tables
Abstract:Continuum manipulators in flexible endoscopic surgical systems offer high dexterity for minimally invasive procedures; however, accurate pose estimation and closed-loop control remain challenging due to hysteresis, compliance, and limited distal sensing. Vision-based approaches reduce hardware complexity but are often constrained by limited geometric observability and high computational overhead, restricting real-time closed-loop applicability. This paper presents a unified framework for markerless stereo 6D pose estimation and position-based visual servoing of continuum manipulators. A photo-realistic simulation pipeline enables large-scale automatic training with pixel-accurate annotations. A stereo-aware multi-feature fusion network jointly exploits segmentation masks, keypoints, heatmaps, and bounding boxes to enhance geometric observability. To enforce geometric consistency without iterative optimization, a feed-forward rendering-based refinement module predicts residual pose corrections in a single pass. A self-supervised sim-to-real adaptation strategy further improves real-world performance using unlabeled data. Extensive real-world validation achieves a mean translation error of 0.83 mm and a mean rotation error of 2.76° across 1,000 samples. Markerless closed-loop visual servoing driven by the estimated pose attains accurate trajectory tracking with a mean translation error of 2.07 mm and a mean rotation error of 7.41°, corresponding to 85% and 59% reductions compared to open-loop control, together with high repeatability in repeated point-reaching tasks. To the best of our knowledge, this work presents the first fully markerless pose-estimation-driven position-based visual servoing framework for continuum manipulators, enabling precise closed-loop control without physical markers or embedded sensing.
[CV-20] Articulated 3D Scene Graphs for Open-World Mobile Manipulation
【Quick Read】: This paper addresses robots' inability to anticipate how objects move in real-world environments, aiming to close the gap between semantics, geometry, and kinematics for long-horizon mobile manipulation. The core challenge is automatically building a semantic-kinematic 3D scene graph of interactable objects from RGB-D sequences to support affordance-driven manipulation. The key to the solution is the MoMa-SG framework: occlusion-robust point tracking temporally segments multi-object interactions and lifts trajectories into 3D; a unified twist-estimation formulation robustly recovers revolute and prismatic joint parameters in a single optimization pass; parent-child reasoning detects contained objects and opening states; and the new Arti4D-Semantic dataset validates the approach, with real-world experiments on a quadruped and a mobile manipulator demonstrating robust manipulation of articulated objects in everyday home environments.
Link: https://arxiv.org/abs/2602.16356
Authors: Martin Büchner, Adrian Röfer, Tim Engelbracht, Tim Welschehold, Zuria Bauer, Hermann Blum, Marc Pollefeys, Abhinav Valada
Affiliations: 1: University of Bonn; 2: ETH Zurich; 3: Max Planck Institute for Intelligent Systems
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Semantics has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with estimated articulations and detect contained objects by reasoning over parent-child relations at identified opening states. We also introduce the novel Arti4D-Semantic dataset, which uniquely combines hierarchical object semantics including parent-child relation labels with object axis annotations across 62 in-the-wild RGB-D sequences containing 600 object interactions and three distinct observation paradigms. We extensively evaluate the performance of MoMa-SG on two datasets and ablate key design choices of our approach. In addition, real-world experiments on both a quadruped and a mobile manipulator demonstrate that our semantic-kinematic scene graphs enable robust manipulation of articulated objects in everyday home environments. We provide code and data at: this https URL.
[CV-21] SCAR: Satellite Imagery-Based Calibration for Aerial Recordings
【Quick Read】: This paper addresses the degradation of aerial visual-inertial system (VIS) calibration over long-term operation due to environmental change and the passage of time, which harms localization accuracy and robustness. The key to the solution is SCAR, which exploits publicly available georeferenced satellite imagery (orthophotos and elevation models) as a persistent global reference: the system's intrinsic and extrinsic parameters are automatically estimated and corrected from 2D–3D correspondences, without dedicated calibration maneuvers or manually surveyed ground control points, enabling long-term autonomous calibration refinement with no manual intervention.
Link: https://arxiv.org/abs/2602.16349
Authors: Henry Hölzemann, Michael Schleiss
Affiliations: Fraunhofer FKIE; University of the Bundeswehr Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:We introduce SCAR, a method for long-term auto-calibration refinement of aerial visual-inertial systems that exploits georeferenced satellite imagery as a persistent global reference. SCAR estimates both intrinsic and extrinsic parameters by aligning aerial images with 2D–3D correspondences derived from publicly available orthophotos and elevation models. In contrast to existing approaches that rely on dedicated calibration maneuvers or manually surveyed ground control points, our method leverages external geospatial data to detect and correct calibration degradation under field deployment conditions. We evaluate our approach on six large-scale aerial campaigns conducted over two years under diverse seasonal and environmental conditions. Across all sequences, SCAR consistently outperforms established baselines (Kalibr, COLMAP, VINS-Mono), reducing median reprojection error by a large margin, and translating these calibration gains into substantially lower visual localization rotation errors and higher pose accuracy. These results demonstrate that SCAR provides accurate, robust, and reproducible calibration over long-term aerial operations without the need for manual intervention.
[CV-22] Subtractive Modulative Network with Learnable Periodic Activations
【Quick Read】: This paper addresses the difficulty of balancing reconstruction accuracy and parameter efficiency in Implicit Neural Representations (INRs): existing methods often need large parameter counts for high-quality reconstruction, which is computationally expensive and hard to scale to complex scenes. The key to the solution is the Subtractive Modulative Network (SMN), a novel architecture inspired by classical subtractive synthesis, built from a learnable periodic activation layer (Oscillator) and a series of modulative mask modules (Filters): the former generates a multi-frequency basis, the latter actively modulate it and generate high-order harmonics, yielding a structured, signal-processing-oriented INR. The design achieves high-fidelity reconstruction (PSNR > 40 dB) on images and on 3D NeRF novel view synthesis with markedly fewer parameters, demonstrating a superior performance-efficiency trade-off.
Link: https://arxiv.org/abs/2602.16337
Authors: Tiou Wang, Zhuoqian Yang, Markus Flierl, Mathieu Salzmann, Sabine Süsstrunk
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures, 3 tables
Abstract:We propose the Subtractive Modulative Network (SMN), a novel, parameter-efficient Implicit Neural Representation (INR) architecture inspired by classical subtractive synthesis. The SMN is designed as a principled signal processing pipeline, featuring a learnable periodic activation layer (Oscillator) that generates a multi-frequency basis, and a series of modulative mask modules (Filters) that actively generate high-order harmonics. We provide both theoretical analysis and empirical validation for our design. Our SMN achieves a PSNR of 40+ dB on two image datasets, comparing favorably against state-of-the-art methods in terms of both reconstruction accuracy and parameter efficiency. Furthermore, consistent advantage is observed on the challenging 3D NeRF novel view synthesis task. Supplementary materials are available at this https URL.
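A compact PyTorch sketch of the oscillator/filter structure the abstract describes: a layer of learnable periodic activations produces a multi-frequency basis, and multiplicative masks modulate it. Layer sizes, the `w0` frequency scale, and the sigmoid gating are our own illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Oscillator(nn.Module):
    """sin(w0 * (Wx + b)) with learnable frequencies, SIREN-style."""
    def __init__(self, in_dim, hidden, w0=30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, hidden)
        self.w0 = w0
    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

class ModulativeFilter(nn.Module):
    """Multiplicative mask that reshapes the basis, generating harmonics,
    loosely analogous to a filter in subtractive synthesis."""
    def __init__(self, hidden):
        super().__init__()
        self.mask = nn.Linear(hidden, hidden)
    def forward(self, h):
        return h * torch.sigmoid(self.mask(h))

model = nn.Sequential(Oscillator(2, 64), ModulativeFilter(64),
                      ModulativeFilter(64), nn.Linear(64, 3))
coords = torch.rand(1024, 2)            # 2D pixel coordinates in [0, 1]
print(model(coords).shape)              # torch.Size([1024, 3]) -> RGB values
```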
[CV-23] Guide-Guard: Off-Target Predicting in CRISPR Applications
【Quick Read】: This paper addresses the difficulty of predicting off-target behavior in CRISPR gene editing: when designing guide RNAs (gRNAs), it is hard to anticipate non-specific binding sites across the genome, which affects editing precision and safety. The key to the solution is a machine-learning approach, Guide-Guard, which models the underlying biological and chemical mechanisms from a data-driven perspective and retains 84% prediction accuracy even when trained on multiple different genes simultaneously, making CRISPR system behavior more predictable and practical.
Link: https://arxiv.org/abs/2602.16327
Authors: Joseph Bingham, Netanel Arussy, Saman Zonouz
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 11 figs, accepted to IDEAL 2022
Abstract:With the introduction of cyber-physical genome sequencing and editing technologies, such as CRISPR, researchers can more easily access tools to investigate and create remedies for a variety of topics in genetics and health science (e.g. agriculture and medicine). As the field advances and grows, new concerns present themselves in the ability to predict the off-target behavior. In this work, we explore the underlying biological and chemical model from a data driven perspective. Additionally, we present a machine learning based solution named Guide-Guard to predict the behavior of the system given a gRNA in the CRISPR gene-editing process with 84% accuracy. This solution is able to be trained on multiple different genes at the same time while retaining accuracy.
[CV-24] A Self-Supervised Approach for Enhanced Feature Representations in Object Detection Tasks
【Quick Read】: This paper addresses deep learning models' dependence on large amounts of labeled data for complex tasks such as object detection, which incurs heavy labor and cost in practice. The key to the solution is a self-supervised strategy for training a feature extractor: pretrained on unlabeled data, the model learns markedly better representations and, even with little labeled data, outperforms state-of-the-art feature extractors pretrained on ImageNet and designed specifically for detection. The approach encourages the model to focus on the most relevant visual features of an object, reinforcing reliability and robustness.
Link: https://arxiv.org/abs/2602.16322
Authors: Santiago C. Vilabella, Pablo Pérez-Núñez, Beatriz Remeseiro
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:In the fast-evolving field of artificial intelligence, where models are increasingly growing in complexity and size, the availability of labeled data for training deep learning models has become a significant challenge. Addressing complex problems like object detection demands considerable time and resources for data labeling to achieve meaningful results. For companies developing such applications, this entails extensive investment in highly skilled personnel or costly outsourcing. This research work aims to demonstrate that enhancing feature extractors can substantially alleviate this challenge, enabling models to learn more effective representations with less labeled data. Utilizing a self-supervised learning strategy, we present a model trained on unlabeled data that outperforms state-of-the-art feature extractors pre-trained on ImageNet and particularly designed for object detection tasks. Moreover, the results demonstrate that our approach encourages the model to focus on the most relevant aspects of an object, thus achieving better feature representations and, therefore, reinforcing its reliability and robustness.
[CV-25] Breaking the Sub-Millimeter Barrier: Eyeframe Acquisition from Color Images
【Quick Read】: This paper addresses the low precision, cumbersome workflow, and equipment complexity of traditional mechanical eyeframe lens tracing. The key to the solution is a novel artificial-vision approach that fuses multi-view information for high-precision measurement: images are acquired with an InVision system, frame segmentation isolates the eyeframe from the background, depth estimation provides 3D spatial information, and multi-view processing integrates the segmented RGB images with the depth data. This achieves sub-millimeter frame-contour measurement from still color images without dedicated tracing equipment, greatly simplifying the optician's workflow.
Link: https://arxiv.org/abs/2602.16281
Authors: Manel Guzmán, Antonio Agudo
Affiliations: Horizons Optical; Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CAI 2026
Abstract:Eyeframe lens tracing is an important process in the optical industry that requires sub-millimeter precision to ensure proper lens fitting and optimal vision correction. Traditional frame tracers rely on mechanical tools that need precise positioning and calibration, which are time-consuming and require additional equipment, creating an inefficient workflow for opticians. This work presents a novel approach based on artificial vision that utilizes multi-view information. The proposed algorithm operates on images captured from an InVision system. The full pipeline includes image acquisition, frame segmentation to isolate the eyeframe from background, depth estimation to obtain 3D spatial information, and multi-view processing that integrates segmented RGB images with depth data for precise frame contour measurement. To this end, different configurations and variants are proposed and analyzed on real data, providing competitive measurements from still color images with respect to other solutions, while eliminating the need for specialized tracing equipment and reducing workflow complexity for optical technicians.
[CV-26] AFFMAE: Scalable and Efficient Vision Pretraining for Desktop Graphics Cards
【Quick Read】: This paper addresses the scalability of high-resolution self-supervised pretraining under limited compute, in particular the structural challenges of combining Masked Autoencoders (MAE) with hierarchical downsampling architectures, which are hindered by dense-grid priors and mask-aware design compromises. The key to the solution is the AFFMAE framework: an adaptive, off-grid token-merging mechanism discards masked tokens and dynamically merges only the visible ones, removing dense-grid assumptions while preserving hierarchical scalability; numerically stable mixed-precision Flash-style cluster-attention kernels and deep supervision mitigate sparse-stage representation collapse. The result cuts FLOPs by up to 7x, halves memory usage, and trains faster on a single RTX 5090.
Link: https://arxiv.org/abs/2602.16249
Authors: David Smerkous, Zian Wang, Behzad Najafian
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint
Abstract:Self-supervised pretraining has transformed computer vision by enabling data-efficient fine-tuning, yet high-resolution training typically requires server-scale infrastructure, limiting in-domain foundation model development for many research laboratories. Masked Autoencoders (MAE) reduce computation by encoding only visible tokens, but combining MAE with hierarchical downsampling architectures remains structurally challenging due to dense grid priors and mask-aware design compromises. We introduce AFFMAE, a masking-friendly hierarchical pretraining framework built on adaptive, off-grid token merging. By discarding masked tokens and performing dynamic merging exclusively over visible tokens, AFFMAE removes dense-grid assumptions while preserving hierarchical scalability. We developed numerically stable mixed-precision Flash-style cluster attention kernels, and mitigate sparse-stage representation collapse via deep supervision. On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE performance at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and achieving faster training on a single RTX 5090. Code available at this https URL.
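A minimal sketch of the MAE-style step AFFMAE builds on: drop masked tokens so the encoder only ever sees the visible subset, then merge the survivors. The merge below is a simple hypothetical grouping stand-in for the paper's adaptive off-grid clustering; the mask ratio and group count are illustrative.

```python
import torch

def keep_visible(tokens: torch.Tensor, mask_ratio: float = 0.75):
    # tokens: (B, N, D). Randomly keep (1 - mask_ratio) of the N tokens.
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    noise = torch.rand(B, N)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]                  # (B, n_keep)
    return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

def merge_tokens(visible: torch.Tensor, n_groups: int = 16):
    # Off-grid merging proxy: pool the visible tokens into n_groups groups.
    # (AFFMAE learns the grouping; here we just average fixed splits.)
    groups = torch.tensor_split(visible, n_groups, dim=1)
    return torch.stack([g.mean(dim=1) for g in groups], dim=1)

x = torch.randn(2, 196, 128)          # e.g., 14x14 patch tokens
vis = keep_visible(x)                 # (2, 49, 128) after 75% masking
print(merge_tokens(vis).shape)        # torch.Size([2, 16, 128])
```

Because merging operates only on the 25% of tokens that survive masking, all downstream hierarchical stages run on a far shorter sequence, which is where the FLOP and memory savings come from.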
[CV-27] HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis
【Quick Read】: This paper addresses two problems in existing multimodal fusion frameworks for medical image analysis: high computational cost, which limits use in low-resource settings, and cascaded attention modules that risk information loss between modules and fail to capture robust shared representations across modalities, hurting generalization in multi-disease analysis. The key to the solution is the Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), built on two novel blocks: (a) a computationally efficient residual adaptive learning attention block that captures refined modality-specific representations, and (b) a dual-view cascaded attention block that learns robust shared representations across diverse medical imaging modalities (e.g., MRI, CT), improving both performance and efficiency.
Link: https://arxiv.org/abs/2602.16245
Authors: J. Dhar, M. K. Pandey, D. Chakladar, M. Haghighat, A. Alavi, S. Mistry, N. Zaidi
Affiliations: Indian Institute of Technology Ropar, India; RoentGen Health, India; Lulea University of Technology, Sweden; QUT, Australia; RMIT University, Australia; Curtin University, Australia; Deakin University, Australia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision 2026
Abstract:Multimodal fusion frameworks, which integrate diverse medical imaging modalities (e.g., MRI, CT), have shown great potential in applications such as skin cancer detection, dementia diagnosis, and brain tumor prediction. However, existing multimodal fusion methods face significant challenges. First, they often rely on computationally expensive models, limiting their applicability in low-resource environments. Second, they often employ cascaded attention modules, which potentially increase risk of information loss during inter-module transitions and hinder their capacity to effectively capture robust shared representations across modalities. This restricts their generalization in multi-disease analysis tasks. To address these limitations, we propose a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), composed of two core novel blocks: (a) a computationally efficient residual adaptive learning attention block for capturing refined modality-specific representations, and (b) a dual-view cascaded attention block aimed at learning robust shared representations across diverse modalities. Extensive experiments on ten publicly available datasets exhibit that HyPCA-Net significantly outperforms existing leading methods, with improvements of up to 5.2% in performance and reductions of up to 73.1% in computational cost. Code: this https URL.
[CV-28] EasyControlEdge: A Foundation-Model Fine-Tuning for Edge Detection
【Quick Read】: This paper addresses insufficient edge crispness under limited training data in real-world edge detection (e.g., floor-plan walls, road/building boundaries in satellite imagery, and organ contours in medical scans), i.e., producing crisp raw edge maps data-efficiently. The key to the solution is EasyControlEdge, which adapts an image-generation foundation model to edge detection: an edge-oriented pixel-space loss strengthens the model's capture of edge features, and at inference a guidance mechanism based on unconditional dynamics lets a single model control edge density through a guidance scale. This yields efficient, crisp edge detection with consistent gains on BSDS500, NYUDv2, BIPED, and CubiCasa, particularly under no-post-processing crispness evaluation and with few training samples.
Link: https://arxiv.org/abs/2602.16238
Authors: Hiroki Nakamura, Hiroto Iino, Masashi Okada, Tadahiro Taniguchi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We propose EasyControlEdge, adapting an image-generation foundation model to edge detection. In real-world edge detection (e.g., floor-plan walls, satellite roads/buildings, and medical organ boundaries), crispness and data efficiency are crucial, yet producing crisp raw edge maps with limited training samples remains challenging. Although image-generation foundation models perform well on many downstream tasks, their pretrained priors for data-efficient transfer and iterative refinement for high-frequency detail preservation remain underexploited for edge detection. To enable crisp and data-efficient edge detection using these capabilities, we introduce an edge-specialized adaptation of image-generation foundation models. To better specialize the foundation model for edge detection, we incorporate an edge-oriented objective with an efficient pixel-space loss. At inference, we introduce guidance based on unconditional dynamics, enabling a single model to control the edge density through a guidance scale. Experiments on BSDS500, NYUDv2, BIPED, and CubiCasa compare against state-of-the-art methods and show consistent gains, particularly under no-post-processing crispness evaluation and with limited training data.
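A minimal sketch of guidance based on unconditional dynamics, written in the spirit of classifier-free guidance: one model, two forward passes per denoising step, with a scale that trades off between the unconditional and conditional predictions. The `ToyDenoiser` and the exact combination rule are our own assumptions, illustrating only how a single guidance scale can steer output (here, edge density).

```python
import torch

class ToyDenoiser(torch.nn.Module):
    """Stand-in for the fine-tuned diffusion backbone (hypothetical)."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)
    def forward(self, x, t, cond=None):
        h = self.net(x)
        return h + 0.1 * cond if cond is not None else h

def guided_eps(model, x_t, t, cond, scale=2.0):
    eps_u = model(x_t, t, cond=None)     # unconditional dynamics
    eps_c = model(x_t, t, cond=cond)     # conditioned on the input image
    # Larger scale pushes the sample harder toward the condition;
    # in an edge model this plausibly maps to denser/sparser edge maps.
    return eps_u + scale * (eps_c - eps_u)

model = ToyDenoiser()
x_t, cond = torch.randn(4, 32), torch.randn(4, 32)
print(guided_eps(model, x_t, t=0, cond=cond, scale=2.0).shape)
```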
[CV-29] DataCube: A Video Retrieval Platform via Natural Language Semantic Profiling IJCAI ECAI2026
【Quick Read】: This paper addresses the high cost and inefficiency of turning large-scale video repositories into high-quality, task-specific datasets. The key to the solution is DataCube, an intelligent platform for automatic video processing, multi-dimensional profiling, and query-driven retrieval: it builds structured semantic representations of video clips and supports hybrid retrieval combining neural re-ranking with deep semantic matching, letting users efficiently extract customized subsets from massive collections for training, analysis, and evaluation.
Link: https://arxiv.org/abs/2602.16231
Authors: Yiming Ju, Hanyu Zhao, Quanyue Ma, Donglin Hao, Chengwei Wu, Ming Li, Songjing Wang, Tengfei Pan
Affiliations: Beijing Academy of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is under review for the IJCAI-ECAI 2026 Demonstrations Track
Abstract:Large-scale video repositories are increasingly available for modern video understanding and generation tasks. However, transforming raw videos into high-quality, task-specific datasets remains costly and inefficient. We present DataCube, an intelligent platform for automatic video processing, multi-dimensional profiling, and query-driven retrieval. DataCube constructs structured semantic representations of video clips and supports hybrid retrieval with neural re-ranking and deep semantic matching. Through an interactive web interface, users can efficiently construct customized video subsets from massive repositories for training, analysis, and evaluation, and build searchable systems over their own private video collections. The system is publicly accessible at this https URL. Demo Video: this https URL
[CV-30] Graph neural network for colliding particles with an application to sea ice floe modeling
【Quick Read】: This paper addresses the high computational cost and poor scalability of traditional numerical methods for sea-ice simulation. The key to the solution is using Graph Neural Networks (GNNs) to capture the natural graph structure of sea ice, where nodes represent individual ice pieces and edges model physical interactions such as collisions, yielding the Collision-captured Network (CN), which integrates data assimilation (DA) to learn and predict sea-ice dynamics under various conditions. Validated on synthetic data, the model accelerates trajectory simulation without compromising accuracy, offering a more efficient forecasting tool for the marginal ice zone (MIZ).
Link: https://arxiv.org/abs/2602.16213
Authors: Ruibiao Zhu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
Comments:
Abstract:This paper introduces a novel approach to sea ice modeling using Graph Neural Networks (GNNs), utilizing the natural graph structure of sea ice, where nodes represent individual ice pieces, and edges model the physical interactions, including collisions. This concept is developed within a one-dimensional framework as a foundational step. Traditional numerical methods, while effective, are computationally intensive and less scalable. By utilizing GNNs, the proposed model, termed the Collision-captured Network (CN), integrates data assimilation (DA) techniques to effectively learn and predict sea ice dynamics under various conditions. The approach was validated using synthetic data, both with and without observed data points, and it was found that the model accelerates the simulation of trajectories without compromising accuracy. This advancement offers a more efficient tool for forecasting in marginal ice zones (MIZ) and highlights the potential of combining machine learning with data assimilation for more effective and efficient modeling.
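A small NumPy sketch of one message-passing-style update for colliding particles in 1D, mirroring the paper's graph view: nodes are ice pieces, edges connect pairs close enough to interact, and messages are pairwise collision forces. The spring-like repulsion law, radius, and step size are a toy model of our own, not the CN's learned dynamics.

```python
import numpy as np

def collision_step(pos, vel, radius=1.0, k=0.5, dt=0.1):
    n = len(pos)
    force = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            gap = pos[j] - pos[i]
            if abs(gap) < 2 * radius:        # edge exists: pieces overlap
                # Message: repulsion proportional to penetration depth.
                force[i] -= k * np.sign(gap) * (2 * radius - abs(gap))
    vel = vel + dt * force                   # node update from aggregated messages
    pos = pos + dt * vel
    return pos, vel

pos = np.array([0.0, 1.5, 5.0])              # first two pieces start in contact
vel = np.array([0.5, -0.5, 0.0])
for _ in range(10):
    pos, vel = collision_step(pos, vel)
print(pos.round(2), vel.round(2))            # colliding pieces push apart
```

A GNN like CN replaces the hand-written force law with a learned message function while keeping exactly this node/edge structure, which is what lets it amortize the cost of the numerical solver.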
[CV-31] Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking IJCNN2026
【Quick Read】: This paper addresses the computational redundancy of transformer-based single-object trackers on long video sequences: fixed-depth inference runs the full encoder-decoder stack on every frame regardless of visual complexity. The key to the solution is UncL-STARK, which preserves the original architecture while adding uncertainty-aware dynamic depth adaptation: the model is first fine-tuned with random-depth training plus knowledge distillation so it remains robust at multiple intermediate depths; at runtime, a lightweight uncertainty estimate is derived directly from the corner-localization heatmaps, and a feedback-driven policy uses prediction confidence and the temporal coherence of video to select the encoder and decoder depth for the next frame, enabling safe inference-time truncation.
Link: https://arxiv.org/abs/2602.16160
Authors: Patrick Poggi, Divake Kumar, Theja Tulabandhula, Amit Ranjan Trivedi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IJCNN 2026
Abstract:Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder–decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model’s corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on the prediction confidence by exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.
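A minimal sketch of one plausible way to derive uncertainty from a corner heatmap (normalized entropy) and map it to a transformer depth for the next frame. The entropy measure, thresholds, and depth choices are illustrative assumptions; the paper's exact policy may differ.

```python
import torch

def heatmap_uncertainty(heatmap: torch.Tensor) -> float:
    # heatmap: (H, W) unnormalized corner scores; a peaked map = low entropy.
    p = torch.softmax(heatmap.flatten(), dim=0)
    entropy = -(p * (p + 1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(p.numel())))
    return (entropy / max_entropy).item()      # normalized to [0, 1]

def pick_depth(u: float, depths=(2, 4, 6)) -> int:
    if u < 0.55:
        return depths[0]    # confident: truncate early on the next frame
    if u < 0.75:
        return depths[1]
    return depths[2]        # uncertain: run the full stack

hm = torch.zeros(32, 32)
hm[16, 16] = 10.0                              # sharply peaked corner prediction
u = heatmap_uncertainty(hm)
print(round(u, 3), pick_depth(u))              # low entropy -> shallow depth
```

The key point is that the uncertainty signal is free: it reuses the corner heatmaps the tracker already produces, so no auxiliary head is needed.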
[CV-32] Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing
【Quick Read】: This paper addresses systematic, demographic-conditioned biases in instruction-guided image-to-image (I2I) editing, especially identity-preservation failures tied to subject attributes such as race, gender, and age. The study finds that identical edit instructions yield inconsistent outcomes across groups, with two failure modes: Soft Erasure, where edits are silently weakened or ignored, and Stereotype Replacement, where unrequested, stereotype-consistent attributes are introduced. The key to the solution is a controlled benchmark built on a diagnostic prompt set, with vision-language model (VLM) scoring plus human evaluation to quantify these biases. The paper further shows that a prompt-level identity constraint alone substantially reduces demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors in current I2I editors and pointing toward demographically robust editing systems.
Link: https://arxiv.org/abs/2602.16149
Authors: Huichan Seo, Minki Hong, Sieun Choi, Jihie Kim, Jean Oh
Affiliations: Carnegie Mellon University; Dongguk University; Lavoro AI Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 13 figures. Preprint
Abstract:Demographic bias in text-to-image (T2I) generation is well studied, yet demographic-conditioned failures in instruction-guided image-to-image (I2I) editing remain underexplored. We examine whether identical edit instructions yield systematically different outcomes across subject demographics in open-weight I2I editors. We formalize two failure modes: Soft Erasure, where edits are silently weakened or ignored in the output image, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent attributes. We introduce a controlled benchmark that probes demographic-conditioned behavior by generating and editing portraits conditioned on race, gender, and age using a diagnostic prompt set, and evaluate multiple editors with vision-language model (VLM) scoring and human evaluation. Our analysis shows that identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors, including occupation-driven gender inference. Finally, we demonstrate that a prompt-level identity constraint, without model updates, can substantially reduce demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors in current editors. Together, our findings establish identity preservation as a central and demographically uneven failure mode in I2I editing and motivate demographic-robust editing systems. Project page: this https URL
[CV-33] IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models
【Quick Read】: This paper addresses inaccurate answers in open-ended Visual Question Answering (VQA) caused by question ambiguity. The key to the solution is IRIS (Intent Resolution via Inference-time Saccades), a training-free, real-time method that uses eye-tracking data at inference: the fixations closest in time to the moment a user starts verbally asking a question are identified and used for disambiguation. Experiments show the method more than doubles large VLM accuracy on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous ones, with consistent gains across state-of-the-art VLMs of different architectures.
Link: https://arxiv.org/abs/2602.16138
Authors: Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein
Affiliations: UC Santa Barbara; DEVCOM Army Research Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset to use eye movement data for disambiguated VQA, a novel real-time interactive protocol, and an evaluation suite.
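A minimal sketch of the core heuristic the abstract reports: select the fixation whose timestamp is closest to question onset and use it as the disambiguation hint. The data structure and the way the hint is phrased into the prompt are our own assumptions.

```python
from dataclasses import dataclass

@dataclass
class Fixation:
    x: float   # gaze position in image pixels
    y: float
    t: float   # seconds from trial start

def fixation_at_question_onset(fixations, question_onset_t):
    """Return the fixation temporally closest to when speech began."""
    return min(fixations, key=lambda f: abs(f.t - question_onset_t))

fixations = [Fixation(120, 80, 1.2), Fixation(340, 260, 2.9), Fixation(400, 300, 4.1)]
fix = fixation_at_question_onset(fixations, question_onset_t=3.0)

# Hypothetical way to inject the gaze hint into the VQA prompt:
prompt_hint = f"The user was looking near pixel ({fix.x:.0f}, {fix.y:.0f}) when asking."
print(prompt_hint)
```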
[CV-34] CHAI: CacHe Attention Inference for text2video
【Quick Read】: This paper addresses slow inference in text-to-video diffusion models, which stems from the sequential denoising of 3D latents. Existing accelerations either require expensive retraining or rely on heuristic step skipping, which struggles to preserve video quality as the number of denoising steps shrinks. The key to the solution, CHAI, is a Cache Attention mechanism that effectively attends to objects or scenes shared across different inference runs, enabling selective reuse of cached latents. This yields high cache hit rates and high-quality videos with as few as 8 denoising steps; integrated into the full system, CHAI is 1.65x-3.35x faster than the OpenSora 1.2 baseline while maintaining video quality.
Link: https://arxiv.org/abs/2602.16132
Authors: Joel Mathew Cherian, Ashutosh Muralidhara Bharadwaj, Vima Gupta, Anand Padmanabha Iyer
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x - 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.
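A minimal sketch of the cross-inference caching idea: look up previously computed latents by prompt-embedding similarity. Note the simplification: CHAI attends over cached latents via Cache Attention, whereas this toy version simply returns the best-matching entry; the similarity threshold and "copy on hit" policy are our own assumptions.

```python
import torch

class LatentCache:
    def __init__(self, sim_threshold: float = 0.85):
        self.keys, self.latents = [], []
        self.sim_threshold = sim_threshold

    def lookup(self, prompt_emb: torch.Tensor):
        best_sim, best = -1.0, None
        for k, z in zip(self.keys, self.latents):
            sim = torch.cosine_similarity(prompt_emb, k, dim=0).item()
            if sim > best_sim:
                best_sim, best = sim, z
        return best if best_sim >= self.sim_threshold else None

    def insert(self, prompt_emb: torch.Tensor, latent: torch.Tensor):
        self.keys.append(prompt_emb)
        self.latents.append(latent)

cache = LatentCache()
cache.insert(torch.randn(512), torch.randn(8, 16, 16))     # earlier inference
query = cache.keys[0] + 0.05 * torch.randn(512)            # semantically related prompt
hit = cache.lookup(query)
print("cache hit" if hit is not None else "cache miss")
```

On a hit, the reused latents carry the shared scene content, so only a few denoising steps are needed to specialize them to the new prompt, which is where the reported speedup comes from.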
[CV-35] OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis
【Quick Read】: This paper addresses the fragmentation of current Large Vision-Language Models (LVLMs) in CT analysis: slice-driven LVLMs generalize well but lack cross-slice spatial consistency, while volume-driven LVLMs capture 3D semantics but are coarse-grained and poorly compatible with slice inputs; this fragmented modeling paradigm is a major bottleneck for the clinical translation of medical LVLMs. The key to the solution is OmniCT, a unified slice-volume LVLM whose core innovations are: (i) Spatial Consistency Enhancement (SCE), which models cross-slice spatial consistency explicitly via volumetric slice composition and tri-axial positional embedding, with an MoE hybrid projection for efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE), which aligns anatomical regions via segmentation and ROI localization to strengthen lesion- and organ-level semantics; and (iii) the MedEval-CT dataset and hybrid benchmark for unified evaluation. The design jointly improves micro-level detail sensitivity and macro-level spatial reasoning, establishing a new paradigm for cross-modal medical imaging understanding.
Link: https://arxiv.org/abs/2602.16110
Authors: Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang, Jiang Liu, Yihan Xie, Mingjian Gao, Zhenxuan Fan, Zhaocheng Li, Sijing Li, Zhongle Xie, Peng LU, Yueting Zhuang, Yingda Xia, Ling Zhang, Beng Chin Ooi
Affiliations: Zhejiang University; DAMO Academy; Alibaba Group; Hupan Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding: slice-driven LVLMs show strong generalization but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with tri-axial positional embedding that introduces volumetric consistency, and an MoE hybrid projection enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset and hybrid benchmark integrates comprehensive metrics for unified evaluation. OmniCT consistently outperforms existing methods with a substantial margin across diverse clinical tasks and satisfies both micro-level detail sensitivity and macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.
[CV-36] LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization
【Quick Read】: This paper addresses a core bottleneck of discrete image tokenization for scalable visual generation: staying compact enough for efficient latent-space priors while preserving semantic structure and making full use of discrete capacity. Existing quantizers face a sharp trade-off: vector-quantized methods are flexible but suffer from biased straight-through optimization, codebook under-utilization, and representation collapse at large vocabularies, while structured scalar or implicit quantizers guarantee stable, near-complete utilization but are constrained by fixed discretization geometry that fits heterogeneous latent statistics poorly. The key to the solution is Learnable Geometric Quantization (LGQ), which learns the discretization geometry end-to-end: hard nearest-neighbor lookup is replaced with temperature-controlled soft assignments, enabling fully differentiable training while hard assignments are recovered at inference. The assignments correspond to posterior responsibilities of an isotropic Gaussian mixture and minimize a variational free-energy objective that provably converges to nearest-neighbor quantization in the low-temperature limit; a token-level peakedness regularizer and a global usage regularizer encourage confident yet balanced code usage without a rigid grid. On ImageNet, LGQ improves rFID over both FSQ and SimVQ while using far fewer active codes and a much lower effective representation rate.
Link: https://arxiv.org/abs/2602.16086
Authors: Idil Bilge Altun, Mert Onur Cakiroglu, Elham Buxton, Mehmet Dalkilic, Hasan Kurban
Affiliations: Indiana University Bloomington School of Informatics, Computing, and Engineering; University of Illinois Springfield Computer Science; Hamad Bin Khalifa University College of Science and Engineering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Discrete image tokenization is a key bottleneck for scalable visual generation: a tokenizer must remain compact for efficient latent-space priors while preserving semantic structure and using discrete capacity effectively. Existing quantizers face a trade-off: vector-quantized tokenizers learn flexible geometries but often suffer from biased straight-through optimization, codebook under-utilization, and representation collapse at large vocabularies. Structured scalar or implicit tokenizers ensure stable, near-complete utilization by design, yet rely on fixed discretization geometries that may allocate capacity inefficiently under heterogeneous latent statistics. We introduce Learnable Geometric Quantization (LGQ), a discrete image tokenizer that learns discretization geometry end-to-end. LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, enabling fully differentiable training while recovering hard assignments at inference. The assignments correspond to posterior responsibilities of an isotropic Gaussian mixture and minimize a variational free-energy objective, provably converging to nearest-neighbor quantization in the low-temperature limit. LGQ combines a token-level peakedness regularizer with a global usage regularizer to encourage confident yet balanced code utilization without imposing rigid grids. Under a controlled VQGAN-style backbone on ImageNet across multiple vocabulary sizes, LGQ achieves stable optimization and balanced utilization. At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate, achieving comparable fidelity with substantially fewer active entries. Our GitHub repository is available at: this https URL
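A minimal PyTorch sketch of the temperature-controlled soft assignment the abstract describes: responsibilities are a softmax over negative squared distances to the codebook (the Gaussian-mixture posterior with isotropic components), and as the temperature goes to zero this recovers hard nearest-neighbor quantization. Shapes, codebook size, and the temperature value are illustrative.

```python
import torch

def soft_quantize(z, codebook, tau=0.5):
    # z: (N, D) latents; codebook: (K, D) learnable codes.
    d2 = torch.cdist(z, codebook) ** 2          # (N, K) squared distances
    resp = torch.softmax(-d2 / tau, dim=1)      # GMM posterior responsibilities
    z_q = resp @ codebook                       # differentiable "soft" quantization
    hard = codebook[d2.argmin(dim=1)]           # inference-time hard assignment
    return z_q, resp, hard

z = torch.randn(8, 4)
codebook = torch.nn.Parameter(torch.randn(32, 4))
z_q, resp, hard = soft_quantize(z, codebook, tau=0.1)
# Low temperature -> peaked responsibilities -> z_q approaches the hard code.
print(z_q.shape, resp.max(dim=1).values.mean().item())
```

Because gradients flow through `resp` into both the encoder and the codebook, no straight-through estimator is needed, which is the bias the paper attributes standard VQ's training problems to.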
[CV-37] Extracting and Analyzing Rail Crossing Behavior Signatures from Videos using Tensor Methods
【Quick Read】: This paper addresses the safety risks at railway crossings caused by varying driver behavior, focusing on how to identify behavioral patterns shared across locations and times from complex multi-site, multi-period data; traditional approaches analyze each crossing in isolation and cannot surface such commonalities. The key to the solution is a multi-view tensor decomposition framework: TimeSformer extracts video embeddings for three key temporal phases (Approach, Waiting, Clearance), phase-specific similarity matrices are constructed, and non-negative symmetric CP decomposition uncovers latent behavioral components with distinct temporal signatures. The analysis reveals that location determines behavior patterns more strongly than time of day and that Approach-phase behavior is the most discriminative, enabling behavioral clustering of crossings and automated pattern discovery to inform targeted safety interventions.
Link: https://arxiv.org/abs/2602.16057
Authors: Dawon Ahn, Het Patel, Aemal Khattak, Jia Chen, Evangelos E. Papalexakis
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 10 figures. Accepted at InnovaRail 2026
Abstract:Railway crossings present complex safety challenges where driver behavior varies by location, time, and conditions. Traditional approaches analyze crossings individually, limiting the ability to identify shared behavioral patterns across locations. We propose a multi-view tensor decomposition framework that captures behavioral similarities across three temporal phases: Approach (warning activation to gate lowering), Waiting (gates down to train passage), and Clearance (train passage to gate raising). We analyze railway crossing videos from multiple locations using TimeSformer embeddings to represent each phase. By constructing phase-specific similarity matrices and applying non-negative symmetric CP decomposition, we discover latent behavioral components with distinct temporal signatures. Our tensor analysis reveals that crossing location appears to be a stronger determinant of behavior patterns than time of day, and that approach-phase behavior provides particularly discriminative signatures. Visualization of the learned component space confirms location-based clustering, with certain crossings forming distinct behavioral clusters. This automated framework enables scalable pattern discovery across multiple crossings, providing a foundation for grouping locations by behavioral similarity to inform targeted safety interventions.
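A minimal NumPy sketch of non-negative symmetric factorization of one similarity matrix, S ≈ A Aᵀ with A ≥ 0, using damped multiplicative updates (beta = 0.5 is a standard choice for symmetric NMF). The paper's multi-view CP couples several phase-specific matrices at once; this shows a single view only, and the toy data below is ours.

```python
import numpy as np

def sym_nnf(S, rank=2, iters=500, beta=0.5, eps=1e-9):
    """Factor a symmetric non-negative similarity matrix as S ~ A @ A.T."""
    rng = np.random.default_rng(0)
    A = rng.random((S.shape[0], rank))
    for _ in range(iters):
        num = S @ A
        den = A @ (A.T @ A) + eps
        A *= (1 - beta) + beta * (num / den)   # damped multiplicative update
    return A

# Toy similarity matrix: two behavioral clusters of three crossings each.
block = lambda v: np.full((3, 3), v)
S = np.block([[block(0.9), block(0.1)],
              [block(0.1), block(0.9)]])
A = sym_nnf(S, rank=2)
print(np.argmax(A, axis=1))   # crossings grouped by dominant latent component
```

Each column of `A` plays the role of one latent behavioral component; a crossing's loading pattern across components is its "behavior signature".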
[CV-38] MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval WACV
【Quick Read】: This paper addresses the reliability shortfall of deterministic embeddings in vision-language foundation models for high-stakes biomedical applications, particularly cross-modal retrieval between chest X-rays and radiology reports. The key to the solution is MedProbCLIP, a probabilistic vision-language framework that models image and text representations as Gaussian embeddings under a probabilistic contrastive objective, explicitly capturing uncertainty and the many-to-many correspondences between radiographs and clinical narratives; a variational information bottleneck mitigates overconfident predictions, and multi-view radiograph encoding plus multi-section report encoding provide fine-grained supervision during training, while inference needs only a single image and a single report, yielding more reliable, well-calibrated, and robust cross-modal retrieval.
Link: https://arxiv.org/abs/2602.16019
Authors: Ahmad Elallaf, Yu Zhang, Yuktha Priya Masupalli, Jeong Yang, Young Lee, Zechun Cao, Gongbo Liang
Affiliations: Texas A&M University-San Antonio; Boise State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to the 2026 Winter Conference on Applications of Computer Vision (WACV) Workshops
Abstract:Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions, while MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence, yet requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.
[CV-39] BTReport: A Framework for Brain Tumor Radiology Report Generation with Clinically Relevant Features
【Quick Read】: This paper addresses the lack of open paired image-report datasets that has limited radiology report generation (RRG) in neuro-oncology. The key to the solution is the BTReport framework, which decouples RRG into two independent steps: deterministic extraction of interpretable imaging features from the scans, followed by large language models (LLMs) used only for syntactic structuring and narrative formatting, rather than having LLMs handle both image interpretation and report composition. This separation makes reports fully interpretable and less prone to hallucination; experiments show the extracted features predict key clinical outcomes (survival and IDH mutation status), and the generated reports align more closely with clinical reference reports than existing baselines.
Link: https://arxiv.org/abs/2602.16006
Authors: Juampablo E. Heras Rivera, Dickson T. Chen, Tianyi Ren, Daniel K. Low, Asma Ben Abacha, Alberto Santamaria-Pang, Mehmet Kurt
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Medical Imaging with Deep Learning (MIDL) 2026
Abstract:Recent advances in radiology report generation (RRG) have been driven by large paired image-text datasets; however, progress in neuro-oncology has been limited due to a lack of open paired image-report datasets. Here, we introduce BTReport, an open-source framework for brain tumor RRG that constructs natural language radiology reports using deterministically extracted imaging features. Unlike existing approaches that rely on large general-purpose or fine-tuned vision-language models for both image interpretation and report composition, BTReport performs deterministic feature extraction for image analysis and uses large language models only for syntactic structuring and narrative formatting. By separating RRG into a deterministic feature extraction step and a report generation step, the generated reports are completely interpretable and less prone to hallucinations. We show that the features used for report generation are predictive of key clinical outcomes, including survival and IDH mutation status, and reports generated by BTReport are more closely aligned with reference clinical reports than existing baselines for RRG. Finally, we introduce BTReport-BraTS, a companion dataset that augments BraTS imaging with synthetically generated radiology reports produced with BTReport. Code for this project can be found at this https URL.
[CV-40] SAM 3D Body: Robust Full-Body Human Mesh Recovery KR
【Quick Read】: This paper addresses the limited accuracy and generalization of single-image full-body 3D human mesh recovery (HMR), especially its instability under complex in-the-wild conditions. The key to the solution is SAM 3D Body (3DB), built on a new parametric mesh representation, the Momentum Human Rig (MHR), which decouples skeletal structure from surface shape for greater modeling flexibility and accuracy; 3DB uses an encoder-decoder architecture and supports auxiliary prompts (2D keypoints and masks), enabling user-guided inference in the style of the SAM model family and markedly improving robustness and controllability in complex scenes.
Link: https://arxiv.org/abs/2602.15989
Authors: Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, Kris Kitani
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL
Abstract:We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal structure and surface shape. 3DB employs an encoder-decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.
[CV-41] LAND: A Longitudinal Analysis of Neuromorphic Datasets
【Quick Read】: This paper addresses poor data availability, weak standardization, and the limited generalization of synthetic data in neuromorphic engineering. The key contribution is a systematic review of over 423 existing neuromorphic datasets, exposing core obstacles around dataset size, structural inconsistency, and access difficulties, and proposing meta-datasets built from existing ones as a way to reduce the need for new data while mitigating bias introduced by how datasets and tasks are defined, thereby improving reproducibility and algorithm transfer.
Link: https://arxiv.org/abs/2602.15973
Authors: Gregory Cohen, Alexandre Marcireau
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
Comments: The LAND dataset tool can be accessed via this https URL
Abstract:Neuromorphic engineering has a data problem. Despite the meteoric rise in the number of neuromorphic datasets published over the past ten years, the conclusion of a significant portion of neuromorphic research papers still states that there is a need for yet more data and even larger datasets. Whilst this need is driven in part by the sheer volume of data required by modern deep learning approaches, it is also fuelled by the current state of the available neuromorphic datasets and the difficulties in finding them, understanding their purpose, and determining the nature of their underlying task. This is further compounded by practical difficulties in downloading and using these datasets. This review starts by capturing a snapshot of the existing neuromorphic datasets, covering over 423 datasets, and then explores the nature of their tasks and the underlying structure of the presented data. Analysing these datasets shows the difficulties arising from their size, the lack of standardisation, and difficulties in accessing the actual data. This paper also highlights the growth in the size of individual datasets and the complexities involved in working with the data. However, a more important concern is the rise of synthetic datasets, created by either simulation or video-to-events methods. This review explores the benefits of simulated data for testing existing algorithms and applications, highlighting the potential pitfalls for exploring new applications of neuromorphic technologies. This review also introduces the concepts of meta-datasets, created from existing datasets, as a way of both reducing the need for more data, and to remove potential bias arising from defining both the dataset and the task.
[CV-42] B-DENSE: Branching For Dense Ensemble Network Learning ICLR2026
【Quick Read】: This paper targets the high inference latency of diffusion models caused by iterative sampling, as well as the loss of structural information and significant discretization errors that arise because existing distillation techniques keep only sparse intermediate steps. The key of the proposed B-DENSE framework is dense intermediate-trajectory supervision via multi-branch trajectory alignment: the student architecture is modified to output K-fold expanded channels, where each subset corresponds to a discrete intermediate step of the teacher's trajectory, and the branches are trained to simultaneously map to the full sequence of teacher target timesteps. This forces the student to learn a complete path through the solution space from the earliest stages of training and substantially improves image generation quality.
Link: https://arxiv.org/abs/2602.15971
Authors: Cherish Puniani,Tushar Kumar,Arnav Bendre,Gaurav Kumar,Shree Singhi
Affiliations: Indian Institute of Technology, Roorkee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 11 pages, 5 figures, 4 algorithms and 2 tables. Submitted to ICLR 2026 DeLTa workshop and still under review
Abstract:Inspired by non-equilibrium thermodynamics, diffusion models have achieved state-of-the-art performance in generative modeling. However, their iterative sampling nature results in high inference latency. While recent distillation techniques accelerate sampling, they discard intermediate trajectory steps. This sparse supervision leads to a loss of structural information and introduces significant discretization errors. To mitigate this, we propose B-DENSE, a novel framework that leverages multi-branch trajectory alignment. We modify the student architecture to output K-fold expanded channels, where each subset corresponds to a specific branch representing a discrete intermediate step in the teacher’s trajectory. By training these branches to simultaneously map to the entire sequence of the teacher’s target timesteps, we enforce dense intermediate trajectory alignment. Consequently, the student model learns to navigate the solution space from the earliest stages of training, demonstrating superior image generation quality compared to baseline distillation frameworks.
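To make the multi-branch alignment concrete, here is a minimal PyTorch sketch of the dense supervision described above, under our own assumed shapes: the student emits K*C channels and each C-channel slice is matched against one intermediate teacher target. The function name and tensor layout are illustrative, not the authors' released code.

```python
# Minimal sketch of dense multi-branch trajectory alignment (hypothetical
# shapes and names): the student emits K*C channels; each C-channel slice
# is supervised by one intermediate teacher target along the trajectory.
import torch
import torch.nn.functional as F

def dense_alignment_loss(student_out, teacher_targets):
    """student_out: (B, K*C, H, W); teacher_targets: (B, K, C, H, W)."""
    B, KC, H, W = student_out.shape
    K = teacher_targets.shape[1]
    C = KC // K
    branches = student_out.view(B, K, C, H, W)    # one branch per teacher step
    return F.mse_loss(branches, teacher_targets)  # supervise every step densely

B, K, C, H, W = 2, 4, 3, 32, 32
student_out = torch.randn(B, K * C, H, W, requires_grad=True)
teacher_targets = torch.randn(B, K, C, H, W)
loss = dense_alignment_loss(student_out, teacher_targets)
loss.backward()
print(loss.item())
```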
[CV-43] Non-Contact Physiological Monitoring in Pediatric Intensive Care Units via Adaptive Masking and Self-Supervised Learning
【Quick Read】: This paper addresses the challenges of contactless heart-rate monitoring with remote photoplethysmography (rPPG) in Pediatric Intensive Care Units (PICUs), including motion artifacts, occlusions, lighting variation, and the domain shift between laboratory and clinical data. The key of the solution is a self-supervised pretraining framework built on a progressive curriculum strategy: using a VisionMamba architecture, a lightweight Mamba-based controller assigns spatiotemporal importance scores to guide probabilistic patch sampling, gradually increasing reconstruction difficulty while preserving physiological relevance. A teacher-student distillation setup, in which a supervised expert model trained on public datasets provides latent physiological guidance to the student, substantially improves rPPG estimation despite the lack of labeled clinical data, ultimately reducing the mean absolute error (MAE) to 3.2 bpm and outperforming existing methods such as PhysFormer and standard masked autoencoders.
Link: https://arxiv.org/abs/2602.15967
Authors: Mohamed Khalil Ben Salah,Philippe Jouvet,Rita Noumeir
Affiliations: École de Technologie Supérieure, University of Quebec; CHU Sainte-Justine
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Continuous monitoring of vital signs in Pediatric Intensive Care Units (PICUs) is essential for early detection of clinical deterioration and effective clinical decision-making. However, contact-based sensors such as pulse oximeters may cause skin irritation, increase infection risk, and lead to patient discomfort. Remote photoplethysmography (rPPG) offers a contactless alternative to monitor heart rate using facial video, but remains underutilized in PICUs due to motion artifacts, occlusions, variable lighting, and domain shifts between laboratory and clinical data. We introduce a self-supervised pretraining framework for rPPG estimation in the PICU setting, based on a progressive curriculum strategy. The approach leverages the VisionMamba architecture and integrates an adaptive masking mechanism, where a lightweight Mamba-based controller assigns spatiotemporal importance scores to guide probabilistic patch sampling. This strategy dynamically increases reconstruction difficulty while preserving physiological relevance. To address the lack of labeled clinical data, we adopt a teacher-student distillation setup. A supervised expert model, trained on public datasets, provides latent physiological guidance to the student. The curriculum progresses through three stages: clean public videos, synthetic occlusion scenarios, and unlabeled videos from 500 pediatric patients. Our framework achieves a 42% reduction in mean absolute error relative to standard masked autoencoders and outperforms PhysFormer by 31%, reaching a final MAE of 3.2 bpm. Without explicit region-of-interest extraction, the model consistently attends to pulse-rich areas and demonstrates robustness under clinical occlusions and noise.
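As one way to picture the importance-guided masking, the following toy sketch samples patches to mask in proportion to softmax-normalized importance scores. The scores here are random stand-ins for the paper's Mamba-based controller, and all names and ratios are our assumptions.

```python
# Minimal sketch of importance-guided probabilistic patch masking: patches
# are masked by sampling without replacement in proportion to the softmax
# of (here randomly generated) importance scores.
import torch

def sample_mask(importance, mask_ratio=0.75, temperature=1.0):
    """importance: (B, N) scores; returns boolean mask (B, N), True = masked."""
    B, N = importance.shape
    num_masked = int(N * mask_ratio)
    probs = torch.softmax(importance / temperature, dim=-1)
    idx = torch.multinomial(probs, num_masked, replacement=False)
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

scores = torch.randn(2, 196)           # e.g., a 14x14 patch grid
mask = sample_mask(scores, mask_ratio=0.75)
print(mask.sum(dim=1))                 # 147 masked patches per sample
```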
[CV-44] Automated Re-Identification of Holstein-Friesian Cattle in Dense Crowds
【Quick Read】: This paper addresses the sharp drop in re-identification (Re-ID) accuracy for Holstein-Friesian cattle in dense herds (e.g., in milking parlors or pens), where conventional YOLO-based detection breaks down because the animals' outlines overlap. The key solution is a new detect-segment-identify pipeline that uses Open-Vocabulary Weight-free Localisation and the Segment Anything Model (SAM) as pre-processing stages to make detection robust, combined with Re-ID networks for accurate individual re-identification. Experiments show 98.93% identification accuracy in a real working-farm setting, far surpassing existing oriented-bounding-box and SAM-based detection baselines.
Link: https://arxiv.org/abs/2602.15962
Authors: Phoenix Yu,Tilo Burghardt,Andrew W Dowsey,Neill W Campbell
Affiliations: University of Bristol
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 13 figures, 5 tables
Abstract:Holstein-Friesian detection and re-identification (Re-ID) methods capture individuals well when targets are spatially separate. However, existing approaches, including YOLO-based species detection, break down when cows group closely together. This is particularly prevalent for species which have outline-breaking coat patterns. To boost both effectiveness and transferability in this setting, we propose a new detect-segment-identify pipeline that leverages the Open-Vocabulary Weight-free Localisation and Segment Anything models as pre-processing stages alongside Re-ID networks. To evaluate our approach, we publish a collection of nine days of CCTV data filmed on a working dairy farm. Our methodology overcomes detection breakdown in dense animal groupings, resulting in 98.93% accuracy. This significantly outperforms current oriented bounding box-driven, as well as SAM species detection baselines, with accuracy improvements of 47.52% and 27.13%, respectively. We show that unsupervised contrastive learning can build on this to yield 94.82% Re-ID accuracy on our test data. Our work demonstrates that Re-ID in crowded scenarios is both practical and reliable in working farm settings with no manual intervention. Code and dataset are provided for reproducibility.
[CV-45] Position-Aware Scene-Appearance Disentanglement for Bidirectional Photoacoustic Microscopy Registration
【Quick Read】: This paper addresses the degradation of registration quality in bidirectional raster-scanning optical-resolution photoacoustic microscopy (OR-PAM) caused by coupled domain shift and geometric misalignment. Existing methods are constrained by brightness-constancy assumptions, while generative approaches handle domain shift but do not model temporal consistency across frames. The key of the proposed GPEReg-Net framework is to disentangle scene-invariant features from domain-specific appearance codes via Adaptive Instance Normalization (AdaIN), enabling direct image-to-image registration, and to introduce a Global Position Encoding (GPE) module that combines learnable position embeddings with sinusoidal encoding and cross-frame attention, exploiting the temporal structure of sequential acquisitions to strengthen cross-frame coherence and significantly improve registration accuracy and stability.
Link: https://arxiv.org/abs/2602.15959
Authors: Yiwen Wang,Jiahao Qin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures
Abstract:High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing registration methods, constrained by brightness constancy assumptions, achieve limited alignment quality, while recent generative approaches address domain shift through complex architectures that lack temporal awareness across frames. We propose GPEReg-Net, a scene-appearance disentanglement framework that separates domain-invariant scene features from domain-specific appearance codes via Adaptive Instance Normalization (AdaIN), enabling direct image-to-image registration without explicit deformation field estimation. To exploit temporal structure in sequential acquisitions, we introduce a Global Position Encoding (GPE) module that combines learnable position embeddings with sinusoidal encoding and cross-frame attention, allowing the network to leverage context from neighboring frames for improved temporal coherence. On the OR-PAM-Reg-4K benchmark (432 test samples), GPEReg-Net achieves NCC of 0.953, SSIM of 0.932, and PSNR of 34.49dB, surpassing the state-of-the-art by 3.8% in SSIM and 1.99dB in PSNR while maintaining competitive NCC. Code is available at this https URL.
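Since AdaIN is the operator doing the disentanglement above, a minimal sketch may help: content features are normalized to strip their instance statistics, then re-scaled and shifted with statistics taken from an appearance code. This is the standard AdaIN formulation, not GPEReg-Net's exact module.

```python
# Minimal sketch of Adaptive Instance Normalization (AdaIN): normalize
# content features per instance, then re-dress them with appearance stats.
import torch

def adain(content, style_mean, style_std, eps=1e-5):
    """content: (B, C, H, W); style_mean/std: (B, C, 1, 1) appearance stats."""
    mu = content.mean(dim=(2, 3), keepdim=True)
    sigma = content.std(dim=(2, 3), keepdim=True)
    normalized = (content - mu) / (sigma + eps)   # scene features, appearance-free
    return normalized * style_std + style_mean    # re-apply target appearance

c = torch.randn(2, 16, 64, 64)
code = torch.randn(2, 16, 1, 1), torch.rand(2, 16, 1, 1) + 0.5
print(adain(c, *code).shape)                      # torch.Size([2, 16, 64, 64])
```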
[CV-46] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
【Quick Read】: This paper addresses a systematic deficit in the spatial localization abilities of vision-language models (VLMs) for non-textual visual elements. It finds that when filled cells in binary grids lack textual identity, localization accuracy collapses even though the cells are presented as images (solid squares): F1 drops from roughly 84% in the text-symbol condition to below 39%, with each model exhibiting a distinct systematic failure mode (under-counting, over-counting, or template hallucination). The key insight is that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their native visual pathway, exposing a fundamental limitation in their understanding of purely visual spatial structure.
Link: https://arxiv.org/abs/2602.15950
Authors: Yuval Levental
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 2 tables. Workshop-length paper
Abstract:We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types – text symbols (. and #) and filled squares without gridlines – then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder – the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54 points across models, demonstrating that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their native visual pathway. Each model exhibits a distinct failure mode in the squares condition – systematic under-counting (Claude), massive over-counting (ChatGPT), and template hallucination (Gemini) – but all share the same underlying deficit: severely degraded spatial localization for non-textual visual elements.
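The evaluation protocol is easy to reproduce in outline. The following self-contained sketch, our own re-implementation rather than the paper's code, renders a random 15x15 binary grid as '.'/'#' text and scores a transcription with cell-level F1.

```python
# Minimal re-implementation of the grid-transcription evaluation: generate
# a 15x15 binary grid, render it as text symbols, and score cell-level F1.
import random

def make_grid(n=15, density=0.25, seed=0):
    rng = random.Random(seed)
    return [[rng.random() < density for _ in range(n)] for _ in range(n)]

def to_text(grid):
    return "\n".join("".join("#" if c else "." for c in row) for row in grid)

def cell_f1(truth, pred):
    tp = fp = fn = 0
    for rt, rp in zip(truth, pred):
        for t, p in zip(rt, rp):
            tp += t and p
            fp += (not t) and p
            fn += t and (not p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

grid = make_grid()
print(to_text(grid).splitlines()[0])          # first row, e.g. "..#....#......."
flipped = [row[:] for row in grid]
flipped[0][0] = not flipped[0][0]             # simulate one transcription error
print(round(cell_f1(grid, flipped), 3))       # slightly below 1.0
```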
[CV-47] Visual Memory Injection Attacks for Multi-Turn Conversations
【Quick Read】: This paper addresses the security vulnerabilities of generative large vision-language models (LVLMs) in long-context multi-turn conversations, focusing on stealthy Visual Memory Injection (VMI) attacks. Whereas previous attacks target single-turn interactions, a VMI attack uploads a manipulated image so that the model behaves normally on ordinary prompts but outputs a prescribed target message (e.g., adversarial marketing or political persuasion) once a triggering prompt appears, enabling long-term, covert manipulation of users. The key is a stealthy injection mechanism that persists in the model's memory, remaining effective after many conversation turns without disrupting nominal behavior, which exposes the risks LVLMs face in complex interactive settings and calls for better robustness against such attacks.
Link: https://arxiv.org/abs/2602.15927
Authors: Christian Schlarmann,Matthias Hein
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Generative large vision-language models (LVLMs) have recently achieved impressive performance gains, and their user base is growing rapidly. However, the security of LVLMs, in particular in a long-context multi-turn setting, is largely underexplored. In this paper, we consider the realistic scenario in which an attacker uploads a manipulated image to the web/social media. A benign user downloads this image and uses it as input to the LVLM. Our novel stealthy Visual Memory Injection (VMI) attack is designed such that on normal prompts the LVLM exhibits nominal behavior, but once the user gives a triggering prompt, the LVLM outputs a specific prescribed target message to manipulate the user, e.g. for adversarial marketing or political persuasion. Compared to previous work that focused on single-turn attacks, VMI is effective even after a long multi-turn conversation with the user. We demonstrate our attack on several recent open-weight LVLMs. This article thereby shows that large-scale manipulation of users is feasible with perturbed images in multi-turn conversation settings, calling for better robustness of LVLMs against these attacks. We release the source code at this https URL
[CV-48] A Study on Real-time Object Detection using Deep Learning
【Quick Read】: This paper addresses accuracy and efficiency issues in real-time object detection across many application domains, focusing on how deep learning algorithms can improve detection performance. The key of the solution is a systematic analysis and comparison of mainstream deep learning models, such as Faster R-CNN, Mask R-CNN, YOLO, SSD, and RetinaNet, evaluated on open benchmark datasets and through controlled experiments that compare different strategies, revealing the strengths and limitations of each method and offering reusable technical directions for further research.
Link: https://arxiv.org/abs/2602.15926
Authors: Ankita Bose,Jayasravani Bhumireddy,Naveen N
Affiliations: GITAM University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 34 pages, 18 figures
Abstract:Object detection has compelling applications across a range of domains, including human-computer interfaces, security and video surveillance, navigation and road traffic monitoring, transportation systems, industrial automation, healthcare, Augmented Reality (AR) and Virtual Reality (VR), environment monitoring, and activity identification. Real-time object detection in all these areas provides dynamic analysis of visual information that supports immediate decision making. Furthermore, advanced deep learning algorithms leverage progress in the field of object detection, providing more accurate and efficient solutions. Prominent deep learning algorithms for object detection include Faster R-CNN (Region-based Convolutional Neural Network), Mask R-CNN, Cascade R-CNN, YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), and RetinaNet. This article goes into great detail on how deep learning algorithms are used to enhance real-time object recognition. It provides information on the different object detection models available, open benchmark datasets, and studies on the use of object detection models in a range of applications. Additionally, controlled studies are provided to compare various strategies and produce some illuminating findings. Finally, a number of open challenges and approaches are offered as suggestions for further investigation in both the relevant deep learning approaches and object recognition.
[CV-49] World Action Models are Zero-shot Policies
【Quick Read】: This paper addresses the weak generalization of current Vision-Language-Action (VLA) models to unseen physical motions in novel environments: despite strong semantic performance, VLAs degrade markedly on unseen physical actions or scenes. The key of the solution is a World Action Model (WAM) built on a pretrained video diffusion backbone that learns physical dynamics by jointly modeling video and actions and predicting how world states evolve. This design lets DreamZero learn diverse skills efficiently from heterogeneous robot data without relying on repetitive demonstrations, achieving over 2x better generalization to new tasks and environments than state-of-the-art VLAs in real-robot experiments. Moreover, through model and system optimizations, DreamZero enables a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7 Hz, further advancing the practical deployment of embodied intelligence.
Link: https://arxiv.org/abs/2602.15922
Authors: Seonghyeon Ye,Yunhao Ge,Kaiyuan Zheng,Shenyuan Gao,Sihyun Yu,George Kurian,Suneel Indupuru,You Liang Tan,Chuning Zhu,Jiannan Xiang,Ayaan Malik,Kyungmin Lee,William Liang,Nadun Ranawaka,Jiasheng Gu,Yinzhen Xu,Guanzhi Wang,Fengyuan Hu,Avnish Narayan,Johan Bjorck,Jing Wang,Gwanghyun Kim,Dantong Niu,Ruijie Zheng,Yuqi Xie,Jimmy Wu,Qi Wang,Ryan Julian,Danfei Xu,Yilun Du,Yevgen Chebotar,Scott Reed,Jan Kautz,Yuke Zhu,Linxi “Jim” Fan,Joel Jang
Affiliations: NVIDIA
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL
Abstract:State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.
[CV-50] EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery
【Quick Read】: This paper addresses the lack of benchmarks for evaluating the spatial reasoning of multimodal large language models (MLLMs) on Earth imagery. Existing benchmarks focus mainly on 2D spatial grounding, image captioning, and coarse spatial relations (such as simple direction or proximity), and lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. The key of the solution is EarthSpatialBench, a comprehensive benchmark of over 325K question-answer pairs covering: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object, object-pair, and compositional aggregate group queries; and (4) object references via textual descriptions, visual overlays, and explicit geometry coordinates (including 2D bounding boxes, polylines, and polygons), enabling a thorough assessment of MLLM spatial reasoning on Earth imagery.
Link: https://arxiv.org/abs/2602.15918
Authors: Zelin Xu,Yupu Zhang,Saugat Adhikari,Saiful Islam,Tingsong Xiao,Zibo Liu,Shigang Chen,Da Yan,Zhe Jiang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose EarthSpatialBench, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.
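To illustrate the kind of quantitative ground truth such question-answer pairs require, here is a small sketch that computes metric distance and an 8-way compass direction between two georeferenced bounding boxes. The coordinate convention and function names are our own assumptions, not the benchmark's API.

```python
# Minimal sketch of quantitative distance/direction ground truth between
# two georeferenced boxes (hypothetical convention: meters, +y = north).
import math

def centroid(bbox):
    """bbox = (xmin, ymin, xmax, ymax) in meters."""
    return ((bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2)

def distance_direction(a, b):
    (ax, ay), (bx, by) = centroid(a), centroid(b)
    dist = math.hypot(bx - ax, by - ay)
    # compass bearing: 0 degrees = north, increasing clockwise
    bearing = math.degrees(math.atan2(bx - ax, by - ay)) % 360
    names = ["north", "northeast", "east", "southeast",
             "south", "southwest", "west", "northwest"]
    return dist, names[int((bearing + 22.5) // 45) % 8]

print(distance_direction((0, 0, 10, 10), (100, 100, 110, 110)))
# -> (141.42..., 'northeast')
```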
[CV-51] MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
【Quick Read】: This paper addresses inefficient reasoning and reduced answer accuracy in knowledge-based visual question answering (KB-VQA), where externally retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is hard to control and interpret. The key of the proposed MaS-VQA framework is a Mask-and-Select mechanism that jointly prunes irrelevant image regions and weakly relevant knowledge fragments to produce compact, high-signal multimodal knowledge; this filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for more robust answer prediction.
Link: https://arxiv.org/abs/2602.15915
Authors: Xianwei Mao,Kai Ye,Sheng Zhou,Nan Zhang,Haikuan Huang,Bin Li,Jiajun Bu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.
[CV-52] A Comprehensive Survey on Deep Learning-Based LiDAR Super-Resolution for Autonomous Driving
【Quick Read】: This paper addresses the performance bottleneck caused by resolution disparities among LiDAR sensors in autonomous driving: high-resolution LiDAR is expensive, while affordable low-resolution sensors produce sparse point clouds that miss critical details. The key of the solution is deep-learning-based LiDAR super-resolution, which enhances the density and quality of sparse point clouds, bridges sensors of different resolutions, and supports real-time inference and cross-sensor generalization in practical deployments.
Link: https://arxiv.org/abs/2602.15904
Authors: June Moh Goo,Zichao Zeng,Jan Boehm
Affiliations: University College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to The IEEE Intelligent Vehicles Symposium 2026 (IEEE IV 2026)
Abstract:LiDAR sensors are often considered essential for autonomous driving, but high-resolution sensors remain expensive while affordable low-resolution sensors produce sparse point clouds that miss critical details. LiDAR super-resolution addresses this challenge by using deep learning to enhance sparse point clouds, bridging the gap between different sensor types and enabling cross-sensor compatibility in real-world deployments. This paper presents the first comprehensive survey of LiDAR super-resolution methods for autonomous driving. Despite the importance of practical deployment, no systematic review has been conducted until now. We organize existing approaches into four categories: CNN-based architectures, model-based deep unrolling, implicit representation methods, and Transformer and Mamba-based approaches. We establish fundamental concepts including data representations, problem formulation, benchmark datasets and evaluation metrics. Current trends include the adoption of range image representation for efficient processing, extreme model compression and the development of resolution-flexible architectures. Recent research prioritizes real-time inference and cross-sensor generalization for practical deployment. We conclude by identifying open challenges and future research directions for advancing LiDAR super-resolution technology.
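The range-image representation mentioned as a trend can be sketched in a few lines: project each LiDAR point into a 2D grid indexed by azimuth and elevation so that image-style super-resolution networks apply. Field-of-view values and resolutions below are illustrative assumptions.

```python
# Minimal sketch of converting a point cloud to a range image so that
# 2D super-resolution architectures can be applied to LiDAR data.
import numpy as np

def to_range_image(points, h=16, w=512, fov_up=15.0, fov_down=-15.0):
    """points: (N, 3) xyz. Returns (h, w) range image in meters (0 = empty)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(y, x)                       # [-pi, pi]
    elevation = np.arcsin(z / np.maximum(r, 1e-8))   # radians
    u = ((azimuth + np.pi) / (2 * np.pi) * w).astype(int) % w
    fov = np.radians(fov_up) - np.radians(fov_down)
    v = ((np.radians(fov_up) - elevation) / fov * h).astype(int)
    img = np.zeros((h, w))
    valid = (v >= 0) & (v < h)                       # drop points outside the FOV
    img[v[valid], u[valid]] = r[valid]
    return img

pts = np.random.randn(1000, 3) * [10, 10, 1]
print(to_range_image(pts).shape)                     # (16, 512)
```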
[CV-53] Detecting Deepfakes with Multivariate Soft Blending and CLIP-based Image-Text Alignment
【Quick Read】: This paper addresses low accuracy and poor generalization in deepfake face detection caused by large distribution shifts among samples produced by different forgery techniques. The key of the proposed MSBA-CLIP framework is to combine Multivariate and Soft Blending Augmentation (MSBA) with CLIP-guided forgery intensity estimation: images synthesized by blending forgeries from multiple methods with random weights force the model to learn more generalizable features, while a Multivariate Forgery Intensity Estimation (MFIE) module explicitly guides the model to attend to features across different forgery modes and intensities, improving detection robustness and accuracy.
Link: https://arxiv.org/abs/2602.15903
Authors: Jingwei Li,Jiaxin Tong,Pengfei Wu
Affiliations: Zhejiang Gongshang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The proliferation of highly realistic facial forgeries necessitates robust detection methods. However, existing approaches often suffer from limited accuracy and poor generalization due to significant distribution shifts among samples generated by diverse forgery techniques. To address these challenges, we propose a novel Multivariate and Soft Blending Augmentation with CLIP-guided Forgery Intensity Estimation (MSBA-CLIP) framework. Our method leverages the multimodal alignment capabilities of CLIP to capture subtle forgery traces. We introduce a Multivariate and Soft Blending Augmentation (MSBA) strategy that synthesizes images by blending forgeries from multiple methods with random weights, forcing the model to learn generalizable patterns. Furthermore, a dedicated Multivariate Forgery Intensity Estimation (MFIE) module is designed to explicitly guide the model in learning features related to varied forgery modes and intensities. Extensive experiments demonstrate state-of-the-art performance. On in-domain tests, our method improves Accuracy and AUC by 3.32% and 4.02%, respectively, over the best baseline. In cross-domain evaluations across five datasets, it achieves an average AUC gain of 3.27%. Ablation studies confirm the efficacy of both proposed components. While the reliance on a large vision-language model entails higher computational cost, our work presents a significant step towards more generalizable and robust deepfake detection.
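A minimal reading of the blending idea, with hypothetical parameters: forgeries produced by several methods are mixed with random convex (Dirichlet) weights so a detector cannot latch onto any single method's artifacts. This is our sketch of the augmentation, not the authors' implementation.

```python
# Minimal sketch of multivariate soft blending: mix forgeries from several
# methods using random convex weights drawn from a Dirichlet distribution.
import numpy as np

def soft_blend(forgeries, rng=None):
    """forgeries: list of (H, W, 3) float arrays from different forgery methods."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = rng.dirichlet(np.ones(len(forgeries)))        # random convex weights
    blended = sum(wi * f for wi, f in zip(w, forgeries))
    intensity = float(w.max())                        # crude dominant-method intensity
    return blended.clip(0, 1), w, intensity

imgs = [np.random.rand(64, 64, 3) for _ in range(3)]
blended, weights, intensity = soft_blend(imgs)
print(weights, intensity)
```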
[CV-54] Adaptive Illumination Control for Robot Perception
【Quick Read】: This paper addresses the degradation of visual SLAM under low light or high dynamic range, where conventional remedies such as stronger feature extraction, image enhancement, or closed-loop exposure control are limited by the quality of the initially captured image. The key of the proposed closed-loop illumination-control framework, Lightning, lies in three stages: first, a Co-Located Illumination Decomposition (CLID) model is trained to decompose an observation into an ambient component and a light-contribution field, enabling physically consistent synthesis of the scene under multiple light intensities; second, the synthesized data is used to formulate an offline Optimal Intensity Schedule (OIS) problem that trades off SLAM-relevant image utility, power consumption, and temporal smoothness; third, the ideal policy is distilled via behavior cloning into a real-time Illumination Control Policy (ILC) controller that runs online on a mobile robot and adapts to unseen scenes, substantially improving SLAM trajectory robustness while reducing unnecessary illumination power.
Link: https://arxiv.org/abs/2602.15900
Authors: Yash Turkar,Shekoufeh Sadeghi,Karthik Dantu
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Robot perception under low light or high dynamic range is usually improved downstream, via more robust feature extraction, image enhancement, or closed-loop exposure control. However, all of these approaches are limited by the image captured under these conditions. An alternative approach is to utilize a programmable onboard light that adds to ambient illumination and improves the captured images. However, it is not straightforward to predict its impact on image formation. Illumination interacts nonlinearly with depth, surface reflectance, and scene geometry. It can both reveal structure and induce failure modes such as specular highlights and saturation. We introduce Lightning, a closed-loop illumination-control framework for visual SLAM that combines relighting, offline optimization, and imitation learning. This is performed in three stages. First, we train a Co-Located Illumination Decomposition (CLID) relighting model that decomposes a robot observation into an ambient component and a light-contribution field. CLID enables physically consistent synthesis of the same scene under alternative light intensities and thereby creates dense multi-intensity training data without requiring us to repeatedly re-run trajectories. Second, using these synthesized candidates, we formulate an offline Optimal Intensity Schedule (OIS) problem that selects illumination levels over a sequence, trading off SLAM-relevant image utility against power consumption and temporal smoothness. Third, we distill this ideal solution into a real-time controller through behavior cloning, producing an Illumination Control Policy (ILC) that generalizes beyond the initial training distribution and runs online on a mobile robot to command discrete light-intensity levels. Across our evaluation, Lightning substantially improves SLAM trajectory robustness while reducing unnecessary illumination power.
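One way to formalize the OIS trade-off described in the second stage is a per-frame choice of discrete light level that maximizes utility minus power and switching penalties, solvable exactly by dynamic programming. The objective weights and the utility matrix below are entirely our own assumptions, not the paper's formulation.

```python
# Minimal sketch of an offline intensity-schedule trade-off: pick one
# discrete light level per frame, balancing image utility against power
# and level-switching penalties, solved exactly by dynamic programming.
import numpy as np

def ois(utility, power_cost=0.1, smooth_cost=0.5):
    """utility: (T, L) image utility per frame and level; returns level schedule."""
    T, L = utility.shape
    levels = np.arange(L)
    score = utility[0] - power_cost * levels           # best score ending at each level
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # cand[j, i] = score of being at level i, then switching to level j
        cand = score[None, :] - smooth_cost * np.abs(levels[:, None] - levels[None, :])
        back[t] = cand.argmax(axis=1)
        score = cand.max(axis=1) + utility[t] - power_cost * levels
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
print(ois(rng.random((6, 4))))                         # one level per frame
```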
[CV-55] Egocentric Bias in Vision-Language Models
【Quick Read】: This paper addresses a systematic deficit in Level-2 visual perspective taking (L2 VPT) in vision-language models (VLMs), i.e., the ability to simulate how the world appears from another agent's viewpoint. The key of the solution is FlipSet, a diagnostic benchmark that evaluates VLMs on 180-degree rotations of 2D character strings, isolating spatial transformation from 3D scene complexity to precisely probe perspective understanding in social cognition. The results reveal a pronounced egocentric bias in current VLMs: models perform well on theory-of-mind and mental-rotation tasks in isolation but collapse when the two must be integrated, indicating that they lack mechanisms for binding social awareness to spatial operations and exposing fundamental limits of model-based spatial reasoning.
Link: https://arxiv.org/abs/2602.15892
Authors: Maijunxian Wang,Yijiang Li,Bingyang Wang,Tianwei Zhao,Ran Ji,Qingying Gao,Emmy Liu,Hokin Deng,Dezhi Luo
Affiliations: University of California, Berkeley; University of California San Diego; Georgia Institute of Technology & Emory University; Johns Hopkins University; Carnegie Mellon University; University of Michigan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Visual perspective taking–inferring how the world appears from another’s viewpoint–is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent’s perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit–models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
[CV-56] MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
【Quick Read】: This paper addresses the reliance of dense reward design in robotic reinforcement learning (RL) on manual engineering, which limits scalability and automation. Existing reward methods based on vision-language models (VLMs) often misalign with task progress, ground poorly in space, and understand task semantics weakly. The key of the proposed MARVL (Multi-stage guidance for Robotic manipulation via Vision-Language models) is to fine-tune a VLM for spatial and semantic consistency and to decompose tasks into multi-stage subtasks with task-direction projection for trajectory sensitivity, yielding significantly better sample efficiency and robustness in sparse-reward settings.
Link: https://arxiv.org/abs/2602.15872
Authors: Xunlan Zhou,Xuanlin Chen,Shaowei Zhang,Xiangkun Li,ShengHua Wan,Xiaohai Hu,Yuan Lei,Le Gan,De-chuan Zhan
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL (Multi-stAge guidance for Robotic manipulation via Vision-Language models). MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
[CV-57] Reason Navi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation
【Quick Read】: This paper addresses the weak global planning and inefficient exploration of embodied agents that navigate from partial egocentric observations. The key of the proposed human-inspired reason-then-act framework, ReasonNavi, is to couple multimodal large language models (MLLMs) with deterministic planners: semantic goal selection is performed first in a discrete reasoning space converted from a top-down map, after which an online-built occupancy map and a deterministic action planner ground the selected waypoint into executable trajectories, enabling zero-shot navigation without fine-tuning that is scalable, interpretable, and globally aware.
Link: https://arxiv.org/abs/2602.15864
Authors: Yuzhuo Ao,Anbang Wang,Yu-Wing Tai,Chi-Keung Tang
Affiliations: The Hong Kong University of Science and Technology; Dartmouth College
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 6 figures, Project page: this https URL
Abstract:Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations, which restrict global foresight and lead to inefficient exploration. In contrast, humans plan using maps: we reason globally first, then act locally. We introduce ReasonNavi, a human-inspired framework that operationalizes this reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners. ReasonNavi converts a top-down map into a discrete reasoning space by room segmentation and candidate target nodes sampling. An MLLM is then queried in a multi-stage process to identify the candidate most consistent with the instruction (object, image, or text goal), effectively leveraging the model’s semantic reasoning ability while sidestepping its weakness in continuous coordinate prediction. The selected waypoint is grounded into executable trajectories using a deterministic action planner over an online-built occupancy map, while pretrained object detectors and segmenters ensure robust recognition at the goal. This yields a unified zero-shot navigation framework that requires no MLLM fine-tuning, circumvents the brittleness of RL-based policies and scales naturally with foundation model improvements. Across three navigation tasks, ReasonNavi consistently outperforms prior methods that demand extensive training or heavy scene modeling, offering a scalable, interpretable, and globally grounded solution to embodied navigation. Project page: this https URL
[CV-58] Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model
【Quick Read】: This paper addresses the challenge of generating accurate, domain-specific diagnostic text reports from histopathology whole slide images (WSIs), whose core difficulties are the gigapixel scale of WSIs and the need for precise medical terminology. The key of the proposed hierarchical vision-language framework is as follows: multi-resolution pyramidal patches (downsampling factors 2^3 to 2^6) are selected with background and artifacts removed via Laplacian variance and HSV criteria; patch features are extracted with the frozen UNI Vision Transformer pathology foundation model and projected into a 6-layer Transformer decoder that generates diagnostic text via cross-attention; outputs are tokenized with BioGPT to better represent biomedical terminology; and a retrieval-based verification step compares generated reports against a reference corpus using Sentence BERT embeddings, replacing a generated report with the matched ground-truth reference when similarity is high, thereby improving reliability.
Link: https://arxiv.org/abs/2602.16422
Authors: Ahmet Halici,Ece Tugba Cebeci,Musa Balci,Mustafa Cini,Serkan Sokmen
Affiliations: ViseurAI
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages. Equal contribution: Ahmet Halici, Ece Tugba Cebeci, Musa Balci
Abstract:Generating diagnostic text from histopathology whole slide images (WSIs) is challenging due to the gigapixel scale of the input and the requirement for precise, domain specific language. We propose a hierarchical vision language framework that combines a frozen pathology foundation model with a Transformer decoder for report generation. To make WSI processing tractable, we perform multi resolution pyramidal patch selection (downsampling factors 2^3 to 2^6) and remove background and artifacts using Laplacian variance and HSV based criteria. Patch features are extracted with the UNI Vision Transformer and projected to a 6 layer Transformer decoder that generates diagnostic text via cross attention. To better represent biomedical terminology, we tokenize the output using BioGPT. Finally, we add a retrieval based verification step that compares generated reports with a reference corpus using Sentence BERT embeddings; if a high similarity match is found, the generated report is replaced with the retrieved ground truth reference to improve reliability.
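The Laplacian-variance and HSV filtering step can be sketched directly with OpenCV; the thresholds below are placeholders, since the paper's exact criteria are not specified here.

```python
# Minimal sketch of patch filtering for WSI pipelines: keep a patch only if
# it has enough Laplacian detail (not blank) and enough saturation
# (tissue rather than white background). Thresholds are illustrative.
import cv2
import numpy as np

def keep_patch(patch_bgr, lap_thresh=50.0, sat_thresh=30):
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
    lap_var = cv2.Laplacian(gray, cv2.CV_64F).var()   # blur/blankness measure
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    mean_sat = hsv[:, :, 1].mean()                    # background is low-saturation
    return lap_var > lap_thresh and mean_sat > sat_thresh

blank = np.full((256, 256, 3), 240, np.uint8)         # background-like patch
noisy = np.random.randint(0, 255, (256, 256, 3), np.uint8)
print(keep_patch(blank), keep_patch(noisy))           # typically: False True
```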
[CV-59] RefineFormer3D: Efficient 3D Medical Image Segmentation via Adaptive Multi-Scale Transformer with Cross Attention Fusion
【Quick Read】: This paper addresses the difficulty of balancing accuracy and computational efficiency in 3D medical image segmentation, particularly for Transformer-based architectures whose large parameter counts and memory demands hinder clinical deployment. The key of the proposed lightweight hierarchical Transformer, RefineFormer3D, lies in three components: (i) a GhostConv3D-based patch embedding module that reduces redundant feature extraction; (ii) a MixFFN3D module that combines low-rank projections and depthwise convolutions for parameter-efficient feature transformation; and (iii) a cross-attention fusion decoder that adaptively integrates multi-scale skip connections. With only 2.94M parameters, the model reaches average Dice scores of 93.44% on ACDC and 85.9% on BraTS while supporting fast inference (8.35 ms per volume on GPU), meeting the deployment needs of resource-constrained clinical settings.
Link: https://arxiv.org/abs/2602.16320
Authors: Kavyansh Tyagi,Vishwas Rathi,Puneet Goyal
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 5 figures, 7 tables
Abstract:Accurate and computationally efficient 3D medical image segmentation remains a critical challenge in clinical workflows. Transformer-based architectures often demonstrate superior global contextual modeling but at the expense of excessive parameter counts and memory demands, restricting their clinical deployment. We propose RefineFormer3D, a lightweight hierarchical transformer architecture that balances segmentation accuracy and computational efficiency for volumetric medical imaging. The architecture integrates three key components: (i) GhostConv3D-based patch embedding for efficient feature extraction with minimal redundancy, (ii) MixFFN3D module with low-rank projections and depthwise convolutions for parameter-efficient feature extraction, and (iii) a cross-attention fusion decoder enabling adaptive multi-scale skip connection integration. RefineFormer3D contains only 2.94M parameters, substantially fewer than contemporary transformer-based methods. Extensive experiments on ACDC and BraTS benchmarks demonstrate that RefineFormer3D achieves 93.44% and 85.9% average Dice scores respectively, outperforming or matching state-of-the-art methods while requiring significantly fewer parameters. Furthermore, the model achieves fast inference (8.35 ms per volume on GPU) with low memory requirements, supporting deployment in resource-constrained clinical environments. These results establish RefineFormer3D as an effective and scalable solution for practical 3D medical image segmentation.
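As background for component (i), here is a minimal GhostConv3D-style block as we understand the Ghost-module idea in 3D: a thin primary convolution plus cheap depthwise "ghost" features, concatenated. Channel ratios are illustrative and this is not the authors' exact implementation.

```python
# Minimal sketch of a Ghost-style 3D convolution block: a thin primary conv
# produces a few channels; a cheap depthwise conv generates the rest.
import torch
import torch.nn as nn

class GhostConv3D(nn.Module):
    def __init__(self, in_ch, out_ch, ratio=2, kernel=3):
        super().__init__()
        primary = out_ch // ratio
        self.primary = nn.Conv3d(in_ch, primary, kernel, padding=kernel // 2)
        self.cheap = nn.Conv3d(primary, out_ch - primary, kernel,
                               padding=kernel // 2, groups=primary)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        p = self.act(self.primary(x))
        g = self.act(self.cheap(p))           # depthwise: cheap ghost features
        return torch.cat([p, g], dim=1)

x = torch.randn(1, 8, 16, 32, 32)
print(GhostConv3D(8, 32)(x).shape)            # torch.Size([1, 32, 16, 32, 32])
```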
[CV-60] ROIX-Comp: Optimizing X-ray Computed Tomography Imaging Strategy for Data Reduction and Reconstruction HPCA
【Quick Read】: This paper addresses the enormous computational and storage challenges of processing the massive X-ray computed tomography (X-CT) datasets produced at synchrotron facilities in high-performance computing (HPC) environments: their high dimensionality and volume create heavy storage and transmission-bandwidth demands that limit real-time processing and workflow efficiency. The key of the proposed region-of-interest (ROI)-driven extraction framework, ROIX-Comp, is to compress data by intelligently identifying and retaining essential features: a pre-processing stage uses error-bounded quantization to reduce the volume of data to be processed and improve computational efficiency, while the compression stage combines object extraction with multiple state-of-the-art lossless and lossy compressors, achieving a 12.34x relative improvement in compression ratio over standard compression in experiments.
Link: https://arxiv.org/abs/2602.15917
Authors: Amarjit Singh,Kento Sato,Kohei Yoshida,Kentaro Uesugi,Yasumasa Joti,Takaki Hatsui,Andrès Rubio Proaño
Affiliations: RIKEN (R-CCS); The University of Electro-Communications; Japan Synchrotron Radiation Research Institute; RIKEN SPring-8 Center; Centro de Investigación en Mecatrónica y Sistemas Interactivos (MIST), Universidad Tecnológica Indoamérica
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
Comments: 11 pages, SCA/HPCAsia2026
Abstract:In high-performance computing (HPC) environments, particularly in synchrotron radiation facilities, vast amounts of X-ray images are generated. Processing large-scale X-ray Computed Tomography (X-CT) datasets presents significant computational and storage challenges due to their high dimensionality and data volume. Traditional approaches often require extensive storage capacity and high transmission bandwidth, limiting real-time processing capabilities and workflow efficiency. To address these constraints, we introduce a region-of-interest (ROI)-driven extraction framework (ROIX-Comp) that intelligently compresses X-CT data by identifying and retaining only essential features. Our work reduces data volume while preserving critical information for downstream processing tasks. At pre-processing stage, we utilize error-bounded quantization to reduce the amount of data to be processed and therefore improve computational efficiencies. At the compression stage, our methodology combines object extraction with multiple state-of-the-art lossless and lossy compressors, resulting in significantly improved compression ratios. We evaluated this framework against seven X-CT datasets and observed a relative compression ratio improvement of 12.34x compared to the standard compression.
[CV-61] Foundation Models for Medical Imaging: Status Challenges and Directions
【Quick Read】: This paper addresses the limited generalization and narrow task specialization of current models in medical imaging, where traditional deep networks are trained for a single modality, anatomy, or clinical task and transfer poorly across settings. The key of the solution is a systematic review of foundation models (FMs) for medical imaging covering design principles, application scenarios, and forward-looking challenges, offering a technically grounded, clinically aware, and future-facing roadmap for developing FMs that are versatile, reliable, and trustworthy across modalities, organs, and tasks, and for translating them responsibly from research into clinical practice.
Link: https://arxiv.org/abs/2602.15913
Authors: Chuang Niu,Pengwei Wu,Bruno De Man,Ge Wang
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Foundation models (FMs) are rapidly reshaping medical imaging, shifting the field from narrowly trained, task-specific networks toward large, general-purpose models that can be adapted across modalities, anatomies, and clinical tasks. In this review, we synthesize the emerging landscape of medical imaging FMs along three major axes: principles of FM design, applications of FMs, and forward-looking challenges and opportunities. Taken together, this review provides a technically grounded, clinically aware, and future-facing roadmap for developing FMs that are not only powerful and versatile but also trustworthy and ready for responsible translation into clinical practice.
[CV-62] Rotterdam artery-vein segmentation (RAV) dataset
【Quick Read】: This paper addresses the lack of high-quality, diverse, and precisely annotated datasets for machine-learning-based retinal vessel analysis, in particular the inconsistent annotations and wide variation in image quality that weaken model generalization for artery-vein (A/V) classification. The key of the solution is the RAV dataset, a large collection of color fundus images (CFIs) from the Dutch Rotterdam Study comprising 1024x1024 RGB images, contrast-enhanced versions, and RGB-encoded A/V segmentation masks; a custom annotation interface and connectivity-verification tools ensure topologically correct vessel structure, providing a training and evaluation benchmark with real-world diversity and high-precision annotations that supports the development of clinically applicable, generalizable retinal vascular analysis algorithms.
Link: https://arxiv.org/abs/2512.17322
Authors: Jose Vargas Quiros,Bart Liefers,Karin van Garderen,Jeroen Vermeulen,Eyened Reading Center,Caroline Klaver
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.
Artificial Intelligence
[AI-0] Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology
【Quick Read】: This paper asks whether large language models (LLMs), despite strong performance on biology benchmarks, actually improve novices' ability to execute complex experimental procedures (such as a viral reverse genetics workflow) in a real laboratory. The key of the solution is a pre-registered, investigator-blinded randomized controlled trial (n = 153) that systematically compares LLM assistance with Internet-search assistance on novices' completion of laboratory tasks, quantifying the practical, physical-world utility of LLMs and exposing the gap between in silico evaluation and real-world application.
Link: https://arxiv.org/abs/2602.16703
Authors: Shen Zhou Hong,Alex Kleinman,Alyssa Mathiowetz,Adam Howes,Julian Cohen,Suveer Ganta,Alex Letizia,Dora Liao,Deepika Pahari,Xavier Roberts-Gaal,Luca Righetti,Joe Torres
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet, whether this translates to improved human performance in the physical laboratory remains unclear. To address this, we conducted a pre-registered, investigator-blinded, randomized controlled trial (June-August 2025; n = 153) evaluating whether LLMs improve novice performance in tasks that collectively model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rate of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably for the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of pooled data estimates an approximate 1.4-fold increase (95% CrI 0.74-2.62) in success for a “typical” reverse genetics task under LLM assistance. Ordinal regression modelling suggests that participants in the LLM arm were more likely to progress through intermediate steps across all tasks (posterior probability of a positive effect: 81%-96%). Overall, mid-2025 LLMs did not substantially increase novice completion of complex laboratory procedures but were associated with a modest performance benefit. These results reveal a gap between in silico benchmarks and real-world utility, underscoring the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.
[AI-1] SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation
【Quick Read】: This paper addresses the central challenge of automated unit test generation for C: the semantic gap between high-level program intent and rigid syntactic constraints such as pointer arithmetic and manual memory management. Direct intent-to-code synthesis with large language models (LLMs) often exhibits a "leap-to-code" failure mode, where insufficient grounding in program structure and semantics yields non-compilable tests, hallucinated function signatures, low branch coverage, and semantically irrelevant assertions. The key of the proposed neuro-symbolic, scenario-based SPARC framework is to structure LLM reasoning through four stages: (1) Control Flow Graph (CFG) analysis to expose program paths; (2) an Operation Map that grounds LLM reasoning in validated utility helpers; (3) path-targeted test generation to cover critical execution paths; and (4) an iterative self-correcting validation loop driven by compiler and runtime feedback that continuously improves test quality. This design substantially improves coverage and test validity while preserving high readability and maintainability, offering a scalable path for industrial-grade testing of legacy C codebases.
Link: https://arxiv.org/abs/2602.16671
Authors: Jaid Monwar Chowdhury,Chi-An Fu,Reyhaneh Jabbarvand
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures, 4 tables
Abstract:Automated unit test generation for C remains a formidable challenge due to the semantic gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. While Large Language Models (LLMs) exhibit strong generative capabilities, direct intent-to-code synthesis frequently suffers from the leap-to-code failure mode, where models prematurely emit code without grounding in program structure, constraints, and semantics. This will result in non-compilable tests, hallucinated function signatures, low branch coverage, and semantically irrelevant assertions that cannot properly capture bugs. We introduce SPARC, a neuro-symbolic, scenario-based framework that bridges this gap through four stages: (1) Control Flow Graph (CFG) analysis, (2) an Operation Map that grounds LLM reasoning in validated utility helpers, (3) Path-targeted test synthesis, and (4) an iterative, self-correction validation loop using compiler and runtime feedback. We evaluate SPARC on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matching or exceeding the symbolic execution tool KLEE on complex subjects. SPARC retains 94.3% of tests through iterative repair and produces code with significantly higher developer-rated readability and maintainability. By aligning LLM reasoning with program structure, SPARC provides a scalable path for industrial-grade testing of legacy C codebases.
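Stage (4) of SPARC is a compiler-feedback loop; the sketch below shows the general shape of such a loop in Python driving gcc. The `ask_llm` callable is a hypothetical stand-in for the model call, and the whole routine is our illustration rather than SPARC's code.

```python
# Minimal sketch of a compile-and-repair loop: feed gcc diagnostics back to
# a model until the generated C test compiles, or discard it.
import subprocess, tempfile, os

def compiles(c_source):
    with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
        f.write(c_source)
        path = f.name
    try:
        r = subprocess.run(["gcc", "-c", path, "-o", os.devnull],
                           capture_output=True, text=True)
        return r.returncode == 0, r.stderr
    finally:
        os.unlink(path)

def repair_loop(test_source, ask_llm, max_rounds=3):
    for _ in range(max_rounds):
        ok, errors = compiles(test_source)
        if ok:
            return test_source
        test_source = ask_llm(f"Fix this C test:\n{test_source}\nErrors:\n{errors}")
    return None  # discard tests that never compile
```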
[AI-2] Towards a Science of AI Agent Reliability
【Quick Read】: This paper addresses a fundamental limitation of current AI agent evaluation: relying on a single success metric (such as accuracy on standard benchmarks) cannot capture an agent's real-world reliability. Such simplified evaluation ignores consistency across repeated runs, robustness to perturbations, predictability of failures, and whether error severity is bounded. Grounded in safety-critical engineering, the key of the solution is a holistic performance profile of twelve concrete metrics that decomposes agent reliability along four core dimensions: consistency, robustness, predictability, and safety. By systematically characterizing behavior and failure modes along multiple dimensions, the framework exposes hidden flaws that single-score evaluations miss, complements traditional evaluation, and provides actionable tools for understanding how agents perform, degrade, and fail.
Link: https://arxiv.org/abs/2602.16666
Authors: Stephan Rabanser,Sayash Kapoor,Peter Kirgis,Kangheng Liu,Saiteja Utpala,Arvind Narayanan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
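To make one of the four dimensions concrete, here is a toy consistency profile over repeated runs of the same tasks; the three statistics (mean success, worst-case success across runs, and run-to-run flakiness) are our own illustrative definitions, not the paper's twelve metrics.

```python
# Minimal sketch of a run-to-run consistency profile for an agent, computed
# from a binary success matrix over repeated trials of the same tasks.
import numpy as np

def consistency_profile(outcomes):
    """outcomes: (tasks, runs) binary success matrix from repeated runs."""
    outcomes = np.asarray(outcomes, dtype=float)
    mean_success = outcomes.mean()
    always_pass = (outcomes.min(axis=1) == 1).mean()   # worst-case success
    per_task_var = outcomes.var(axis=1).mean()         # run-to-run flakiness
    return {"mean": mean_success, "all_runs_pass": always_pass,
            "flakiness": per_task_var}

runs = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 0]]
print(consistency_profile(runs))
```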
[AI-3] Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments
【Quick Read】: This paper addresses the limited performance of small language models (SLMs) on highly customized industrial tasks, in settings where data-security and budget constraints rule out continuous reliance on public APIs. The core challenge is improving SLM task accuracy and robustness in complex, specialized environments. The key of the solution is to introduce and formally define the Agent Skill process, a framework that structures skill selection and execution to strengthen model reasoning. Empirical studies show that moderately sized SLMs (around 12B-30B parameters) benefit substantially from the approach, and that code-specialized 80B-parameter models can even match closed-source baselines while improving GPU efficiency, providing both a rationale and a practical path for deploying SLM-driven agents.
Link: https://arxiv.org/abs/2602.16653
Authors: Yangjie Xu,Lujun Li,Lama Sleem,Niccolo Gentile,Yewei Song,Yiqun Wang,Siming Ji,Wenbo Wu,Radu State
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Agent Skill framework, now widely and officially supported by major players such as GitHub Copilot, LangChain, and OpenAI, performs especially well with proprietary models by improving context engineering, reducing hallucinations, and boosting task accuracy. Based on these observations, an investigation is conducted to determine whether the Agent Skill paradigm provides similar benefits to small language models (SLMs). This question matters in industrial scenarios where continuous reliance on public APIs is infeasible due to data-security and budget constraints, and where SLMs often show limited generalization in highly customized scenarios. This work introduces a formal mathematical definition of the Agent Skill process, followed by a systematic evaluation of language models of varying sizes across multiple use cases. The evaluation encompasses two open-source tasks and a real-world insurance claims data set. The results show that tiny models struggle with reliable skill selection, while moderately sized SLMs (approximately 12B-30B parameters) benefit substantially from the Agent Skill approach. Moreover, code-specialized variants at around 80B parameters achieve performance comparable to closed-source baselines while improving GPU efficiency. Collectively, these findings provide a comprehensive and nuanced characterization of the capabilities and constraints of the framework, while providing actionable insights for the effective deployment of Agent Skills in SLM-centered environments.
[AI-4] Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System
【Quick Read】: This paper addresses the difficulty of systematically retrieving and reasoning over the large body of experimental knowledge in the polymer literature, much of which is buried in unstructured text with inconsistent terminology. Existing tools extract narrow, study-specific facts in isolation and fail to preserve the cross-study context needed to answer broader scientific questions. The key of the solution is two tailored retrieval-augmented generation (RAG) pipelines: a dense semantic vector-based method (VectorRAG) and a knowledge-graph-based method (GraphRAG). From over 1,000 polyhydroxyalkanoate (PHA) papers, context-preserving paragraph embeddings and a canonicalized structured knowledge graph are built, enabling entity disambiguation and multi-hop reasoning. The results show that GraphRAG achieves higher precision and interpretability while VectorRAG offers broader recall, making the two complementary; expert validation further confirms that the tailored pipelines, GraphRAG in particular, produce evidence-grounded, citation-reliable, and domain-relevant answers, helping researchers navigate the literature, compare findings across studies, and uncover patterns that are hard to identify manually.
Link: https://arxiv.org/abs/2602.16650
Authors: Sonakshi Gupta,Akhlak Mahmood,Wei Xiong,Rampi Ramprasad
Affiliations: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Polymer literature contains a large and growing body of experimental knowledge, yet much of it is buried in unstructured text and inconsistent terminology, making systematic retrieval and reasoning difficult. Existing tools typically extract narrow, study-specific facts in isolation, failing to preserve the cross-study context required to answer broader scientific questions. Retrieval-augmented generation (RAG) offers a promising way to overcome this limitation by combining large language models (LLMs) with external retrieval, but its effectiveness depends strongly on how domain knowledge is represented. In this work, we develop two retrieval pipelines: a dense semantic vector-based approach (VectorRAG) and a graph-based approach (GraphRAG). Using over 1,000 polyhydroxyalkanoate (PHA) papers, we construct context-preserving paragraph embeddings and a canonicalized structured knowledge graph supporting entity disambiguation and multi-hop reasoning. We evaluate these pipelines through standard retrieval metrics, comparisons with general state-of-the-art systems such as GPT and Gemini, and qualitative validation by a domain chemist. The results show that GraphRAG achieves higher precision and interpretability, while VectorRAG provides broader recall, highlighting complementary trade-offs. Expert validation further confirms that the tailored pipelines, particularly GraphRAG, produce well-grounded, citation-reliable responses with strong domain relevance. By grounding every statement in evidence, these systems enable researchers to navigate the literature, compare findings across studies, and uncover patterns that are difficult to extract manually. More broadly, this work establishes a practical framework for building materials science assistants using curated corpora and retrieval design, reducing reliance on proprietary models while enabling trustworthy literature analysis at scale.
[AI-5] Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes
【Quick Read】: This paper addresses a theoretical limitation in the convergence analysis of differential temporal difference (TD) learning without a local clock, which has constrained its use in non-tabular and off-policy settings. Existing guarantees rely on learning-rate schedules tied to state visit counts (a local clock), which is impractical and does not apply under function approximation. The key contributions are twofold: first, a proof that on-policy n-step differential TD converges almost surely under standard diminishing learning rates, with no local clock required; second, three sufficient conditions under which off-policy n-step differential TD also converges, substantially strengthening the theoretical foundations of differential TD and bringing its convergence analysis closer to practical use.
Link: https://arxiv.org/abs/2602.16629
Authors: Ethan Blaser,Jiuqi Wang,Shangtong Zhang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in learning rates tied to state visit counts, which practitioners do not use and which does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy n-step differential TD for any n using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy n-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.
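To fix ideas, here is a minimal tabular sketch of the n = 1 case. The point emphasized by the paper is that the step size `alpha` below is a single standard diminishing rate shared by all states, not a per-state "local clock"; the environment interface `env_step` is a placeholder assumption, and this is a textbook-style illustration rather than the authors' implementation.

```python
import numpy as np

def differential_td(env_step, n_states, steps=100_000, eta=1.0):
    """Tabular on-policy 1-step differential TD (the n = 1 case).

    env_step(s) must return (next_state, reward) under the evaluated policy.
    eta scales the average-reward step size relative to the value step size.
    """
    v = np.zeros(n_states)      # differential value estimates
    r_bar = 0.0                 # running average-reward estimate
    s = 0
    for t in range(1, steps + 1):
        alpha = 1.0 / t ** 0.8  # diminishing rate, shared by all states
        s2, r = env_step(s)
        delta = r - r_bar + v[s2] - v[s]   # differential TD error
        v[s] += alpha * delta
        r_bar += eta * alpha * delta       # average reward learns from the same error
        s = s2
    return v - v.mean(), r_bar  # differential values are defined only up to a constant
```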
[AI-6] A Systematic Evaluation of Sample-Level Tokenization Strategies for MEG Foundation Models
【Quick Read】: This paper addresses the poorly understood impact of discretization ("tokenization") strategies on transformer-based large neuroimaging models (LNMs). The key to the solution is a systematic evaluation of learnable and non-learnable sample-level tokenization methods, notably including a novel learnable tokenizer based on an autoencoder, validated on three public magnetoencephalography (MEG) datasets for signal reconstruction fidelity, downstream task performance, and preservation of subject-specific information. The results show that simple fixed sample-level tokenization strategies already achieve modeling performance comparable to learnable methods, offering a practical and efficient preprocessing choice for developing neural foundation models.
Link: https://arxiv.org/abs/2602.16626
Authors: SungJun Cho,Chetan Gohil,Rukuang Huang,Oiwi Parker Jones,Mark W. Woolrich
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 15 pages, 10 figures, 1 table
Abstract:Recent success in natural language processing has motivated growing interest in large-scale foundation models for neuroimaging data. Such models often require discretization of continuous neural time series data, a process referred to as ‘tokenization’. However, the impact of different tokenization strategies for neural data is currently poorly understood. In this work, we present a systematic evaluation of sample-level tokenization strategies for transformer-based large neuroimaging models (LNMs) applied to magnetoencephalography (MEG) data. We compare learnable and non-learnable tokenizers by examining their signal reconstruction fidelity and their impact on subsequent foundation modeling performance (token prediction, biological plausibility of generated data, preservation of subject-specific information, and performance on downstream tasks). For the learnable tokenizer, we introduce a novel approach based on an autoencoder. Experiments were conducted on three publicly available MEG datasets spanning different acquisition sites, scanners, and experimental paradigms. Our results show that both learnable and non-learnable discretization schemes achieve high reconstruction accuracy and broadly comparable performance across most evaluation criteria, suggesting that simple fixed sample-level tokenization strategies can be used in the development of neural foundation models. The code is available at this https URL.
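As an illustration of what a fixed (non-learnable) sample-level tokenizer can look like, a quantile-based quantizer is sketched below. It is an assumed example of the class of simple schemes the paper evaluates, not the authors' exact method.

```python
import numpy as np

def fit_quantile_tokenizer(x: np.ndarray, vocab_size: int = 256) -> np.ndarray:
    """Fit fixed bin edges from the empirical distribution of samples.

    x: continuous MEG samples (any shape). Returns vocab_size - 1 edges.
    """
    qs = np.linspace(0.0, 1.0, vocab_size + 1)[1:-1]
    return np.quantile(x, qs)

def tokenize(x: np.ndarray, edges: np.ndarray) -> np.ndarray:
    return np.digitize(x, edges)  # integer tokens in [0, len(edges)]

def detokenize(tokens: np.ndarray, edges: np.ndarray) -> np.ndarray:
    # Map each token back to a representative amplitude: bin midpoints,
    # reusing the outermost edges for the two open-ended bins.
    centers = np.concatenate([[edges[0]], (edges[:-1] + edges[1:]) / 2, [edges[-1]]])
    return centers[tokens]
```

Reconstruction fidelity in this scheme is bounded only by bin width, which is one reason such fixed tokenizers can remain competitive with learnable ones.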
[AI-7] Causal and Compositional Abstraction
【Quick Read】: This paper addresses the problem of abstracting from low-level models to high-level explanatory models in scientific practice, with particular attention to preserving causal structure, toward more robust, efficient, and interpretable AI. The central challenge is to formalize abstraction relations between causal models at different levels so as to unify existing notions (constructive causal abstraction, Q-τ consistency, interchange-intervention abstraction, and others). The key to the solution is category theory as a unifying framework: causal models and their queries (such as interventions) are treated as compositional models with given semantics, and two basic kinds of abstraction are defined, downward abstractions (mapping high-level queries to the low level) and upward abstractions (mapping concrete low-level queries such as Do-interventions to the high level). The paper further proposes a new "component-level" notion of abstraction, yielding a strengthened mechanism-level form of constructive causal abstraction with proved characterisation results, and extends abstraction to relations between quantum compositional circuit models and classical causal models, as groundwork for explainable quantum AI.
Link: https://arxiv.org/abs/2602.16612
Authors: Robin Lorenz,Sean Tull
Institutions: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Category Theory (math.CT); Quantum Physics (quant-ph)
Comments:
Abstract:Abstracting from a low level to a more explanatory high level of description, and ideally while preserving causal structure, is fundamental to scientific practice, to causal inference problems, and to robust, efficient and interpretable AI. We present a general account of abstractions between low and high level models as natural transformations, focusing on the case of causal models. This provides a new formalisation of causal abstraction, unifying several notions in the literature, including constructive causal abstraction, Q-τ consistency, abstractions based on interchange interventions, and 'distributed' causal abstractions. Our approach is formalised in terms of category theory, and uses the general notion of a compositional model with a given set of queries and semantics in a monoidal, cd- or Markov category; causal models and their queries such as interventions being special cases. We identify two basic notions of abstraction: downward abstractions mapping queries from high to low level; and upward abstractions, mapping concrete queries such as Do-interventions from low to high. Although usually presented as the latter, we show how common causal abstractions may, more fundamentally, be understood in terms of the former. Our approach also leads us to consider a new stronger notion of 'component-level' abstraction, applying to the individual components of a model. In particular, this yields a novel, strengthened form of constructive causal abstraction at the mechanism-level, for which we prove characterisation results. Finally, we show that abstraction can be generalised to further compositional models, including those with a quantum semantics implemented by quantum circuits, and we take first steps in exploring abstractions between quantum compositional circuit models and high-level classical causal models as a means to explainable quantum AI.
[AI-8] FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving
【Quick Read】: This paper targets head-of-line blocking caused by the compute-intensive prefill phase in large language model (LLM) serving, which delays high-priority requests and causes widespread time-to-first-token (TTFT) SLO violations. Chunked prefill enables interruptibility but carries an inherent trade-off between response latency and throughput: small chunks improve responsiveness at the cost of compute efficiency, while large chunks maximize throughput but worsen blocking. To break this limitation, the paper proposes FlowPrefill, whose core innovation is decoupling preemption granularity from scheduling frequency: first, Operator-Level Preemption exploits operator boundaries for fine-grained execution interruption without the efficiency loss of fixed small chunks; second, Event-Driven Scheduling triggers scheduling decisions only on request arrival or completion, preserving responsiveness while sharply reducing control-plane overhead. With this design, FlowPrefill improves maximum goodput by up to 5.6x over state-of-the-art systems while satisfying heterogeneous SLOs.
Link: https://arxiv.org/abs/2602.16603
Authors: Chia-chi Hsieh,Zan Zong,Xinyang Chen,Jianjiang Li,Jidong Zhai,Lijie Wen
Institutions: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 13 pages
Abstract:The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. While chunked prefill enables interruptibility, it introduces an inherent trade-off between responsiveness and throughput: reducing chunk size improves response latency but degrades computational efficiency, whereas increasing chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism. However, dynamically balancing execution granularity against scheduling overheads remains a key challenge. In this paper, we propose FlowPrefill, a TTFT-goodput-optimized serving system that resolves this conflict by decoupling preemption granularity from scheduling frequency. To achieve adaptive prefill scheduling, FlowPrefill introduces two key innovations: 1) Operator-Level Preemption, which leverages operator boundaries to enable fine-grained execution interruption without the efficiency loss associated with fixed small chunking; and 2) Event-Driven Scheduling, which triggers scheduling decisions only upon request arrival or completion events, thereby supporting efficient preemption responsiveness while minimizing control-plane overhead. Evaluation on real-world production traces shows that FlowPrefill improves maximum goodput by up to 5.6× compared to state-of-the-art systems while satisfying heterogeneous SLOs.
[AI-9] DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows
【Quick Read】: This paper addresses the fragmented data provenance and missing transactional guarantees that arise in scientific data pipelines lacking unified operational mechanisms, which undermine reliable human-agent collaboration. The key to DataJoint 2.0 is the relational workflow model: tables represent workflow steps, rows represent artifacts, and foreign keys prescribe execution order, so that data structure, computational dependencies, and integrity constraints become queryable, enforceable, and machine-readable within a single formal system. This builds the infrastructure for SciOps (Scientific Operations), ensuring that agents can participate in scientific workflows without risking data corruption.
Link: https://arxiv.org/abs/2602.16585
Authors: Dimitri Yatsenko,Thinh T. Nguyen (DataJoint Inc., Houston, USA)
Institutions: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 20 pages, 2 figures, 1 table
Abstract:Operational rigor determines whether human-agent collaboration succeeds or fails. Scientific data pipelines need the equivalent of DevOps – SciOps – yet common approaches fragment provenance across disconnected systems without transactional guarantees. DataJoint 2.0 addresses this gap through the relational workflow model: tables represent workflow steps, rows represent artifacts, foreign keys prescribe execution order. The schema specifies not only what data exists but how it is derived – a single formal system where data structure, computational dependencies, and integrity constraints are all queryable, enforceable, and machine-readable. Four technical innovations extend this foundation: object-augmented schemas integrating relational metadata with scalable object storage, semantic matching using attribute lineage to prevent erroneous joins, an extensible type system for domain-specific formats, and distributed job coordination designed for composability with external orchestration. By unifying data structure, data, and computational transformations, DataJoint creates a substrate for SciOps where agents can participate in scientific workflows without risking data corruption.
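For readers unfamiliar with the relational workflow model, the sketch below uses DataJoint's long-standing Python API: table classes whose `definition` strings declare foreign-key dependencies, and `make` methods that hold the derivation logic. The schema name and `compute_rate` are hypothetical, and DataJoint 2.0's new features (object-augmented schemas, semantic matching) are not shown.

```python
import datajoint as dj

schema = dj.schema("sciops_demo")  # hypothetical schema name

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int
    ---
    raw_file : varchar(255)
    """

@schema
class SpikeRate(dj.Computed):
    definition = """
    -> Recording
    ---
    mean_rate : float
    """

    def make(self, key):
        # Derivation logic lives with the table; the foreign key above
        # fixes execution order and records provenance.
        raw = (Recording & key).fetch1("raw_file")
        rate = compute_rate(raw)  # hypothetical analysis function
        self.insert1({**key, "mean_rate": rate})

# SpikeRate.populate() computes exactly the rows whose upstream inputs exist,
# which is what makes the schema itself a machine-readable workflow.
```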
[AI-10] AIFL: A Global Daily Streamflow Forecasting Model Using Deterministic LSTM Pre-trained on ERA5-Land and Fine-tuned on IFS
【Quick Read】: This paper addresses the performance degradation that data-driven hydrological models suffer when transitioning from historical reanalysis data to operational forecast products, the reanalysis-to-forecast domain shift. The key to the solution is a two-stage training strategy: a deterministic LSTM model (AIFL) is first pre-trained on 40 years of ERA5-Land reanalysis to learn robust hydrological processes, then fine-tuned on 2016-2019 ECMWF Integrated Forecasting System (IFS) control forecasts to adapt to the specific error structures and biases of operational numerical weather prediction. The approach markedly improves the accuracy and reliability of global daily streamflow forecasts, with particularly strong performance on extreme-event detection.
Link: https://arxiv.org/abs/2602.16579
Authors: Maria Luisa Taccari,Kenza Tazi,Oisín M. Morrison,Andreas Grafberger,Juan Colonese,Corentin Carton de Wiart,Christel Prudhomme,Cinzia Mazzetti,Matthew Chantry,Florian Pappenberger
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
Comments:
Abstract:Reliable global streamflow forecasting is essential for flood preparedness and water resource management, yet data-driven models often suffer from a performance gap when transitioning from historical reanalysis to operational forecast products. This paper introduces AIFL (Artificial Intelligence for Floods), a deterministic LSTM-based model designed for global daily streamflow forecasting. Trained on 18,588 basins curated from the CARAVAN dataset, AIFL utilises a novel two-stage training strategy to bridge the reanalysis-to-forecast domain shift. The model is first pre-trained on 40 years of ERA5-Land reanalysis (1980-2019) to capture robust hydrological processes, then fine-tuned on operational Integrated Forecasting System (IFS) control forecasts (2016-2019) to adapt to the specific error structures and biases of operational numerical weather prediction. To our knowledge, this is the first global model trained end-to-end within the CARAVAN ecosystem. On an independent temporal test set (2021-2024), AIFL achieves high predictive skill with a median modified Kling-Gupta Efficiency (KGE’) of 0.66 and a median Nash-Sutcliffe Efficiency (NSE) of 0.53. Benchmarking results show that AIFL is highly competitive with current state-of-the-art global systems, achieving comparable accuracy while maintaining a transparent and reproducible forcing pipeline. The model demonstrates exceptional reliability in extreme-event detection, providing a streamlined and operationally robust baseline for the global hydrological community.
[AI-11] MerLean: An Agentic Framework for Autoformalization in Quantum Computation
【Quick Read】: This paper addresses the difficulty of autoformalizing the mathematics of frontier quantum computing research, that is, efficiently and accurately converting natural-language mathematical statements into verifiable machine code. The key to the solution is MerLean, a fully automated agentic framework that extracts mathematical statements from LaTeX sources, formalizes them into Mathlib-based verifiable Lean 4 code, and translates the results back into human-readable LaTeX for semantic review. This end-to-end pipeline produced 2,050 Lean declarations from three theoretical quantum computing papers, reducing the verification burden to only the newly introduced definitions and axioms, and providing a scalable data engine for machine-verified peer review and for training the next generation of reasoning models.
Link: https://arxiv.org/abs/2602.16554
Authors: Yuanjie Ren,Jinzheng Li,Yidi Qi
Institutions: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
Comments:
Abstract:We introduce MerLean, a fully automated agentic framework for autoformalization in quantum computation. MerLean extracts mathematical statements from LaTeX source files, formalizes them into verified Lean 4 code built on Mathlib, and translates the result back into human-readable LaTeX for semantic review. We evaluate MerLean on three theoretical quantum computing papers producing 2,050 Lean declarations from 114 statements in total. MerLean achieves end-to-end formalization on all three papers, reducing the verification burden to only the newly introduced definitions and axioms. Our results demonstrate that agentic autoformalization can scale to frontier research, offering both a practical tool for machine-verified peer review and a scalable engine for mining high-quality synthetic data to train future reasoning models. Our approach can also be generalized to any other rigorous research in mathematics and theoretical physics.
[AI-12] Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
【Quick Read】: This paper addresses the security risks that jailbreak prompts pose to large language models (LLMs) in agentic systems, especially attacks that evade single-pass detection through long-context hiding, semantic camouflage, and lightweight obfuscation. The key to the solution is RLM-JB, an end-to-end detection framework built on Recursive Language Models (RLMs): a root model orchestrates a bounded analysis program that normalizes and de-obfuscates the input, chunks it to reduce context dilution and guarantee full coverage, screens segments in parallel, and fuses cross-chunk signals to recover split payloads, turning detection into an auditable procedure. On AutoDAN-style adversarial inputs it achieves high detection rates (ASR/Recall 92.5-98.0%) while maintaining high precision (98.99-100%) and low false-positive rates (0.0-2.0%).
Link: https://arxiv.org/abs/2602.16520
Authors: Doron Shavit
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 5 pages and 1 figure. Appendix: an additional 5 pages
Abstract:Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5-98.0%) while maintaining very high precision (98.99-100%) and low false positive rates (0.0-2.0%), highlighting a practical sensitivity-specificity trade-off as the screening backend changes.
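The procedural pipeline can be pictured with a short sketch in which `normalize` and `screen_chunk` stand in for the root- and worker-model calls; both callables, the chunk sizes, and the aggregation thresholds are assumptions for illustration, not the paper's actual prompts or settings.

```python
from concurrent.futures import ThreadPoolExecutor

def rlm_jb_style_screen(text, normalize, screen_chunk, chunk_len=2000, overlap=200):
    """Detection as a procedure: normalize, chunk with overlap, screen in
    parallel, then aggregate so payloads split across chunks are not missed.

    normalize(text) -> de-obfuscated text; screen_chunk(chunk) -> score in [0, 1].
    """
    clean = normalize(text)
    step = chunk_len - overlap  # overlap guarantees coverage across boundaries
    chunks = [clean[i:i + chunk_len] for i in range(0, max(len(clean), 1), step)]
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(screen_chunk, chunks))
    # Cross-chunk composition: flag if any chunk is hot, or several are warm.
    return max(scores) > 0.9 or sum(s > 0.5 for s in scores) >= 2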
[AI-13] Framework of Thoughts: A Foundation Framework for Dynamic and Optimized Reasoning based on Chains Trees and Graphs
【Quick Read】: This paper addresses two core problems with current reasoning frameworks for generative AI: first, existing prompting schemes (Chain of Thought, Tree of Thoughts, Graph of Thoughts) rely on static, problem-specific reasoning structures and lack adaptability to dynamic or unseen problem types; second, these schemes are generally under-optimized in terms of hyperparameters, prompts, runtime, and prompting cost. The key to the solution is Framework of Thoughts (FoT), a general-purpose foundation framework with built-in hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching, which substantially improves the adaptability and runtime efficiency of reasoning schemes; implementing Tree of Thoughts, Graph of Thoughts, and ProbTree within FoT validates its effectiveness at speeding up execution, lowering cost, and improving task performance.
Link: https://arxiv.org/abs/2602.16512
Authors: Felix Fricke,Simon Malberg,Georg Groh
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Prompting schemes such as Chain of Thought, Tree of Thoughts, and Graph of Thoughts can significantly enhance the reasoning capabilities of large language models. However, most existing schemes require users to define static, problem-specific reasoning structures that lack adaptability to dynamic or unseen problem types. Additionally, these schemes are often under-optimized in terms of hyperparameters, prompts, runtime, and prompting cost. To address these limitations, we introduce Framework of Thoughts (FoT)–a general-purpose foundation framework for building and optimizing dynamic reasoning schemes. FoT comes with built-in features for hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching, unlocking the latent performance potential of reasoning schemes. We demonstrate FoT’s capabilities by implementing three popular schemes–Tree of Thoughts, Graph of Thoughts, and ProbTree–within FoT. We empirically show that FoT enables significantly faster execution, reduces costs, and achieves better task scores through optimization. We release our codebase to facilitate the development of future dynamic and efficient reasoning schemes.
[AI-14] Interpretability-by-Design with Accurate Locally Additive Models and Conditional Feature Effects
【Quick Read】: This paper addresses the trade-off between generalized additive models (GAMs), whose predictive power suffers when feature interactions are present, and GA²Ms, which add pairwise interaction terms to improve accuracy at the cost of interpretability. The key to the solution is a new model class, Conditionally Additive Local Models (CALMs), which allows each feature to have multiple independent shape functions active in different subregions of the input space, with regions defined by simple logical conditions (thresholds), so the structure stays locally additive while capturing nonlinear interaction effects with good interpretability. The paper further designs a distillation-based training pipeline that identifies homogeneous regions with limited interactions and fits shape functions via region-aware backfitting, improving both performance and auditability.
Link: https://arxiv.org/abs/2602.16503
Authors: Vasilis Gkolemis,Loukas Kavouras,Dimitrios Kyriakopoulos,Konstantinos Tsopelas,Dimitrios Rontogiannis,Giuseppe Casalicchio,Theodore Dalamagas,Christos Diou
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generalized additive models (GAMs) offer interpretability through independent univariate feature effects but underfit when interactions are present in data. GA²Ms add selected pairwise interactions which improves accuracy, but sacrifices interpretability and limits model auditing. We propose Conditionally Additive Local Models (CALMs), a new model class, that balances the interpretability of GAMs with the accuracy of GA²Ms. CALMs allow multiple univariate shape functions per feature, each active in different regions of the input space. These regions are defined independently for each feature as simple logical conditions (thresholds) on the features it interacts with. As a result, effects remain locally additive while varying across subregions to capture interactions. We further propose a principled distillation-based training pipeline that identifies homogeneous regions with limited interactions and fits interpretable shape functions via region-aware backfitting. Experiments on diverse classification and regression tasks show that CALMs consistently outperform GAMs and achieve accuracy comparable with GA²Ms. Overall, CALMs offer a compelling trade-off between predictive accuracy and interpretability.
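A toy sketch of the model class may help: under assumed thresholds and shape functions, one feature's effect switches between regional shape functions gated by another feature, while prediction within each region stays additive. The concrete numbers below are illustrative, not from the paper.

```python
import numpy as np

def calm_predict(x, effects):
    """Conditionally additive prediction: per-feature shape functions gated
    by threshold conditions on the features they interact with.

    x: 1-D feature vector. effects: list of (feature_idx, condition, shape_fn),
    where condition(x) -> bool selects the regionally active shape function.
    """
    return sum(shape(x[j]) for j, cond, shape in effects if cond(x))

# Feature 0's effect depends on which side of a threshold feature 1 falls:
# an interaction expressed as two locally additive regional effects.
effects = [
    (0, lambda x: x[1] <= 5.0, lambda v: 0.5 * v),
    (0, lambda x: x[1] > 5.0, lambda v: -0.2 * v + 3.0),
    (1, lambda x: True, lambda v: np.log1p(max(v, 0.0))),
]
print(calm_predict(np.array([2.0, 7.0]), effects))  # 2.6 + log1p(7) ~= 4.68
```

Each regional shape function remains a univariate curve that can be plotted and audited, which is the interpretability the abstract refers to.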
[AI-15] Fast and Scalable Analytical Diffusion
【Quick Read】: This paper addresses the scalability bottleneck of analytical diffusion models in training and inference on large datasets: the standard formulation requires a full-dataset scan at every timestep, so computation scales linearly with data size. The key finding is Posterior Progressive Concentration: as the signal-to-noise ratio increases, the effective support of the denoising score shrinks from the global manifold to a local neighborhood. Building on this, the authors design the training-free Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), which uses a coarse-to-fine mechanism to dynamically locate the "golden subset" needed for inference, decoupling inference complexity from dataset size. Theoretically, the sparse approximation provably converges to the exact score; empirically, GoldDiff achieves a 71x speedup on AFHQ with matching or better quality, and scales analytical diffusion to ImageNet-1K for the first time, offering a scalable, training-free paradigm for large-scale generative modeling.
Link: https://arxiv.org/abs/2602.16498
Authors: Xinyi Shang,Peng Sun,Jingyu Lin,Zhiqiang Shen
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Analytical diffusion models offer a mathematically transparent path to generative modeling by formulating the denoising score as an empirical-Bayes posterior mean. However, this interpretability comes at a prohibitive cost: the standard formulation necessitates a full-dataset scan at every timestep, scaling linearly with dataset size. In this work, we present the first systematic study addressing this scalability bottleneck. We challenge the prevailing assumption that the entire training data is necessary, uncovering the phenomenon of Posterior Progressive Concentration: the effective golden support of the denoising score is not static but shrinks asymptotically from the global manifold to a local neighborhood as the signal-to-noise ratio increases. Capitalizing on this, we propose Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), a training-free framework that decouples inference complexity from dataset size. Instead of static retrieval, GoldDiff uses a coarse-to-fine mechanism to dynamically pinpoint the "Golden Subset" for inference. Theoretically, we derive rigorous bounds guaranteeing that our sparse approximation converges to the exact score. Empirically, GoldDiff achieves a 71× speedup on AFHQ while matching or achieving even better performance than full-scan baselines. Most notably, we demonstrate the first successful scaling of analytical diffusion to ImageNet-1K, unlocking a scalable, training-free paradigm for large-scale generative modeling.
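The empirical-Bayes score that makes the full scan expensive, and the subset restriction that motivates GoldDiff, can be written in a few lines. The `topk` selection below is a crude stand-in for the paper's coarse-to-fine golden-subset mechanism, and the variance-preserving noise parameterization is an assumption.

```python
import torch

def posterior_mean_score(x_t, data, alpha_t, sigma_t, k=None):
    """Analytical denoising score as an empirical-Bayes posterior mean.

    data: (N, d) training set; x_t: (d,) noisy sample with
    x_t = alpha_t * x_0 + sigma_t * eps. If k is given, restrict the
    mixture to the k most likely candidates (the 'golden subset' idea).
    """
    logits = -((x_t - alpha_t * data) ** 2).sum(-1) / (2 * sigma_t ** 2)
    if k is not None:                        # shrink support as SNR grows
        idx = logits.topk(k).indices
        data, logits = data[idx], logits[idx]
    w = torch.softmax(logits, dim=0)         # posterior weights over the subset
    x0_hat = (w[:, None] * data).sum(0)      # posterior mean E[x_0 | x_t]
    return (alpha_t * x0_hat - x_t) / sigma_t ** 2   # Tweedie-style score
```

Posterior Progressive Concentration corresponds to the softmax above becoming increasingly peaked at high SNR, so a small `k` loses almost nothing late in sampling.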
[AI-16] Leveraging Large Language Models for Causal Discovery: a Constraint-based Argumentation-driven Approach
【Quick Read】: This paper addresses how to effectively combine observational data with expert knowledge to construct reliable causal graphs in causal discovery, focusing on using large language models (LLMs) as "imperfect experts" to supply semantic structural priors when domain experts are unavailable. The key to the solution is the Causal Assumption-based Argumentation (Causal ABA) framework, which uses symbolic reasoning to guarantee consistency between input constraints and output causal graphs, and integrates semantic structural priors elicited from variable names and descriptions with conditional-independence evidence, combining data-driven and knowledge-guided modeling. Experiments on standard benchmarks and semantically grounded synthetic graphs demonstrate state-of-the-art performance, and a new evaluation protocol is introduced to mitigate LLM memorisation bias in causal discovery tasks.
Link: https://arxiv.org/abs/2602.16481
Authors: Zihao Li,Fabrizio Russo
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 26 pages, including appendix
Abstract:Causal discovery seeks to uncover causal relations from data, typically represented as causal graphs, and is essential for predicting the effects of interventions. While expert knowledge is required to construct principled causal graphs, many statistical methods have been proposed to leverage observational data with varying formal guarantees. Causal Assumption-based Argumentation (ABA) is a framework that uses symbolic reasoning to ensure correspondence between input constraints and output graphs, while offering a principled way to combine data and expertise. We explore the use of large language models (LLMs) as imperfect experts for Causal ABA, eliciting semantic structural priors from variable names and descriptions and integrating them with conditional-independence evidence. Experiments on standard benchmarks and semantically grounded synthetic graphs demonstrate state-of-the-art performance, and we additionally introduce an evaluation protocol to mitigate memorisation bias when assessing LLMs for causal discovery.
[AI-17] GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation
【Quick Read】: This paper addresses the distortion of nearest-neighbor relations caused by the hubness phenomenon in the high-dimensional embedding spaces used for generative model evaluation, which biases distance-based metrics and can mislead judgments of generated-data quality. The key to the solution is Generative ICDM (GICDM), which mitigates hubness bias by correcting local neighborhood estimation for both real and generated data; a multi-scale extension further improves stability and effectiveness in practice. Experiments show that GICDM restores reliable metric behavior and aligns better with human judgment.
Link: https://arxiv.org/abs/2602.16449
Authors: Nicolas Salvy,Hugues Talbot,Bertrand Thirion
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Generative model evaluation commonly relies on high-dimensional embedding spaces to compute distances between samples. We show that dataset representations in these spaces are affected by the hubness phenomenon, which distorts nearest neighbor relationships and biases distance-based metrics. Building on the classical Iterative Contextual Dissimilarity Measure (ICDM), we introduce Generative ICDM (GICDM), a method to correct neighborhood estimation for both real and generated data. We introduce a multi-scale extension to improve empirical behavior. Extensive experiments on synthetic and real benchmarks demonstrate that GICDM resolves hubness-induced failures, restores reliable metric behavior, and improves alignment with human judgment.
[AI-18] RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation
【Quick Read】: This paper addresses the bottleneck that scarce real-world interaction data imposes on general-purpose robotic manipulation, focusing on how to automatically generate diverse, physically plausible manipulation tasks to improve the training quality and generalization of vision-language-action (VLA) models. The key to the solution is RoboGene, a framework that integrates three core components: a diversity-driven sampling strategy for broad task coverage, self-reflection mechanisms that enforce physical constraints to avoid hallucinated instructions, and human-in-the-loop refinement for continuous quality improvement. Experiments show that RoboGene clearly outperforms mainstream foundation models (such as GPT-4o and Gemini 2.5 Pro), and that VLA models pre-trained with it achieve higher success rates and stronger generalization in real-world settings.
Link: https://arxiv.org/abs/2602.16444
Authors: Yixue Zhang(1,2),Kun Wu(1),Zhi Gao(3),Zhen Zhao(1),Pei Ren(1),Zhiyuan Xu(1),Fei Liao(1),Xinhua Wang(1),Shichao Fan(1,4),Di Wu(1,5),Qiuxuan Feng(1,5),Meng Li(1),Zhengping Che(1),Chang Liu(2),Jian Tang(1) ((1) Beijing Innovation Center of Humanoid Robotics, (2) The School of Advanced Manufacturing and Robotics, Peking University, (3) Beijing Institute of Technology, (4) The School of Mechanical Engineering and Automation, Beihang University, (5) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The pursuit of general-purpose robotic manipulation is hindered by the scarcity of diverse, real-world interaction data. Unlike data collection from the web in vision or language, robotic data collection is an active process incurring prohibitive physical costs. Consequently, automated task curation to maximize data value remains a critical yet under-explored challenge. Existing manual methods are unscalable and biased toward common tasks, while off-the-shelf foundation models often hallucinate physically infeasible instructions. To address this, we introduce RoboGene, an agentic framework designed to automate the generation of diverse, physically plausible manipulation tasks across single-arm, dual-arm, and mobile robots. RoboGene integrates three core components: diversity-driven sampling for broad task coverage, self-reflection mechanisms to enforce physical constraints, and human-in-the-loop refinement for continuous improvement. We conduct extensive quantitative analysis and large-scale real-world experiments, collecting datasets of 18k trajectories and introducing novel metrics to assess task quality, feasibility, and diversity. Results demonstrate that RoboGene significantly outperforms state-of-the-art foundation models (e.g., GPT-4o, Gemini 2.5 Pro). Furthermore, real-world experiments show that VLA models pre-trained with RoboGene achieve higher success rates and superior generalization, underscoring the importance of high-quality task generation. Our project is available at this https URL.
[AI-19] Hardware-accelerated graph neural networks: an alternative approach for neuromorphic event-based audio classification and keyword spotting on SoC FPGA
【Quick Read】: This paper addresses the need for efficient, low-latency, low-power local processing as data volumes from embedded edge sensors grow, especially for neuromorphic devices producing discrete event streams, where conventional neural architectures deploy inefficiently on resource-constrained devices and struggle to meet real-time audio requirements. The key to the solution is an FPGA-oriented architecture for event-graph neural networks: an artificial cochlea converts time series into sparse event data to cut memory and compute costs, and quantized models optimize hardware resource use and inference latency. The approach achieves high classification accuracy (up to 92.7% on SHD) with over 10x fewer parameters and only 1.18 W of power, and the first end-to-end FPGA-accelerated event-audio keyword spotting (KWS) system reaches up to 95% word-end detection accuracy at an ultra-low latency of 10.53 microseconds, setting a new benchmark for energy efficiency.
Link: https://arxiv.org/abs/2602.16442
Authors: Kamil Jeziorek,Piotr Wzorek,Krzysztof Blachut,Hiroshi Nakano,Manon Dampfhoffer,Thomas Mesquida,Hiroaki Nishi,Thomas Dalgaty,Tomasz Kryjak
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Under revision in TRETS Journal
Abstract:As the volume of data recorded by embedded edge sensors increases, particularly from neuromorphic devices producing discrete event streams, there is a growing need for hardware-aware neural architectures that enable efficient, low-latency, and energy-conscious local processing. We present an FPGA implementation of event-graph neural networks for audio processing. We utilise an artificial cochlea that converts time-series signals into sparse event data, reducing memory and computation costs. Our architecture was implemented on a SoC FPGA and evaluated on two open-source datasets. For the classification task, our baseline floating-point model achieves 92.7% accuracy on the SHD dataset - only 2.4% below the state of the art - while requiring over 10x and 67x fewer parameters. On SSC, our models achieve 66.9-71.0% accuracy. Compared to FPGA-based spiking neural networks, our quantised model reaches 92.3% accuracy, outperforming them by up to 19.3% while reducing resource usage and latency. For SSC, we report the first hardware-accelerated evaluation. We further demonstrate the first end-to-end FPGA implementation of event-audio keyword spotting, combining graph convolutional layers with recurrent sequence modelling. The system achieves up to 95% word-end detection accuracy, with only 10.53 microseconds of latency and 1.18 W power consumption, establishing a strong benchmark for energy-efficient event-driven KWS.
[AI-20] Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment
【Quick Read】: This paper addresses the neglect of interactions among multiple sensitive attributes in current large language model (LLM) fairness alignment: mitigating bias along a single sensitive attribute may cause bias spillover onto attributes that were not targeted. The key to the solution is a context-aware multi-attribute fairness evaluation framework that uses Direct Preference Optimization and the BBQ benchmark, under ambiguous and disambiguated contexts, to systematically analyze how targeted gender alignment affects fairness across nine sensitive attributes. The study finds that while aggregate metrics improve, fairness in ambiguous contexts degrades significantly (p < 0.001) for attributes such as physical appearance, sexual orientation, and disability status, revealing the risk that single-dimension fairness optimization can worsen unfairness along other dimensions and underscoring the need for fairness evaluation that is both context-sensitive and multi-attribute.
Link: https://arxiv.org/abs/2602.16438
Authors: Eva Paraschou,Line Harder Clemmensen,Sneha Das
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to the BiAlign CHI Workshop 2026
Abstract:Conventional large language model (LLM) fairness alignment largely focuses on mitigating bias along single sensitive attributes, overlooking fairness as an inherently multidimensional and context-specific value. This approach risks creating systems that achieve narrow fairness metrics while exacerbating disparities along untargeted attributes, a phenomenon known as bias spillover. While extensively studied in machine learning, bias spillover remains critically underexplored in LLM alignment. In this work, we investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art LLMs (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B). Using Direct Preference Optimization and the BBQ benchmark, we evaluate fairness under ambiguous and disambiguous contexts. Our findings reveal noticeable bias spillover: while aggregate results show improvements, context-aware analysis exposes significant degradations in ambiguous contexts, particularly for physical appearance (p < 0.001 across all models), sexual orientation, and disability status. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty, highlighting the necessity of context-aware, multi-attribute fairness evaluation frameworks.
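For reference, the per-pair Direct Preference Optimization objective used for the targeted alignment is sketched below. This is standard DPO, not a method introduced by this paper; spillover arises because minimizing it on gender-targeted pairs still moves the policy's behavior on untargeted attributes.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on one preference pair.

    logp_*: summed token log-probs of the chosen (w) / rejected (l) response
    under the policy; ref_logp_*: same quantities under the frozen reference.
    beta controls how far the policy may drift from the reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin)  # minimized when the chosen response is preferred
```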
[AI-21] HAWX: A Hardware-Aware FrameWork for Fast and Scalable ApproXimation of DNNs
【Quick Read】: This paper addresses the prohibitive cost of exhaustively exploring how to integrate heterogeneous approximate computing (AxC) blocks during hardware deployment of deep neural networks (DNNs), an approach that does not scale to large models. The key to the solution is the HAWX framework, which uses multi-level sensitivity scoring (evaluated at the operator, filter, layer, and model abstraction levels) combined with predictive models for accuracy, power, and area to rapidly screen and evaluate candidate configurations, dramatically accelerating the search: filter-level search on LeNet-5 alone is sped up by more than 3×10⁶ while maintaining accuracy comparable to exhaustive search. The hardware-aware search supports heterogeneous accelerator designs with both spatial and temporal architectures and can flexibly use off-the-shelf or customized approximate computing units.
Link: https://arxiv.org/abs/2602.16336
Authors: Samira Nazari,Mohammad Saeed Almasi,Mahdi Taheri,Ali Azarpeyvand,Ali Mokhtari,Ali Mahani,Christian Herglotz
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This work presents HAWX, a hardware-aware scalable exploration framework that employs multi-level sensitivity scoring at different DNN abstraction levels (operator, filter, layer, and model) to guide selective integration of heterogeneous AxC blocks. Supported by predictive models for accuracy, power, and area, HAWX accelerates the evaluation of candidate configurations, achieving over a 23× speedup in a layer-level search with two candidate approximate blocks and a more than 3×10⁶ speedup in a filter-level search for LeNet-5 alone, while maintaining accuracy comparable to exhaustive search. Experiments across state-of-the-art DNN benchmarks such as VGG-11, ResNet-18, and EfficientNetLite demonstrate that the efficiency benefits of HAWX scale exponentially with network size. The HAWX hardware-aware search algorithm supports both spatial and temporal accelerator architectures, leveraging either off-the-shelf approximate components or customized designs.
[AI-22] Spatial Audio Question Answering and Reasoning on Dynamic Source Movements
【Quick Read】: This paper addresses movement reasoning in spatial audio understanding, namely enabling models to infer source trajectories, positions, and direction changes directly from stereo audio. The key contributions are threefold. First, a movement-centric spatial audio augmentation framework synthesizes diverse motion patterns from isolated mono events, producing controllable and scalable training data. Second, an end-to-end multimodal finetuning method with a "thinking mode" lets audio-language models output explicit intermediate reasoning steps before predicting an answer, improving logical interpretability and accuracy. Third, the paper systematically evaluates query-conditioned source separation as a preprocessing stage, comparing three inference regimes (no masking, an Audio Grounding Model (AGM), and ground-truth masks), finding that reasoning amplifies the benefit of source separation, with thinking mode yielding a +5.1% gain when a single event is present. Together these results reveal the interplay among movement modeling, reasoning, and separation quality, offering new directions for advancing spatial audio understanding.
Link: https://arxiv.org/abs/2602.16334
Authors: Arvind Krishna Sridhar,Yinyi Guo,Erik Visser
Institutions: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:
Abstract:Spatial audio understanding aims to enable machines to interpret complex auditory scenes, particularly when sound sources move over time. In this work, we study Spatial Audio Question Answering (Spatial AQA) with a focus on movement reasoning, where a model must infer object motion, position, and directional changes directly from stereo audio. First, we introduce a movement-centric spatial audio augmentation framework that synthesizes diverse motion patterns from isolated mono audio events, enabling controlled and scalable training data generation. Second, we propose an end-to-end multimodal finetuning approach with a thinking mode, which allows audio-language models to produce explicit intermediate reasoning steps before predicting an answer. Third, we investigate the impact of query-conditioned source separation as a preprocessing stage and compare three inference regimes: no masking, an audio grounding model (AGM), and ground-truth masks. Our results show that reasoning amplifies the benefits of source separation, with thinking mode showing significant improvement of +5.1% when a single event is present in the question. These findings highlight the interplay between movement modeling, reasoning, and separation quality, offering new insights for advancing spatial audio understanding.
[AI-23] A Graph Meta-Network for Learning on Kolmogorov-Arnold Networks
【Quick Read】: This paper addresses how to learn efficiently and accurately from neural network parameters in weight-space modeling (e.g., predicting task performance), focusing on the lack of a dedicated weight-space architecture for Kolmogorov-Arnold Networks (KANs); naive methods such as applying MLPs to flattened parameter vectors perform poorly, so better architectural design is needed. The key is first proving that KANs share the permutation symmetries of standard multilayer perceptrons (MLPs), then introducing the KAN-graph as a graph representation of KAN computation, and on this basis building WS-KAN, the first weight-space model designed for KANs. The model naturally exploits KAN symmetries and has provable expressive power (it can replicate an input KAN's forward pass); empirical validation on a "zoo" of KANs trained on diverse tasks shows that WS-KAN consistently and often substantially outperforms structure-agnostic baselines across all tasks.
Link: https://arxiv.org/abs/2602.16316
Authors: Guy Bar-Shalom,Ami Tavory,Itay Evron,Maya Bechler-Speicher,Ido Guy,Haggai Maron
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Weight-space models learn directly from the parameters of neural networks, enabling tasks such as predicting their accuracy on new datasets. Naive methods – like applying MLPs to flattened parameters – perform poorly, making the design of better weight-space architectures a central challenge. While prior work leveraged permutation symmetries in standard networks to guide such designs, no analogous analysis or tailored architecture yet exists for Kolmogorov-Arnold Networks (KANs). In this work, we show that KANs share the same permutation symmetries as MLPs, and propose the KAN-graph, a graph representation of their computation. Building on this, we develop WS-KAN, the first weight-space architecture that learns on KANs, which naturally accounts for their symmetry. We analyze WS-KAN's expressive power, showing it can replicate an input KAN's forward pass - a standard approach for assessing expressiveness in weight-space architectures. We construct a comprehensive 'zoo' of trained KANs spanning diverse tasks, which we use as benchmarks to empirically evaluate WS-KAN. Across all tasks, WS-KAN consistently outperforms structure-agnostic baselines, often by a substantial margin. Our code is available at this https URL.
[AI-24] The Weight of a Bit: EMFI Sensitivity Analysis of Embedded Deep Learning Models
【Quick Read】: This paper addresses the unevaluated question of how different number representations affect the vulnerability of embedded neural network models to electromagnetic fault injection (EMFI) attacks. The key to the solution is an experimental comparison of four typical representations (32-bit float, 16-bit float, 8-bit integer, and 4-bit integer) on embedded image classifiers (ResNet-18/34/50 and VGG-11): integer representations, especially 8-bit, resist EMFI markedly better than floating point, sustaining high accuracy after a single fault injection (e.g., VGG-11 keeps around 70% Top-1 accuracy at 8 bits), revealing a trade-off between numerical precision and model robustness.
Link: https://arxiv.org/abs/2602.16309
Authors: Jakub Breier,Štefan Kučerák,Xiaolu Hou
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Fault injection attacks on embedded neural network models have been shown as a potent threat. Numerous works studied resilience of models from various points of view. As of now, there is no comprehensive study that would evaluate the influence of number representations used for model parameters against electromagnetic fault injection (EMFI) attacks. In this paper, we investigate how four different number representations influence the success of an EMFI attack on embedded neural network models. We chose two common floating-point representations (32-bit, and 16-bit), and two integer representations (8-bit, and 4-bit). We deployed four common image classifiers, ResNet-18, ResNet-34, ResNet-50, and VGG-11, on an embedded memory chip, and utilized a low-cost EMFI platform to trigger faults. Our results show that while floating-point representations exhibit almost a complete degradation in accuracy (Top-1 and Top-5) after a single fault injection, integer representations offer better resistance overall. Especially, when considering the 8-bit representation on a relatively large network (VGG-11), the Top-1 accuracies stay at around 70% and the Top-5 at around 90%.
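The asymmetry between representations is easy to reproduce in software before touching hardware: a single flip of a high exponent bit in a float32 weight changes its magnitude by dozens of orders, while any single-bit flip in an int8 quantized weight stays inside the representable range. The sketch below models only the fault's effect on a stored value (the quantization scale 0.01 is an arbitrary assumption), not the EMFI mechanism itself.

```python
import struct

def flip_bit_f32(x: float, bit: int) -> float:
    """Flip one bit (0..31) of an IEEE-754 float32 value."""
    (i,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", i ^ (1 << bit)))
    return y

def flip_bit_i8(q: int, bit: int) -> int:
    """Flip one bit (0..7) of a signed 8-bit quantized weight."""
    u = (q & 0xFF) ^ (1 << bit)        # two's-complement byte with bit flipped
    return u - 256 if u >= 128 else u  # reinterpret as signed int8

w = 0.04
print(flip_bit_f32(w, 30))                     # exponent flip: ~1.4e37, catastrophic
print(flip_bit_i8(round(w / 0.01), 7) * 0.01)  # int8, scale 0.01: error stays bounded
```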
[AI-25] Multi-agent cooperation through in-context co-player inference
【Quick Read】: This paper addresses the fundamental challenge of achieving cooperation among self-interested agents in multi-agent reinforcement learning (MARL). Traditional approaches rely on hardcoded assumptions about co-players' learning rules or enforce a strict fast/slow timescale separation (distinguishing "naive learners" from "meta-learners"), limiting flexibility and generality. The key to the solution is exploiting the in-context learning capability of sequence models, which enables awareness of co-player learning dynamics without explicit assumptions or timescale separation. Training sequence-model agents against a diverse distribution of co-players naturally induces in-context best-response strategies that adapt rapidly within a single episode; the resulting vulnerability to extortion creates mutual shaping pressure that ultimately gives rise to learned cooperative behavior.
Link: https://arxiv.org/abs/2602.16301
Authors: Marissa A. Weis,Maciej Wołczyk,Rajai Nasser,Rif A. Saurous,Blaise Agüera y Arcas,João Sacramento,Alexander Meulemans
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 26 pages, 4 figures
Abstract:Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules or enforce a strict separation between "naive learners" updating on fast timescales and "meta-learners" observing these updates. Here, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without requiring hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, effectively functioning as learning algorithms on the fast intra-episode timescale. We find that the cooperative mechanism identified in prior work, where vulnerability to extortion drives mutual shaping, emerges naturally in this setting: in-context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent's in-context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models combined with co-player diversity provides a scalable path to learning cooperative behaviors.
[AI-26] Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents
【Quick Read】: This paper addresses the high cost and poor scalability of benchmark methods for interactive large language model (LLM) agents that rely on fully deterministic backends. Existing benchmarks (such as tau-bench and AppWorld) provide stable evaluation but are complex to build and iterate, and struggle to serve the dynamic needs of industrial multi-turn dialogue and multi-step tool-calling scenarios. The key to the solution is Proxy State-Based Evaluation: an LLM-driven simulation framework that, without a deterministic database, uses an LLM state tracker to infer a structured proxy state from the full interaction trace, while LLM judges verify goal completion and detect tool or user hallucinations before acceptance, enabling reliable final-state-oriented evaluation. The method yields stable, model-differentiating rankings, produces on-policy data usable for training, supports sensitivity analyses over user personas, and maintains human-LLM judge agreement above 90%, demonstrating reliable and scalable automated evaluation.
Link: https://arxiv.org/abs/2602.16246
Authors: Yun-Shiuan Chuang,Chaitanya Kulkarni,Alec Chiu,Avinash Thangali,Zijie Pan,Shivani Shekhar,Yirou Ge,Yixi Li,Uma Kona,Linsey Pang,Prakhar Mehrotra
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across families and inference-time reasoning efforts, and its on-/off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.
[AI-27] UCTECG-Net: Uncertainty-aware Convolution Transformer ECG Network for Arrhythmia Detection
【Quick Read】: This paper addresses the insufficient reliability of deep-learning predictions for automated electrocardiogram (ECG) classification, which limits use in safety-critical settings. The key to the solution is UCTECG-Net, an uncertainty-aware hybrid architecture that combines 1D convolutional networks (1D CNN) with Transformer encoders to jointly process raw ECG signals and their spectrograms, improving classification performance; three uncertainty quantification methods (Monte Carlo Dropout, Deep Ensembles, and Ensemble Monte Carlo Dropout) are integrated to model predictive confidence, and an uncertainty-aware confusion matrix with derived metrics is used to analyze the reliability of model outputs. The results show that UCTECG-Net outperforms LSTM, CNN1D, and pure Transformer baselines on uncertainty estimation, and that with Deep Ensembles or EMCD it produces more reliable uncertainty estimates that align better with actual predictive behavior.
Link: https://arxiv.org/abs/2602.16216
Authors: Hamzeh Asgharnezhad,Pegah Tabarisaadi,Abbas Khosravi,Roohallah Alizadehsani,U. Rajendra Acharya
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep learning has improved automated electrocardiogram (ECG) classification, but limited insight into prediction reliability hinders its use in safety-critical settings. This paper proposes UCTECG-Net, an uncertainty-aware hybrid architecture that combines one-dimensional convolutions and Transformer encoders to process raw ECG signals and their spectrograms jointly. Evaluated on the MIT-BIH Arrhythmia and PTB Diagnostic datasets, UCTECG-Net outperforms LSTM, CNN1D, and Transformer baselines in terms of accuracy, precision, recall and F1 score, achieving up to 98.58% accuracy on MIT-BIH and 99.14% on PTB. To assess predictive reliability, we integrate three uncertainty quantification methods (Monte Carlo Dropout, Deep Ensembles, and Ensemble Monte Carlo Dropout) into all models and analyze their behavior using an uncertainty-aware confusion matrix and derived metrics. The results show that UCTECG-Net, particularly with Ensemble or EMCD, provides more reliable and better-aligned uncertainty estimates than competing architectures, offering a stronger basis for risk-aware ECG decision support.
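Of the three uncertainty quantification methods, Monte Carlo Dropout is the simplest to sketch. The snippet below shows the generic recipe (assuming a dropout-equipped classifier without batch norm), not UCTECG-Net's specific configuration; `model` and the pass count are assumptions.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, passes: int = 30):
    """Monte Carlo Dropout: keep dropout stochastic at test time, average
    the softmax over several passes, and use predictive entropy as uncertainty.
    (Assumes no batch norm; otherwise toggle only the dropout modules.)
    """
    model.train()  # activates dropout; no parameters are updated under no_grad
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(passes)])
    mean = probs.mean(dim=0)                                 # predictive distribution
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(-1)  # per-sample uncertainty
    return mean, entropy  # high entropy -> defer the ECG to a clinician
```

Deep Ensembles follow the same aggregation pattern, but average over independently trained models instead of stochastic passes of one model.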
[AI-28] Geometric Neural Operators via Lie Group-Constrained Latent Dynamics
【Quick Read】: This paper addresses the instability of existing neural operators in multi-layer iteration and long-horizon rollout prediction, which stems from unconstrained Euclidean latent-space updates that violate the system's geometric structure and conservation laws. The key to the solution is a Lie-group-based low-rank Lie-algebra parameterization that constrains the manifold structure by performing group-action updates on the latent representation, providing a geometric inductive bias for neural operators. The method, named MCL (Manifold Constraining based on Lie group), works as a plug-and-play module for existing neural operator architectures, lowering relative prediction error by 30-50% at the cost of only a 2.26% parameter increase and improving long-horizon prediction fidelity.
Link: https://arxiv.org/abs/2602.16209
Authors: Jiaquan Zhang,Fachrina Dewi Puspitasari,Songbo Zhang,Yibei Liu,Kuien Liu,Caiyan Qin,Fan Mo,Peng Wang,Yang Yang,Chaoning Zhang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Neural operators offer an effective framework for learning solutions of partial differential equations for many physical systems in a resolution-invariant and data-driven manner. Existing neural operators, however, often suffer from instability in multi-layer iteration and long-horizon rollout, which stems from the unconstrained Euclidean latent space updates that violate the geometric and conservation laws. To address this challenge, we propose to constrain manifolds with low-rank Lie algebra parameterization that performs group action updates on the latent representation. Our method, termed Manifold Constraining based on Lie group (MCL), acts as an efficient \emphplug-and-play module that enforces geometric inductive bias to existing neural operators. Extensive experiments on various partial differential equations, such as 1-D Burgers and 2-D Navier-Stokes, over a wide range of parameters and steps demonstrate that our method effectively lowers the relative prediction error by 30-50% at the cost of 2.26% of parameter increase. The results show that our approach provides a scalable solution for improving long-term prediction fidelity by addressing the principled geometric constraints absent in the neural operator updates.
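One way to picture a group-action latent update, assuming skew-symmetric (rotation-generating) low-rank generators, is sketched below; the paper's exact parameterization may differ, and the skew construction is chosen here only to make the geometric constraint (norm preservation) explicit.

```python
import torch
import torch.nn as nn

class LieLatentUpdate(nn.Module):
    """Latent update as a group action: z' = exp(A) z with A = UV^T - VU^T.

    A is skew-symmetric by construction, so exp(A) is a rotation and every
    update preserves the latent norm -- one concrete geometric constraint.
    """
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(dim, rank))
        self.v = nn.Parameter(0.01 * torch.randn(dim, rank))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        a = self.u @ self.v.T
        rot = torch.matrix_exp(a - a.T)  # orthogonal: rot @ rot.T = I
        return z @ rot.T

z = torch.randn(8, 32)
z2 = LieLatentUpdate(32)(z)
print(torch.allclose(z.norm(dim=-1), z2.norm(dim=-1), atol=1e-5))  # True
```

An unconstrained linear layer in the same position would let latent norms drift freely across rollout steps, which is the failure mode the abstract attributes to instability.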
[AI-29] Temporal Panel Selection in Ongoing Citizens Assemblies AAMAS2026
【Quick Read】: This paper addresses how to achieve fair representation over time in permanent citizens' assemblies, that is, in temporal sortition over multiple rounds, guaranteeing proportional group representation within each panel while ensuring that different groups receive sustained and fair coverage across the whole sequence of panels. The key lies in a unified temporal representation framework: any initial segment of the panel sequence, viewed as a cumulative whole, must proportionally reflect the structure of the population, while individual fairness (equal selection probability for every citizen) is maintained. Modeling the population in a metric space, the authors present algorithms with provable proportional-representation guarantees for individual panels and for sequences of panels, all while preserving individual fairness.
Link: https://arxiv.org/abs/2602.16194
Authors: Yusuf Hakan Kalayci,Evi Micha
Institutions: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: 20 pages, 2 figures, Accepted to AAMAS 2026
Abstract:Permanent citizens’ assemblies are ongoing deliberative bodies composed of randomly selected citizens, organized into panels that rotate over time. Unlike one-off panels, which represent the population in a single snapshot, permanent assemblies enable shifting participation across multiple rounds. This structure offers a powerful framework for ensuring that different groups of individuals are represented over time across successive panels. In particular, it allows smaller groups of individuals that may not warrant representation in every individual panel to be represented across a sequence of them. We formalize this temporal sortition framework by requiring proportional representation both within each individual panel and across the sequence of panels. Building on the work of Ebadian and Micha (2025), we consider a setting in which the population lies in a metric space, and the goal is to achieve both proportional representation, ensuring that every group of citizens receives adequate representation, and individual fairness, ensuring that each individual has an equal probability of being selected. We extend the notion of representation to a temporal setting by requiring that every initial segment of the panel sequence, viewed as a cumulative whole, proportionally reflects the structure of the population. We present algorithms that provide varying guarantees of proportional representation, both within individual panels and across any sequence of panels, while also maintaining individual fairness over time.
[AI-30] Rethinking Input Domains in Physics-Informed Neural Networks via Geometric Compactification Mappings
【Quick Read】: This paper addresses the gradient stiffness and ill-conditioning that arise when physics-informed neural networks (PINNs) solve multi-scale partial differential equations (PDEs) with fixed-coordinate inputs, which severely hinders convergence. The key to the solution is a Geometric Compactification (GC) mapping paradigm that reshapes input coordinates through differentiable compactification maps and couples the geometric structure of the PDE with the spectral properties of the residual operator; on this basis, three mapping strategies are designed for periodic boundaries, far-field scale expansion, and localized singular structures, achieving more uniform residual distributions, higher solution accuracy, and faster, more stable training without modifying the underlying PINN architecture.
Link: https://arxiv.org/abs/2602.16193
Authors: Zhenzhen Huang,Haoyu Bian,Jiaquan Zhang,Yibei Liu,Kuien Liu,Caiyan Qin,Guoqing Wang,Yang Yang,Chaoning Zhang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Several complex physical systems are governed by multi-scale partial differential equations (PDEs) that exhibit both smooth low-frequency components and localized high-frequency structures. Existing physics-informed neural network (PINN) methods typically train with fixed coordinate system inputs, where geometric misalignment with these structures induces gradient stiffness and ill-conditioning that hinder convergence. To address this issue, we introduce a mapping paradigm that reshapes the input coordinates through differentiable geometric compactification mappings and couples the geometric structure of PDEs with the spectral properties of residual operators. Based on this paradigm, we propose Geometric Compactification (GC)-PINN, a framework that introduces three mapping strategies for periodic boundaries, far-field scale expansion, and localized singular structures in the input domain without modifying the underlying PINN architecture. Extensive empirical evaluation demonstrates that this approach yields more uniform residual distributions and higher solution accuracy on representative 1D and 2D PDEs, while improving training stability and convergence speed.
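Illustrative instances of the three mapping families (periodic boundary, far-field expansion, localized singularity) might look as follows; the specific functional forms are assumptions for exposition, chosen only to be differentiable and aligned with the named structures, not the paper's exact maps.

```python
import torch

def periodic_map(x: torch.Tensor, period: float) -> torch.Tensor:
    """Embed a periodic coordinate on the circle; periodicity becomes exact."""
    theta = 2.0 * torch.pi * x / period
    return torch.stack([torch.sin(theta), torch.cos(theta)], dim=-1)

def farfield_map(x: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Compactify an unbounded axis into (-1, 1), absorbing far-field scales."""
    return torch.tanh(x / scale)

def singular_map(x: torch.Tensor, x0: float = 0.0, eps: float = 1e-2) -> torch.Tensor:
    """Stretch the neighbourhood of a localized (near-singular) structure at x0."""
    return torch.asinh((x - x0) / eps)

# The PINN itself is unchanged: transform collocation points first, then feed
# the mapped coordinates to the network. Since every map is differentiable,
# PDE residuals are still computed w.r.t. the original x via autograd.
```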
[AI-31] Revolutionizing Long-Term Memory in AI: New Horizons with High-Capacity and High-Speed Storage
【Quick Read】: This paper addresses the knowledge loss caused by how information is extracted and stored on current paths toward artificial superintelligence (ASI). The core problem is that the dominant "extract then store" paradigm discards raw experience after pre-emptive filtering, losing information that may prove valuable for different tasks. The key proposal is the "store then on-demand extract" approach: retain the complete raw experience data and flexibly retrieve and extract useful information when needed, avoiding information loss. The paper also emphasizes mining large collections of probabilistic experiences to discover deeper insights, and sharing stored experiences to improve collection efficiency, directions that are intuitively effective but remain to be studied in depth.
Link: https://arxiv.org/abs/2602.16192
Authors: Hiroaki Yamanaka,Daisuke Miyashita,Takashi Toi,Asuka Maki,Taiga Ikeda,Jun Deguchi
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 5 figures
Abstract:Driven by our mission of “uplifting the world with memory,” this paper explores the design concept of “memory” that is essential for achieving artificial superintelligence (ASI). Rather than proposing novel methods, we focus on several alternative approaches whose potential benefits are widely imaginable, yet have remained largely unexplored. The currently dominant paradigm, which can be termed “extract then store,” involves extracting information judged to be useful from experiences and saving only the extracted content. However, this approach inherently risks the loss of information, as some valuable knowledge particularly for different tasks may be discarded in the extraction process. In contrast, we emphasize the “store then on-demand extract” approach, which seeks to retain raw experiences and flexibly apply them to various tasks as needed, thus avoiding such information loss. In addition, we highlight two further approaches: discovering deeper insights from large collections of probabilistic experiences, and improving experience collection efficiency by sharing stored experiences. While these approaches seem intuitively effective, our simple experiments demonstrate that this is indeed the case. Finally, we discuss major challenges that have limited investigation into these promising directions and propose research topics to address them.
[AI-32] SIT-LMPC: Safe Information-Theoretic Learning Model Predictive Control for Iterative Tasks ICRA2026
【Quick Read】: This paper addresses how robots executing iterative tasks in complex, uncertain environments can achieve robust, high-performance control while guaranteeing constraint safety; the core difficulty is balancing safety, robustness, and optimality when solving infinite-horizon optimal control problems for discrete-time nonlinear stochastic systems. The key to the solution is the Safe Information-Theoretic Learning Model Predictive Control (SIT-LMPC) algorithm: an iterative control framework is built on information-theoretic model predictive control with an adaptive penalty mechanism that enforces safety constraints while balancing optimality; trajectories from previous iterations are used to learn a value function via normalizing flows, enabling richer uncertainty modeling than Gaussian priors; and the algorithm is designed for highly parallel execution on graphics processing units (GPUs), supporting real-time optimization. Experiments show that SIT-LMPC continually improves system performance across iterations while robustly satisfying constraints.
Link: https://arxiv.org/abs/2602.16187
Authors: Zirui Zang,Ahmad Amine,Nick-Marios T. Kokolakis,Truong X. Nghiem,Ugo Rosolia,Rahul Mangharam
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 8 pages, 5 figures. Published in IEEE RA-L, vol. 11, no. 1, Jan. 2026. Presented at ICRA 2026
Abstract:Robots executing iterative tasks in complex, uncertain environments require control strategies that balance robustness, safety, and high performance. This paper introduces a safe information-theoretic learning model predictive control (SIT-LMPC) algorithm for iterative tasks. Specifically, we design an iterative control framework based on an information-theoretic model predictive control algorithm to address a constrained infinite-horizon optimal control problem for discrete-time nonlinear stochastic systems. An adaptive penalty method is developed to ensure safety while balancing optimality. Trajectories from previous iterations are utilized to learn a value function using normalizing flows, which enables richer uncertainty modeling compared to Gaussian priors. SIT-LMPC is designed for highly parallel execution on graphics processing units, allowing efficient real-time optimization. Benchmark simulations and hardware experiments demonstrate that SIT-LMPC iteratively improves system performance while robustly satisfying system constraints.
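For readers unfamiliar with information-theoretic MPC, here is a toy sketch of an MPPI-style sampling controller with a fixed safety penalty on a 1D double integrator. SIT-LMPC's adaptive penalty, learned value function, and GPU parallelism are not modeled; all weights and the task are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, K, lam, dt = 20, 256, 1.0, 0.1   # horizon, rollouts, temperature, step
PENALTY = 100.0                      # adaptive in the paper; fixed here

def rollout_cost(u_seq):
    """Cost of one control sequence on a 1D double integrator that
    should reach x = 1 while respecting the soft bound |x| <= 2."""
    x = v = 0.0
    cost = 0.0
    for u in u_seq:
        v += u * dt
        x += v * dt
        cost += (x - 1.0) ** 2 + 0.01 * u ** 2
        cost += PENALTY * max(0.0, abs(x) - 2.0) ** 2
    return cost

u_nom = np.zeros(H)
for _ in range(30):                          # MPPI iterations
    eps = rng.normal(0.0, 0.5, size=(K, H))  # sampled control perturbations
    costs = np.array([rollout_cost(u_nom + e) for e in eps])
    w = np.exp(-(costs - costs.min()) / lam) # information-theoretic weights
    u_nom += (w[:, None] * eps).sum(axis=0) / w.sum()

print("first planned control:", round(u_nom[0], 3))
```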
[AI-33] EnterpriseGym Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
Quick Read: This paper addresses the problem that capabilities AI agents acquire in training environments often fail to generalize to real-world tasks. The core challenge is that existing environments lack sufficient complexity and realism, so models perform well in-distribution but cannot adapt to new scenarios. The key to the solution is Corecraft, a high-fidelity enterprise reinforcement learning environment simulating a customer support organization with over 2,500 entities and 23 tools, featuring task-centric world building, expert-authored rubrics for reliable reward computation, and workflows that reflect real professional processes. Training GLM-4.6 in this environment (with Group Relative Policy Optimization and adaptive clipping) substantially improves the task pass rate after a single training epoch and yields cross-domain gains on several out-of-distribution benchmarks, validating the importance of environment quality, diversity, and realism for generalizable agent capabilities.
Link: https://arxiv.org/abs/2602.16179
Authors: Sushant Mehta, Logan Ritchie, Suhaas Garre, Nick Heiner, Edwin Chen
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce Corecraft, the first environment in EnterpriseGym, Surge AI’s suite of agentic RL environments. Corecraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM-4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on τ²-Bench Retail, and +6.8% on Toolathlon (Pass@1). We believe three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.
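A small sketch of the group-relative advantage and clipped surrogate used by GRPO-style training; the paper's adaptive clipping schedule is not specified, so a fixed `eps` stands in for it:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group Relative Policy Optimization: advantages are rewards
    standardized within a group of rollouts for the same task."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_objective(ratio, adv, eps):
    """PPO-style clipped surrogate; eps would be adapted per step."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]  # rubric pass/fail per rollout
adv = grpo_advantages(group_rewards)
ratios = np.array([1.2, 0.8, 1.05, 0.95, 1.3, 0.7])  # new/old policy prob ratios
print(clipped_objective(ratios, adv, eps=0.2).mean())
```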
[AI-34] Edge Learning via Federated Split Decision Transformers for Metaverse Resource Allocation
Quick Read: This paper addresses the tension between the quality-of-experience (QoE) demands of virtual reality (VR) users and stringent latency constraints in mobile edge computing (MEC), especially in heterogeneous multi-radio-access-technology environments where conventional federated learning (FL) degrades due to full-model parameter transmission and coarse-grained global aggregation. The key to the solution is the Federated Split Decision Transformer (FSDT), an offline reinforcement learning framework that partitions the transformer between MEC servers and the cloud: agent-specific components (e.g., embedding and prediction layers) remain on the MEC side for local adaptability, while shared global layers in the cloud enable cooperative training across MEC servers. Experiments show FSDT improves QoE by up to 10% in heterogeneous environments while offloading nearly 98% of the model parameters to the cloud, substantially reducing the computational burden on MEC servers.
Link: https://arxiv.org/abs/2602.16174
Authors: Fatih Temiz, Shavbo Salehi, Melike Erol-Kantarci
Institutions: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 6 pages, 4 figures, Accepted paper at IEEE International Conference on Communications (ICC) 2026
Abstract:Mobile edge computing (MEC) based wireless metaverse services offer an untethered, immersive experience to users, where the superior quality of experience (QoE) needs to be achieved under stringent latency constraints and visual quality demands. To achieve this, MEC-based intelligent resource allocation for virtual reality users needs to be supported by coordination across MEC servers to harness distributed data. Federated learning (FL) is a promising solution, and can be combined with reinforcement learning (RL) to develop generalized policies across MEC servers. However, conventional FL requires transmitting the full model parameters between the MEC servers and the cloud, and suffers performance degradation due to naive global aggregation, especially in heterogeneous multi-radio access technology environments. To address these challenges, this paper proposes Federated Split Decision Transformer (FSDT), an offline RL framework where the transformer model is partitioned between MEC servers and the cloud. Agent-specific components (e.g., MEC-based embedding and prediction layers) enable local adaptability, while shared global layers in the cloud facilitate cooperative training across MEC servers. Experimental results demonstrate that FSDT enhances QoE by up to 10% in heterogeneous environments compared to baselines, while offloading nearly 98% of the transformer model parameters to the cloud, thereby reducing the computational burden on MEC servers.
[AI-35] HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents
Quick Read: This paper tackles the unstable credit assignment and inefficient optimization that arise when large language models (LLMs) act as multi-turn interactive agents in long-horizon tasks with sparse, delayed rewards. Existing reinforcement learning (RL) methods typically model the LLM as a flat policy operating at a single time scale, which struggles to propagate reward signals across an entire trajectory. The authors propose HiPER (Hierarchical Plan-Execute Reinforcement Learning), whose core innovation is to explicitly factorize the policy into a high-level planner and a low-level executor and to perform unbiased, variance-reduced gradient estimation at both levels via a technique called Hierarchical Advantage Estimation (HAE). HAE aggregates returns over the execution of each subgoal and coordinates updates across the two levels, markedly improving training stability and performance on long-horizon tasks, especially those with multiple dependent subtasks.
Link: https://arxiv.org/abs/2602.16165
Authors: Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse-reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan-Execute RL framework that explicitly separates high-level planning from low-level execution. HiPER factorizes the policy into a high-level planner that proposes subgoals and a low-level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key technique called hierarchical advantage estimation (HAE), which carefully assigns credit at both the planning and execution levels. By aggregating returns over the execution of each subgoal and coordinating updates across the two levels, HAE provides an unbiased gradient estimator and provably reduces variance compared to flat generalized advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct (+6.6% and +8.3% over the best prior method), with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.
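A hedged sketch of the two-level credit-assignment idea: step rewards are aggregated into per-subgoal returns for the planner, while the executor keeps baseline-subtracted within-subgoal returns. This is an illustrative simplification, not the paper's HAE estimator:

```python
import numpy as np

def hierarchical_advantages(rewards, subgoal_bounds):
    """Two-level credit assignment over one episode.
    subgoal_bounds: list of (start, end) step indices, end exclusive."""
    seg_returns = np.array([sum(rewards[s:e]) for s, e in subgoal_bounds], float)
    high_adv = seg_returns - seg_returns.mean()       # planner-level signal
    low_adv = []
    for s, e in subgoal_bounds:
        togo = np.array([sum(rewards[i:e]) for i in range(s, e)], float)
        low_adv.append(togo - togo.mean())            # executor-level signal
    return high_adv, low_adv

rewards = [0, 0.5, 1, 0, 0, 0, 2]   # sparse rewards over a 7-step episode
print(hierarchical_advantages(rewards, [(0, 3), (3, 7)]))
```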
[AI-36] Federated Graph AGI for Cross-Border Insider Threat Intelligence in Government Financial Schemes
Quick Read: This paper addresses cross-jurisdiction insider threat detection, a major challenge for government financial schemes, particularly when data is distributed across jurisdictions and privacy-sensitive. Existing methods are limited by the inability to share intelligence across borders, the lack of reasoning capabilities for complex multi-step attack patterns, and the difficulty of capturing intricate graph-structured relationships in financial networks. The key innovations of the proposed FedGraph-AGI framework are: (1) federated graph neural networks that preserve data sovereignty; (2) Mixture-of-Experts (MoE) aggregation to handle heterogeneity across jurisdictions; and (3) generative-AI-powered AGI reasoning via Large Action Models (LAM) performing causal inference over graph data. On a 50,000-transaction dataset spanning 10 jurisdictions, the approach reaches 92.3% accuracy, clearly outperforming baselines, and represents the first integration of AGI reasoning with federated graph learning, opening a new path for privacy-preserving cross-border intelligence sharing.
Link: https://arxiv.org/abs/2602.16109
Authors: Srikumar Nayak, James Walmesley
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 35 pages, 8 figures
Abstract:Cross-border insider threats pose a critical challenge to government financial schemes, particularly when dealing with distributed, privacy-sensitive data across multiple jurisdictions. Existing approaches face fundamental limitations: they cannot effectively share intelligence across borders due to privacy constraints, lack reasoning capabilities to understand complex multi-step attack patterns, and fail to capture intricate graph-structured relationships in financial networks. We introduce FedGraph-AGI, a novel federated learning framework integrating Artificial General Intelligence (AGI) reasoning with graph neural networks for privacy-preserving cross-border insider threat detection. Our approach combines: (1) federated graph neural networks preserving data sovereignty; (2) Mixture-of-Experts (MoE) aggregation for heterogeneous jurisdictions; and (3) AGI-powered reasoning via Large Action Models (LAM) performing causal inference over graph data. Through experiments on a 50,000-transaction dataset across 10 jurisdictions, FedGraph-AGI achieves 92.3% accuracy, significantly outperforming federated baselines (86.1%) and centralized approaches (84.7%). Our ablation studies reveal AGI reasoning contributes 6.8% improvement, while MoE adds 4.4%. The system maintains epsilon = 1.0 differential privacy while achieving near-optimal performance and scales efficiently to 50+ clients. This represents the first integration of AGI reasoning with federated graph learning for insider threat detection, opening new directions for privacy-preserving cross-border intelligence sharing.
[AI-37] GPSBench: Do Large Language Models Understand GPS Coordinates?
Quick Read: This paper addresses the underexplored geospatial reasoning abilities of large language models (LLMs), in particular their handling of GPS coordinates and real-world geographic knowledge. To evaluate this capability, the authors build GPSBench, a dataset of 57,800 samples across 17 tasks covering geometric coordinate operations (e.g., distance and bearing computation) and reasoning that combines coordinates with world knowledge. The key contributions are: a multi-dimensional, broad-coverage benchmark that systematically probes different levels of geographic knowledge (e.g., country vs. city) and robustness to coordinate noise; the finding that models are more reliable at real-world geographic reasoning than at geometric computation, with geographic knowledge degrading hierarchically; and evidence that GPS-coordinate augmentation can improve downstream tasks, while finetuning induces a trade-off between geometric computation and world knowledge.
Link: https://arxiv.org/abs/2602.16105
Authors: Thinh Hung Truong, Jey Han Lau, Jianzhong Qi
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, or mapping, making robust geospatial reasoning a critical capability. Despite that, LLMs’ ability to reason about GPS coordinates and real-world geography remains underexplored. We introduce GPSBench, a dataset of 57,800 samples across 17 tasks for evaluating geospatial reasoning in LLMs, spanning geometric coordinate operations (e.g., distance and bearing computation) and reasoning that integrates coordinates with world knowledge. Focusing on intrinsic model capabilities rather than tool use, we evaluate 14 state-of-the-art LLMs and find that GPS reasoning remains challenging, with substantial variation across tasks: models are generally more reliable at real-world geographic reasoning than at geometric computations. Geographic knowledge degrades hierarchically, with strong country-level performance but weak city-level localization, while robustness to coordinate noise suggests genuine coordinate understanding rather than memorization. We further show that GPS-coordinate augmentation can improve in downstream geospatial tasks, and that finetuning induces trade-offs between gains in geometric computation and degradation in world knowledge. Our dataset and reproducible code are available at this https URL
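The geometric coordinate operations the benchmark tests are standard spherical-geometry formulas; a minimal reference implementation of great-circle distance and initial bearing (the exact task formats used in GPSBench are not reproduced here):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS coordinates in kilometres."""
    R = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0

# Melbourne -> Sydney: roughly 713 km, bearing toward the northeast.
print(haversine_km(-37.81, 144.96, -33.87, 151.21))
print(initial_bearing_deg(-37.81, 144.96, -33.87, 151.21))
```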
[AI-38] ScenicRules: An Autonomous Driving Benchmark with Multi-Objective Specifications and Abstract Scenarios
Quick Read: This paper addresses the difficulty autonomous driving systems face in balancing multiple, often conflicting objectives (such as collision avoidance, traffic-rule compliance, and efficient progress), and the lack of benchmarks that combine prioritized rules with formally modeled environment scenarios. The key to the ScenicRules benchmark is to formalize a diverse set of objectives as quantitative evaluation metrics and to design a Hierarchical Rulebook framework that encodes multiple objectives and their priority relations in an interpretable and adaptable way; in addition, a compact yet representative collection of driving scenarios, formally modeled in the Scenic language and spanning diverse contexts and near-accident situations, effectively exposes agent failures with respect to the prioritized objectives.
Link: https://arxiv.org/abs/2602.16073
Authors: Kevin Kai-Chun Chang, Ekin Beyazit, Alberto Sangiovanni-Vincentelli, Tichakorn Wongpiromsarn, Sanjit A. Seshia
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Systems and Control (eess.SY)
Comments: 16 pages, 14 figures, 7 tables. Extended version of paper accepted to 2026 IEEE Intelligent Vehicles Symposium (IV 2026). ScenicRules benchmark available at this https URL
Abstract:Developing autonomous driving systems for complex traffic environments requires balancing multiple objectives, such as avoiding collisions, obeying traffic rules, and making efficient progress. In many situations, these objectives cannot be satisfied simultaneously, and explicit priority relations naturally arise. Also, driving rules require context, so it is important to formally model the environment scenarios within which such rules apply. Existing benchmarks for evaluating autonomous vehicles lack such combinations of multi-objective prioritized rules and formal environment models. In this work, we introduce ScenicRules, a benchmark for evaluating autonomous driving systems in stochastic environments under prioritized multi-objective specifications. We first formalize a diverse set of objectives to serve as quantitative evaluation metrics. Next, we design a Hierarchical Rulebook framework that encodes multiple objectives and their priority relations in an interpretable and adaptable manner. We then construct a compact yet representative collection of scenarios spanning diverse driving contexts and near-accident situations, formally modeled in the Scenic language. Experimental results show that our formalized objectives and Hierarchical Rulebooks align well with human driving judgments and that our benchmark effectively exposes agent failures with respect to the prioritized objectives. Our benchmark can be accessed at this https URL.
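One way a prioritized rulebook can be made operational is lexicographic comparison over priority tiers; the toy sketch below is our illustration, not the paper's formal framework, and the rules and scores are invented:

```python
# A toy prioritized rulebook: rules are grouped into priority tiers, and
# candidate trajectories are compared lexicographically, tier by tier.
def evaluate(trajectory, rulebook):
    """rulebook: list of tiers (highest priority first); each tier is a
    list of rule functions returning a violation score (0 = satisfied).
    Returns a tuple of per-tier violation totals for lexicographic order."""
    return tuple(sum(rule(trajectory) for rule in tier) for tier in rulebook)

no_collision = lambda t: t["collisions"]
stay_in_lane = lambda t: t["lane_departures"]
make_progress = lambda t: max(0.0, 10.0 - t["progress_m"])

rulebook = [[no_collision], [stay_in_lane], [make_progress]]
a = {"collisions": 0, "lane_departures": 1, "progress_m": 12.0}
b = {"collisions": 0, "lane_departures": 0, "progress_m": 4.0}
# b wins: it departs the lane less, even though it makes less progress.
print(min([a, b], key=lambda t: evaluate(t, rulebook)))
```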
[AI-39] Omni-iEEG: A Large-Scale Comprehensive iEEG Dataset and Benchmark for Epilepsy Research ICLR2026
Quick Read: This paper addresses the poor reproducibility, difficult cross-center validation, and weak clinical relevance of data-driven approaches to presurgical epilepsy localization, which stem from labor-intensive manual review, heterogeneous single-center datasets, and the absence of standardized annotations. The key contribution is Omni-iEEG, a large-scale presurgical intracranial EEG (iEEG) resource comprising 302 patients and 178 hours of high-resolution recordings, with harmonized formats and clinical metadata; seizure onset zones, resections, and surgical outcomes are validated by board-certified epileptologists. The dataset also provides over 36,000 expert-validated annotations of pathological events, enabling robust biomarker studies, and defines clinically meaningful tasks with unified evaluation metrics grounded in clinical priors, supporting systematic evaluation of machine learning models in realistic clinical settings and bridging algorithm development and clinical translation.
Link: https://arxiv.org/abs/2602.16072
Authors: Chenda Duan, Yipeng Zhang, Sotaro Kanai, Yuanyi Ding, Atsuro Daida, Pengyue Yu, Tiancheng Zheng, Naoto Kuroda, Shaun A. Hussain, Eishi Asano, Hiroki Nariai, Vwani Roychowdhury
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: Published as a conference paper at ICLR 2026
Abstract:Epilepsy affects over 50 million people worldwide, and one-third of patients suffer drug-resistant seizures where surgery offers the best chance of seizure freedom. Accurate localization of the epileptogenic zone (EZ) relies on intracranial EEG (iEEG). Clinical workflows, however, remain constrained by labor-intensive manual review. At the same time, existing data-driven approaches are typically developed on single-center datasets that are inconsistent in format and metadata, lack standardized benchmarks, and rarely release pathological event annotations, creating barriers to reproducibility, cross-center validation, and clinical relevance. With extensive efforts to reconcile heterogeneous iEEG formats, metadata, and recordings across publicly available sources, we present Omni-iEEG, a large-scale, pre-surgical iEEG resource comprising 302 patients and 178 hours of high-resolution recordings. The dataset includes harmonized clinical metadata such as seizure onset zones, resections, and surgical outcomes, all validated by board-certified epileptologists. In addition, Omni-iEEG provides over 36K expert-validated annotations of pathological events, enabling robust biomarker studies. Omni-iEEG serves as a bridge between machine learning and epilepsy research. It defines clinically meaningful tasks with unified evaluation metrics grounded in clinical priors, enabling systematic evaluation of models in clinically relevant settings. Beyond benchmarking, we demonstrate the potential of end-to-end modeling on long iEEG segments and highlight the transferability of representations pretrained on non-neurophysiological domains. Together, these contributions establish Omni-iEEG as a foundation for reproducible, generalizable, and clinically translatable epilepsy research. The project page with dataset and code links is available at this http URL.
[AI-40] Improving Interactive In-Context Learning from Natural Language Feedback
Quick Read: This paper addresses the inability of current large language models (LLMs) to adapt dynamically to their context in the absence of interactive feedback: prevailing training paradigms rely on static corpora and overlook the corrective feedback loops central to human learning. The core of the solution is a framework that treats interactive in-context learning as a distinct, trainable skill: single-turn verifiable tasks are transformed into multi-turn didactic interactions driven by information asymmetry, and this structured feedback is used to train models to be more plastic in context. Key findings: the approach substantially improves smaller models on hard reasoning tasks (with multi-turn performance approaching that of models an order of magnitude larger); training on math problems generalizes out of distribution to coding, puzzles, and maze navigation; and, by additionally training the model to predict the teacher's critiques, the external feedback signal becomes an internal capability, enabling self-correction without a teacher and offering a unified path toward self-improvement.
Link: https://arxiv.org/abs/2602.16066
Authors: Martin Klissarov, Jonathan Cook, Diego Antognini, Hao Sun, Jingling Li, Natasha Jaques, Claudiu Musat, Edward Grefenstette
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Adapting one’s thought process based on corrective feedback is an essential ability in human learning, particularly in collaborative settings. In contrast, the current large language model training paradigm relies heavily on modeling vast, static corpora. While effective for knowledge acquisition, it overlooks the interactive feedback loops essential for models to adapt dynamically to their context. In this work, we propose a framework that treats this interactive in-context learning ability not as an emergent property, but as a distinct, trainable skill. We introduce a scalable method that transforms single-turn verifiable tasks into multi-turn didactic interactions driven by information asymmetry. We first show that current flagship models struggle to integrate corrective feedback on hard reasoning tasks. We then demonstrate that models trained with our approach dramatically improve the ability to interactively learn from language feedback. More specifically, the multi-turn performance of a smaller model nearly reaches that of a model an order of magnitude larger. We also observe robust out-of-distribution generalization: interactive training on math problems transfers to diverse domains like coding, puzzles and maze navigation. Our qualitative analysis suggests that this improvement is due to an enhanced in-context plasticity. Finally, we show that this paradigm offers a unified path to self-improvement. By training the model to predict the teacher’s critiques, effectively modeling the feedback environment, we convert this external signal into an internal capability, allowing the model to self-correct even without a teacher.
[AI-41] Can Generative Artificial Intelligence Survive Data Contamination? Theoretical Guarantees under Contaminated Recursive Training
Quick Read: This paper studies the degradation that data contamination can cause during the recursive training of generative AI, where later models are trained on mixtures containing content generated by earlier versions, potentially leading to model collapse. The key to the solution is a general theoretical framework that makes minimal assumptions on the real data distribution and allows the generative model to be an arbitrary universal approximator; within this framework, the authors rigorously prove that contaminated recursive training still converges, at a rate equal to the minimum of the baseline model's convergence rate and the fraction of real data used in each iteration. This is, to the authors' knowledge, the first positive theoretical result on recursive training without distributional assumptions on the data, and empirical studies confirm the robustness of the conclusions.
Link: https://arxiv.org/abs/2602.16065
Authors: Kevin Wang, Hongqian Niu, Didong Li
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:
Abstract:Generative Artificial Intelligence (AI), such as large language models (LLMs), has become a transformative force across science, industry, and society. As these systems grow in popularity, web data becomes increasingly interwoven with this AI-generated material and it is increasingly difficult to separate them from naturally generated content. As generative models are updated regularly, later models will inevitably be trained on mixtures of human-generated data and AI-generated data from earlier versions, creating a recursive training process with data contamination. Existing theoretical work has examined only highly simplified settings, where both the real data and the generative model are discrete or Gaussian, where it has been shown that such recursive training leads to model collapse. However, real data distributions are far more complex, and modern generative models are far more flexible than Gaussian and linear mechanisms. To fill this gap, we study recursive training in a general framework with minimal assumptions on the real data distribution and allow the underlying generative model to be a general universal approximator. In this framework, we show that contaminated recursive training still converges, with a convergence rate equal to the minimum of the baseline model’s convergence rate and the fraction of real data used in each iteration. To the best of our knowledge, this is the first (positive) theoretical result on recursive training without distributional assumptions on the data. We further extend the analysis to settings where sampling bias is present in data collection and support all theoretical results with empirical studies.
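A toy analogue of the contaminated recursion analyzed in the paper: a "model" that just estimates a Gaussian mean is retrained each generation on a mixture of real data (fraction alpha) and samples from its previous self. The setup is ours, not the paper's, but it mirrors the claimed dependence on the real-data fraction:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, n, alpha = 3.0, 1000, 0.3   # alpha: fraction of real data per round

model_mu = 0.0                        # "generation 0" model parameter
for gen in range(1, 11):
    real = rng.normal(mu_true, 1.0, int(alpha * n))
    synthetic = rng.normal(model_mu, 1.0, n - len(real))  # prior model's output
    model_mu = np.concatenate([real, synthetic]).mean()   # retrain on the mix
    print(f"gen {gen}: estimated mu = {model_mu:.3f}")
# With alpha > 0 the estimate converges toward mu_true; with alpha = 0 the
# loop is purely self-consuming and the estimate drifts as a random walk.
```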
[AI-42] AI-CARE: Carbon-Aware Reporting Evaluation Metric for AI Models
Quick Read: This paper addresses the neglect of energy consumption and carbon emissions in current machine learning (ML) evaluation: existing benchmarks focus on single performance metrics such as accuracy, BLEU, or mAP and fail to reflect environmental impact at deployment time, which is increasingly misaligned with energy-constrained settings such as mobile devices, developing regions, and climate-aware enterprises. The key to the solution is AI-CARE, a tool for reporting the energy consumption and carbon emissions of ML models, together with a carbon-performance tradeoff curve that visualizes the Pareto frontier between performance and carbon cost. Theoretical analysis and empirical validation show that carbon-aware benchmarking can change the relative ranking of models and encourages architectures that are simultaneously accurate and low-carbon, pushing the community toward transparent, multi-objective evaluation aligned with global sustainability goals.
Link: https://arxiv.org/abs/2602.16042
Authors: KC Santosh, Srikanth Baride, Rodrigue Rizk
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures
Abstract:As machine learning (ML) continues its rapid expansion, the environmental cost of model training and inference has become a critical societal concern. Existing benchmarks overwhelmingly focus on standard performance metrics such as accuracy, BLEU, or mAP, while largely ignoring energy consumption and carbon emissions. This single-objective evaluation paradigm is increasingly misaligned with the practical requirements of large-scale deployment, particularly in energy-constrained environments such as mobile devices, developing regions, and climate-aware enterprises. In this paper, we propose AI-CARE, an evaluation tool for reporting energy consumption, and carbon emissions of ML models. In addition, we introduce the carbon-performance tradeoff curve, an interpretable tool that visualizes the Pareto frontier between performance and carbon cost. We demonstrate, through theoretical analysis and empirical validation on representative ML workloads, that carbon-aware benchmarking changes the relative ranking of models and encourages architectures that are simultaneously accurate and environmentally responsible. Our proposal aims to shift the research community toward transparent, multi-objective evaluation and align ML progress with global sustainability goals. The tool and documentation are available at this https URL.
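A carbon-performance tradeoff curve is, at its core, a Pareto frontier over (error, emissions) pairs; a minimal sketch with made-up model numbers (the tool's actual metric definitions are not reproduced here):

```python
def pareto_frontier(models):
    """models: list of (name, error, kg_co2). Keep models not dominated
    by any other (lower error AND lower carbon)."""
    front = []
    for name, err, co2 in models:
        dominated = any(e <= err and c <= co2 and (e, c) != (err, co2)
                        for _, e, c in models)
        if not dominated:
            front.append((name, err, co2))
    return sorted(front, key=lambda m: m[2])

models = [("small", 0.12, 0.4), ("medium", 0.09, 1.1),
          ("large", 0.08, 6.0), ("wasteful", 0.10, 7.5)]
print(pareto_frontier(models))  # "wasteful" drops off the frontier
```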
[AI-43] How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment
Quick Read: This paper addresses the reliability problems that output uncertainty poses for LLM-based automatic assessment in education. Because large language models (LLMs) are inherently probabilistic, their grades can be unstable, which in turn undermines the accuracy and effectiveness of downstream pedagogical interventions. The key to the solution is a systematic benchmark of uncertainty quantification (UQ) methods: through comprehensive analyses across multiple assessment datasets, LLM families, and generation control settings, the study characterizes the uncertainty patterns LLMs exhibit in grading, evaluates the strengths and limitations of different uncertainty metrics, and analyzes the influence of key factors such as model family, assessment task, and decoding strategy, providing empirical evidence and practical guidance for building more reliable, uncertainty-aware grading systems.
Link: https://arxiv.org/abs/2602.16039
Authors: Hang Li, Kaiqi Yang, Xianxuan Long, Fedor Filippov, Yucheng Chu, Yasemin Copur-Gencturk, Peng He, Cory Miller, Namsoo Shin, Joseph Krajcik, Hui Liu, Jiliang Tang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid rise of large language models (LLMs) is reshaping the landscape of automatic assessment in education. While these systems demonstrate substantial advantages in adaptability to diverse question types and flexibility in output formats, they also introduce new challenges related to output uncertainty, stemming from the inherently probabilistic nature of LLMs. Output uncertainty is an inescapable challenge in automatic assessment, as assessment results often play a critical role in informing subsequent pedagogical actions, such as providing feedback to students or guiding instructional decisions. Unreliable or poorly calibrated uncertainty estimates can lead to unstable downstream interventions, potentially disrupting students’ learning processes and resulting in unintended negative consequences. To systematically understand this challenge and inform future research, we benchmark a broad range of uncertainty quantification methods in the context of LLM-based automatic assessment. Although the effectiveness of these methods has been demonstrated in many tasks across other domains, their applicability and reliability in educational settings, particularly for automatic grading, remain underexplored. Through comprehensive analyses of uncertainty behaviors across multiple assessment datasets, LLM families, and generation control settings, we characterize the uncertainty patterns exhibited by LLMs in grading scenarios. Based on these findings, we evaluate the strengths and limitations of different uncertainty metrics and analyze the influence of key factors, including model families, assessment tasks, and decoding strategies, on uncertainty estimates. Our study provides actionable insights into the characteristics of uncertainty in LLM-based automatic assessment and lays the groundwork for developing more reliable and effective uncertainty-aware grading systems in the future.
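One common UQ baseline in this setting is predictive entropy over repeated grades sampled at nonzero temperature; a minimal sketch (the benchmark's actual metrics are not reproduced here):

```python
import math
from collections import Counter

def predictive_entropy(sampled_grades):
    """Uncertainty of an LLM grader estimated by resampling: grade the
    same answer several times and measure the entropy of the empirical
    grade distribution (0 = fully consistent grader)."""
    counts = Counter(sampled_grades)
    n = len(sampled_grades)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(predictive_entropy(["B", "B", "B", "B"]))       # 0.0: stable grade
print(predictive_entropy(["A", "B", "A", "C", "B"]))  # high: flag for review
```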
[AI-44] Towards Efficient Constraint Handling in Neural Solvers for Routing Problems ICLR2026
Quick Read: This paper addresses the inefficiency or inapplicability of existing constraint-handling schemes (feasibility masking or implicit feasibility awareness) in neural routing solvers when faced with complex, especially hard, constraints. The key innovation of the proposed Construct-and-Refine (CaR) framework, a general and efficient constraint-handling approach, is explicit learning-based feasibility refinement: a joint training strategy guides the construction module to produce diverse, high-quality initial solutions suited to a lightweight improvement process (e.g., 10 refinement steps versus 5,000 in prior work), while a construction-improvement-shared representation unifies the encoder and enables knowledge sharing across the two paradigms, markedly improving feasibility, solution quality, and efficiency under complex constraints.
Link: https://arxiv.org/abs/2602.16012
Authors: Jieyi Bi, Zhiguang Cao, Jianan Zhou, Wen Song, Yaoxin Wu, Jie Zhang, Yining Ma, Cathy Wu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted by ICLR 2026
Abstract:Neural solvers have achieved impressive progress in addressing simple routing problems, particularly excelling in computational efficiency. However, their advantages under complex constraints remain nascent, for which current constraint-handling schemes via feasibility masking or implicit feasibility awareness can be inefficient or inapplicable for hard constraints. In this paper, we present Construct-and-Refine (CaR), the first general and efficient constraint-handling framework for neural routing solvers based on explicit learning-based feasibility refinement. Unlike prior construction-search hybrids that target reducing optimality gaps through heavy improvements yet still struggle with hard constraints, CaR achieves efficient constraint handling by designing a joint training framework that guides the construction module to generate diverse and high-quality solutions well-suited for a lightweight improvement process, e.g., 10 steps versus 5k steps in prior work. Moreover, CaR presents the first use of construction-improvement-shared representation, enabling potential knowledge sharing across paradigms by unifying the encoder, especially in more complex constrained scenarios. We evaluate CaR on typical hard routing constraints to showcase its broader applicability. Results demonstrate that CaR achieves superior feasibility, solution quality, and efficiency compared to both classical and neural state-of-the-art solvers.
[AI-45] ODYN: An All-Shifted Non-Interior-Point Method for Quadratic Programming in Robotics and AI
Quick Read: This paper addresses the efficient solution of challenging dense and sparse quadratic programming (QP) problems, in particular ill-conditioned and degenerate instances where traditional methods often fail because they assume linear independence of the constraints. The key to the solution is ODYN, a primal-dual non-interior-point QP solver that combines all-shifted nonlinear complementarity problem (NCP) functions with the proximal method of multipliers; it solves robustly without requiring linearly independent constraints, exhibits strong warm-start performance, and suits the real-time sequential optimization common in robotics and AI.
Link: https://arxiv.org/abs/2602.16005
Authors: Jose Rojas, Aristotelis Papatheodorou, Sergi Martinez, Ioannis Havoutis, Carlos Mastalli
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce ODYN, a novel all-shifted primal-dual non-interior-point quadratic programming (QP) solver designed to efficiently handle challenging dense and sparse QPs. ODYN combines all-shifted nonlinear complementarity problem (NCP) functions with proximal method of multipliers to robustly address ill-conditioned and degenerate problems, without requiring linear independence of the constraints. It exhibits strong warm-start performance and is well suited to both general-purpose optimization, and robotics and AI applications, including model-based control, estimation, and kernel-based learning methods. We provide an open-source implementation and benchmark ODYN on the Maros-Mészáros test set, demonstrating state-of-the-art convergence performance in small-to-high-scale problems. The results highlight ODYN’s superior warm-starting capabilities, which are critical in sequential and real-time settings common in robotics and AI. These advantages are further demonstrated by deploying ODYN as the backend of an SQP-based predictive control framework (OdynSQP), as the implicitly differentiable optimization layer for deep learning (ODYNLayer), and the optimizer of a contact-dynamics simulation (ODYNSim).
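For context, NCP functions encode complementarity conditions as root-finding problems; a sketch of the classic Fischer-Burmeister function and a smoothed, shifted variant (the specific all-shifted functions used in ODYN are not reproduced here, and the shift below is illustrative):

```python
import numpy as np

def fischer_burmeister(a, b):
    """Classic NCP function: phi(a, b) = sqrt(a^2 + b^2) - a - b.
    phi = 0  <=>  a >= 0, b >= 0, a*b = 0 (complementarity holds)."""
    return np.sqrt(a**2 + b**2) - a - b

def shifted_fb(a, b, eps=1e-6):
    """Smoothed variant: the shift keeps the function differentiable
    near (0, 0), which helps Newton-type solvers."""
    return np.sqrt(a**2 + b**2 + 2 * eps) - a - b

# Complementarity residual of primal slacks s and dual multipliers z:
s, z = np.array([0.0, 1.5]), np.array([2.0, 0.0])
print(fischer_burmeister(s, z))  # ~[0, 0]: both pairs are complementary
```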
[AI-46] ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization
Quick Read: This paper addresses the silent failures of large language models (LLMs) when translating natural language into optimization code: generated code may execute and return feasible solutions while encoding semantically incorrect formulations, creating a feasibility-correctness gap of up to 90 percentage points. The key to the proposed ReLoop framework is two complementary mechanisms: structured generation decomposes code production into four stages (understand, formalize, synthesize, verify) with explicit variable-type reasoning and self-verification, preventing formulation errors at the source; behavioral verification detects residual errors by testing whether the formulation responds correctly to solver-based parameter perturbation, requiring no ground truth and bypassing the self-consistency problem inherent in LLM-based code review. The two mechanisms are complementary: structured generation dominates on complex compositional problems, while behavioral verification contributes most on problems with localized formulation defects; together they raise correctness from 22.6% to 31.1% and execution from 72.1% to 100.0%.
Link: https://arxiv.org/abs/2602.15983
Authors: Junbo Jacob Lian, Yujun Sun, Huiling Chen, Chaoyu Zhang, Chung-Piaw Teo
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Code and benchmark: this https URL
Abstract:Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations, creating a feasibility-correctness gap of up to 90 percentage points on compositional problems. We introduce ReLoop, addressing silent failures from two complementary directions. Structured generation decomposes code production into a four-stage reasoning chain (understand, formalize, synthesize, verify) that mirrors expert modeling practice, with explicit variable-type reasoning and self-verification to prevent formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver-based parameter perturbation, without requiring ground truth – an external semantic signal that bypasses the self-consistency problem inherent in LLM-based code review. The two mechanisms are complementary: structured generation dominates on complex compositional problems, while behavioral verification becomes the largest single contributor on problems with localized formulation defects. Together with execution recovery via IIS-enhanced diagnostics, ReLoop raises correctness from 22.6% to 31.1% and execution from 72.1% to 100.0% on the strongest model, with consistent gains across five models spanning three paradigms (foundation, SFT, RL) and three benchmarks. We additionally release RetailOpt-190, 190 compositional retail optimization scenarios targeting the multi-constraint interactions where LLMs most frequently fail.
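A hedged illustration of behavioral verification: perturb a solver parameter and assert that the optimum moves in the direction a correct formulation must respect. The tiny LP and the monotonicity probe below are our example, not ReLoop's actual test battery (requires SciPy):

```python
from scipy.optimize import linprog

def optimum(capacity):
    """Tiny production LP: maximize 3x + 2y subject to x + y <= capacity,
    x, y >= 0 (linprog minimizes, hence the negated objective)."""
    res = linprog(c=[-3, -2], A_ub=[[1, 1]], b_ub=[capacity],
                  bounds=[(0, None), (0, None)])
    return -res.fun

base, relaxed = optimum(10.0), optimum(12.0)
# Relaxing a resource constraint must never hurt the optimal objective;
# if it did, the generated formulation would be semantically suspect.
assert relaxed >= base - 1e-9, "behavioral check failed: formulation suspect"
print(base, relaxed)  # 30.0, 36.0
```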
[AI-47] Hybrid Model Predictive Control with Physics-Informed Neural Network for Satellite Attitude Control
Quick Read: This paper addresses the performance bottleneck in spacecraft attitude control caused by inaccurate or hard-to-obtain physics models, especially in model-based strategies such as model predictive control (MPC), where the quality of the internal system model directly limits control performance. Purely data-driven system identification can learn the dynamics from data but often suffers from fragile stability and limited generalization. The key to the solution is Physics-Informed Neural Networks (PINNs): embedding prior physical knowledge as regularization during training improves predictive reliability and robustness. Experiments show a 68.17% reduction in mean relative error compared with the purely data-driven approach, better closed-loop tracking and robustness to uncertainty when deployed within MPC, and, with a hybrid formulation combining the learned nonlinear dynamics with a nominal linear model, 61.52%-76.42% faster settling under measurement noise and reaction wheel friction.
Link: https://arxiv.org/abs/2602.15954
Authors: Carlo Cena, Mauro Martini, Marcello Chiaberge
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Paper in peer review. Copyright notice may change
Abstract:Reliable spacecraft attitude control depends on accurate prediction of attitude dynamics, particularly when model-based strategies such as Model Predictive Control (MPC) are employed, where performance is limited by the quality of the internal system model. For spacecraft with complex dynamics, obtaining accurate physics-based models can be difficult, time-consuming, or computationally heavy. Learning-based system identification presents a compelling alternative; however, models trained exclusively on data frequently exhibit fragile stability properties and limited extrapolation capability. This work explores Physics-Informed Neural Networks (PINNs) for modeling spacecraft attitude dynamics and contrasts them with a conventional data-driven approach. A comprehensive dataset is generated using high-fidelity numerical simulations, and two learning methodologies are investigated: a purely data-driven pipeline and a physics-regularized approach that incorporates prior knowledge into the optimization process. The results indicate that embedding physical constraints during training leads to substantial improvements in predictive reliability, achieving a 68.17% decrease in mean relative error relative to the purely data-driven model. When deployed within an MPC architecture, the physics-informed models yield superior closed-loop tracking performance and improved robustness to uncertainty. Furthermore, a hybrid control formulation that merges the learned nonlinear dynamics with a nominal linear model enables consistent steady-state convergence and significantly faster response, reducing settling times by 61.52%-76.42% under measurement noise and reaction wheel friction.
[AI-48] From Tool Orchestration to Code Execution: A Study of MCP Design Choices
Quick Read: This paper addresses the coordination overhead, fragmented state management, and lack of support for wide-context operations that agent systems built on Model Context Protocols (MCPs) face as they scale: traditional tool-by-tool invocation does not scale to large tool catalogs and many concurrently connected MCP servers. The key to the solution is the Code Execution MCP (CE-MCP), which treats code execution as a first-class capability: complex workflows (e.g., SQL queries, file analysis, multi-step data transformations) are consolidated into a single program executed in an isolated runtime, substantially reducing token usage and execution latency. CE-MCP also expands the attack surface; the paper therefore applies the MAESTRO framework to identify sixteen attack classes across five execution phases and proposes a layered defense architecture combining containerized sandboxing with semantic gating, balancing scalability and security for production-ready deployment.
Link: https://arxiv.org/abs/2602.15945
Authors: Yuval Felendler, Parth A. Gandhi, Idan Habler, Yuval Elovici, Asaf Shabtai
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Model Context Protocols (MCPs) provide a unified platform for agent systems to discover, select, and orchestrate tools across heterogeneous execution environments. As MCP-based systems scale to incorporate larger tool catalogs and multiple concurrently connected MCP servers, traditional tool-by-tool invocation increases coordination overhead, fragments state management, and limits support for wide-context operations. To address these scalability challenges, recent MCP designs have incorporated code execution as a first-class capability, an approach called Code Execution MCP (CE-MCP). This enables agents to consolidate complex workflows, such as SQL querying, file analysis, and multi-step data transformations, into a single program that executes within an isolated runtime environment. In this work, we formalize the architectural distinction between context-coupled (traditional) and context-decoupled (CE-MCP) models, analyzing their fundamental scalability trade-offs. Using the MCP-Bench framework across 10 representative servers, we empirically evaluate task behavior, tool utilization patterns, execution latency, and protocol efficiency as the scale of connected MCP servers and available tools increases, demonstrating that while CE-MCP significantly reduces token usage and execution latency, it introduces a vastly expanded attack surface. We address this security gap by applying the MAESTRO framework, identifying sixteen attack classes across five execution phases, including specific code-execution threats such as exception-mediated code injection and unsafe capability synthesis. We validate these vulnerabilities through adversarial scenarios across multiple LLMs and propose a layered defense architecture comprising containerized sandboxing and semantic gating. Our findings provide a rigorous roadmap for balancing scalability and security in production-ready executable agent workflows.
[AI-49] FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution
Quick Read: This paper addresses the significant latency that general vision-language models incur when processing long-horizon histories and generating high-dimensional future predictions, which limits their use for real-time robot control. The key to the FUTURE-VLA architecture is to reformulate long-horizon control and future forecasting as a single sequence-generation task under a dual-sided efficiency paradigm: a temporally adaptive compression strategy maximizes spatiotemporal information density, allowing long multi-view histories to be ingested at constant inference latency, while latent-space autoregression aligns actionable dynamics with reviewable visual look-aheads in a single forward pass, enabling low-latency real-time prediction.
Link: https://arxiv.org/abs/2602.15882
Authors: Jingjing Fan, Yushan Liu, Shoujie Li, Botao Ren, Siyuan Li, Xiao-Ping Zhang, Wenbo Ding, Zhidong Deng
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:General vision-language models increasingly support unified spatiotemporal reasoning over long video streams, yet deploying such capabilities on robots remains constrained by the prohibitive latency of processing long-horizon histories and generating high-dimensional future predictions. To bridge this gap, we present FUTURE-VLA, a unified architecture that reformulates long-horizon control and future forecasting as a monolithic sequence-generation task. Adopting a dual-sided efficiency paradigm, FUTURE-VLA leverages a temporally adaptive compression strategy to maximize spatiotemporal information density, enabling the ingestion of extensive multi-view histories while maintaining constant inference latency. Simultaneously, it performs latent-space autoregression to align actionable dynamics with reviewable visual look-aheads in a single forward pass. These real-time predictive capabilities further enable a prediction-guided Human-In-the-Loop mechanism via interactive execution gating, allowing operators to dynamically validate behaviors based on interpretable future previews. Extensive evaluations demonstrate that FUTURE-VLA establishes new state-of-the-art performance, attaining success rates of 99.2% on LIBERO, 75.4% on RoboTwin, and 78.0% on a real-world Piper platform, all with a 16x extended spatiotemporal window while maintaining the inference latency of a single-frame baseline.
[AI-50] IT-OSE: Exploring Optimal Sample Size for Industrial Data Augmentation
Quick Read: This paper addresses two gaps in industrial data augmentation: the absence of theoretically grounded estimation of the optimal sample size (OSS), and the lack of a metric to evaluate the accuracy and stability of OSS estimates. The key to the solution is Information-Theoretic Optimal Sample Size Estimation (IT-OSE), which introduces an Interval Coverage and Deviation (ICD) score for intuitively assessing the reliability of OSS estimates and theoretically formulates the relationship between OSS and its dominant factors, improving interpretability and stability. Experiments show that, compared with empirical estimation, IT-OSE raises classification accuracy by 4.38% on average and reduces regression MAPE by 18.80% on average; compared with exhaustive search, it reaches the same OSS while cutting computational and data costs by 83.97% and 93.46% on average, and it makes OSS estimation more deterministic.
Link: https://arxiv.org/abs/2602.15878
Authors: Mingchun Sun, Rongqiang Zhao, Zhennan Huang, Songyu Ding, Jie Liu
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:In industrial scenarios, data augmentation is an effective approach to improve model performance. However, its benefits are not unidirectional. There is no theoretical research or established estimation for the optimal sample size (OSS) in augmentation, nor is there an established metric to evaluate the accuracy of OSS or its deviation from the ground truth. To address these issues, we propose an information-theoretic optimal sample size estimation (IT-OSE) to provide reliable OSS estimation for industrial data augmentation. An interval coverage and deviation (ICD) score is proposed to evaluate the estimated OSS intuitively. The relationship between OSS and dominant factors is theoretically analyzed and formulated, thereby enhancing the interpretability. Experiments show that, compared to empirical estimation, the IT-OSE increases accuracy in classification tasks across baseline models by an average of 4.38%, and reduces MAPE in regression tasks across baseline models by an average of 18.80%. The improvements in downstream model performance are more stable. ICD_dev in the ICD score is also reduced by an average of 49.30%. The determinism of OSS is enhanced. Compared to exhaustive search, the IT-OSE achieves the same OSS while reducing computational and data costs by an average of 83.97% and 93.46%. Furthermore, practicality experiments demonstrate that the IT-OSE exhibits generality across representative sensor-based industrial scenarios.
[AI-51] Genetic Generalized Additive Models
Quick Read: This paper addresses the difficulty of manually configuring the structure of Generalized Additive Models (GAMs) so as to improve predictive accuracy while preserving interpretability. The key to the solution is the multi-objective genetic algorithm NSGA-II, which automates the search over GAM structures by jointly minimizing two objectives: prediction error (RMSE) and a complexity penalty capturing sparsity, smoothness, and uncertainty. Experiments show the approach discovers GAMs that are more accurate than baseline LinearGAMs, or that match their performance at substantially lower complexity, yielding simpler, smoother, more transparent models with narrower confidence intervals.
Link: https://arxiv.org/abs/2602.15877
Authors: Kaaustaaub Shankar, Kelly Cohen
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted to NAFIPS 2026
Abstract:Generalized Additive Models (GAMs) balance predictive accuracy and interpretability, but manually configuring their structure is challenging. We propose using the multi-objective genetic algorithm NSGA-II to automatically optimize GAMs, jointly minimizing prediction error (RMSE) and a Complexity Penalty that captures sparsity, smoothness, and uncertainty. Experiments on the California Housing dataset show that NSGA-II discovers GAMs that outperform baseline LinearGAMs in accuracy or match performance with substantially lower complexity. The resulting models are simpler, smoother, and exhibit narrower confidence intervals, enhancing interpretability. This framework provides a general approach for automated optimization of transparent, high-performing models. The code can be found at this https URL.
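At the heart of NSGA-II is sorting candidates into successive non-dominated fronts before selection; a compact sketch over (RMSE, complexity-penalty) pairs, with illustrative scores (the paper's actual encoding of GAM structures is not reproduced here):

```python
def nondominated_fronts(pop):
    """Sort (rmse, complexity) pairs into successive Pareto fronts,
    as NSGA-II does; both objectives are minimized."""
    remaining, fronts = list(pop), []
    while remaining:
        front = [p for p in remaining
                 if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                            for q in remaining)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts

# Each candidate GAM is scored by (RMSE, complexity penalty).
population = [(0.50, 3.0), (0.45, 5.0), (0.60, 1.0), (0.55, 4.0), (0.52, 2.5)]
for i, front in enumerate(nondominated_fronts(population)):
    print(f"front {i}: {front}")
```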
[AI-52] Fly0: Decoupling Semantic Grounding from Geometric Planning for Zero-Shot Aerial Navigation
Quick Read: This paper addresses the trade-off between semantic understanding and control precision in current visual-language navigation (VLN): although multimodal large language models (MLLMs) offer strong reasoning, using them directly as low-level controllers causes high latency, trajectory oscillation, and poor generalization due to weak geometric grounding. The key to the proposed Fly0 framework is a three-stage decoupling of semantic reasoning from geometric planning: an MLLM module grounds natural-language instructions into 2D pixel coordinates; depth data is used to localize the target in 3D space; and a geometric planner generates collision-free trajectories. This design markedly improves robustness and stability, particularly when visual contact with the target is lost, and greatly reduces computational overhead by eliminating continuous MLLM inference.
Link: https://arxiv.org/abs/2602.15875
Authors: Zhenxing Xu, Brikit Lu, Weidong Bao, Zhengqiu Zhu, Junsong Zhang, Hui Yan, Wenhao Lu, Ji Wang
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current Visual-Language Navigation (VLN) methodologies face a trade-off between semantic understanding and control precision. While Multimodal Large Language Models (MLLMs) offer superior reasoning, deploying them as low-level controllers leads to high latency, trajectory oscillations, and poor generalization due to weak geometric grounding. To address these limitations, we propose Fly0, a framework that decouples semantic reasoning from geometric planning. The proposed method operates through a three-stage pipeline: (1) an MLLM-driven module for grounding natural language instructions into 2D pixel coordinates; (2) a geometric projection module that utilizes depth data to localize targets in 3D space; and (3) a geometric planner that generates collision-free trajectories. This mechanism enables robust navigation even when visual contact is lost. By eliminating the need for continuous inference, Fly0 reduces computational overhead and improves system stability. Extensive experiments in simulation and real-world environments demonstrate that Fly0 outperforms state-of-the-art baselines, improving the Success Rate by over 20% and reducing Navigation Error (NE) by approximately 50% in unstructured environments. Our code is available at this https URL.
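Stage (2) of such a pipeline is standard pinhole back-projection; a minimal sketch with assumed intrinsics (Fly0's actual camera model and coordinate frames are not specified here):

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift a 2D pixel (u, v) grounded by the MLLM into a 3D point in
    the camera frame, using the depth value and pinhole intrinsics."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Illustrative intrinsics for a 640x480 camera; real values come from calibration.
fx = fy = 525.0
cx, cy = 320.0, 240.0
target_cam = backproject(u=400, v=200, depth_m=3.2, fx=fx, fy=fy, cx=cx, cy=cy)
print(target_cam)  # 3D goal handed to the geometric planner
```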
[AI-53] st-Time Adaptation for Tactile-Vision-Language Models
Quick Read: This paper addresses the test-time distribution shifts faced by tactile-vision-language (TVL) models deployed in real-world settings, especially when asynchronous cross-modal corruptions make some modalities unreliable; existing test-time adaptation (TTA) methods lack explicit modeling of per-modality reliability and are therefore brittle. The key to the solution is a reliability-aware framework that estimates each modality's reliability from prediction uncertainty and perturbation responses and uses this shared signal in three places: (i) filtering unreliable test samples, (ii) adaptively fusing tactile, visual, and language features, and (iii) regularizing test-time optimization with a reliability-guided objective. On the TAG-C benchmark and other TVL scenarios, the method outperforms strong TTA baselines by up to 49.9% accuracy under severe modality corruption.
Link: https://arxiv.org/abs/2602.15873
Authors: Chuyang Ye, Haoxian Jing, Qinting Jiang, Yixi Lin, Qiang Li, Xing Tang, Jingyan Jiang
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Tactile-vision-language (TVL) models are increasingly deployed in real-world robotic and multimodal perception tasks, where test-time distribution shifts are unavoidable. Existing test-time adaptation (TTA) methods provide filtering in unimodal settings but lack explicit treatment of modality-wise reliability under asynchronous cross-modal shifts, leaving them brittle when some modalities become unreliable. We study TTA for TVL models under such shifts and propose a reliability-aware framework that estimates per-modality reliability from prediction uncertainty and perturbation-based responses. This shared reliability signal is used to (i) filter unreliable test samples, (ii) adaptively fuse tactile, visual, and language features, and (iii) regularize test-time optimization with a reliability-guided objective. On the TAG-C benchmark and additional TVL scenarios, our approach consistently outperforms strong TTA baselines, achieving accuracy gains of up to 49.9% under severe modality corruptions, underscoring the importance of explicit modality-wise reliability modeling for robust test-time adaptation.
[AI-54] Kalman-Inspired Runtime Stability and Recovery in Hybrid Reasoning Systems
Quick Read: This paper addresses the insufficient runtime stability of hybrid reasoning systems under partial observability and sustained evidence mismatch, focusing on gradual failures in which internal reasoning dynamics slowly diverge. The key to the solution is a Kalman-inspired modeling perspective that treats reasoning as a stochastic inference process driven by an internal innovation signal and defines "cognitive drift" as a measurable runtime phenomenon; a runtime stability framework monitors innovation statistics, detects emerging instability, and triggers recovery-aware control mechanisms, so that stability is characterized by detectability, bounded divergence, and finite-time recoverability rather than task-level correctness alone.
Link: https://arxiv.org/abs/2602.15855
Authors: Barak Or
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review
Abstract:Hybrid reasoning systems that combine learned components with model-based inference are increasingly deployed in tool-augmented decision loops, yet their runtime behavior under partial observability and sustained evidence mismatch remains poorly understood. In practice, failures often arise as gradual divergence of internal reasoning dynamics rather than as isolated prediction errors. This work studies runtime stability in hybrid reasoning systems from a Kalman-inspired perspective. We model reasoning as a stochastic inference process driven by an internal innovation signal and introduce cognitive drift as a measurable runtime phenomenon. Stability is defined in terms of detectability, bounded divergence, and recoverability rather than task-level correctness. We propose a runtime stability framework that monitors innovation statistics, detects emerging instability, and triggers recovery-aware control mechanisms. Experiments on multi-step, tool-augmented reasoning tasks demonstrate reliable instability detection prior to task failure and show that recovery, when feasible, re-establishes bounded internal behavior within finite time. These results emphasize runtime stability as a system-level requirement for reliable reasoning under uncertainty.
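A minimal sketch of innovation-based monitoring on a 1D Kalman filter: the normalized innovation squared (NIS) is tracked against a chi-square threshold, and sustained excess triggers recovery. The filter, thresholds, and drift scenario below are illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)
x_est, P, Q, R = 0.0, 1.0, 0.01, 0.25
threshold, window, nis = 3.84, 5, []      # chi2(1) 95% quantile

for t in range(60):
    truth = 0.0 if t < 30 else 2.0        # evidence mismatch begins at t = 30
    z = truth + rng.normal(0.0, np.sqrt(R))
    P += Q                                 # predict step (static-state model)
    innov, S = z - x_est, P + R            # innovation and its variance
    nis.append(innov**2 / S)               # normalized innovation squared
    K = P / S
    x_est += K * innov                     # update step
    P *= 1 - K
    if t >= window and np.mean(nis[-window:]) > threshold:
        print(f"t={t}: drift detected (mean NIS {np.mean(nis[-window:]):.1f}); trigger recovery")
        break
```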
[AI-55] EdgeNav-QE: QLoRA Quantization and Dynamic Early Exit for LAM-based Navigation on Edge Devices
Quick Read: This paper addresses the memory and latency challenges of deploying Large Action Models (LAMs) on edge devices, particularly for efficient real-time inference in autonomous navigation. The key to the proposed EdgeNav-QE framework is combining Quantized Low-Rank Adaptation (QLoRA) with a dynamic early-exit (DEE) mechanism: the backbone is quantized to 4-bit precision to sharply reduce memory, and adaptive early-exit branches let simple tasks terminate inference early while complex tasks retain full-depth computation. While maintaining an 81.8% navigation success rate, the approach cuts inference latency by 82.7% and memory footprint by 66.7%, and beats a static early-exit baseline by 17.9% in latency, demonstrating the value of content-aware adaptive computation in safety-critical settings.
Link: https://arxiv.org/abs/2602.15836
Authors: Mengyun Liu, Shanshan Huang, Jianan Jiang
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Action Models (LAMs) have shown immense potential in autonomous navigation by bridging high-level reasoning with low-level control. However, deploying these multi-billion parameter models on edge devices remains a significant challenge due to memory constraints and latency requirements. In this paper, we propose EdgeNav-QE, a novel framework that integrates Quantized Low-Rank Adaptation (QLoRA) with a dynamic early-exit (DEE) mechanism to optimize LAMs for real-time edge navigation. By quantizing the backbone to 4-bit precision and strategically placing early-exit branches, we enable the model to terminate inference early for simple navigation tasks while retaining full depth for complex decision-making. Experimental results on the Habitat-Sim environment with the Matterport3D dataset, using an OpenVLA-7B backbone, demonstrate that EdgeNav-QE reduces inference latency by 82.7% and memory footprint by 66.7% compared to full-precision baselines, while maintaining an 81.8% navigation success rate. Furthermore, it outperforms the state-of-the-art static early-exit method by 17.9% in latency, demonstrating the superiority of content-aware adaptive computation for safety-critical applications.
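A hedged sketch of confidence-thresholded early exit: each stage exposes an exit head, and inference stops once the head is confident enough. The toy stages below stand in for quantized transformer blocks; the threshold and stage behavior are illustrative:

```python
import numpy as np

def early_exit_predict(x, stages, exit_threshold=0.9):
    """Run stacked model stages; each stage returns (features, probs).
    Stop as soon as the exit head is confident enough."""
    for depth, stage in enumerate(stages):
        x, probs = stage(x)
        if probs.max() >= exit_threshold:
            return probs.argmax(), depth          # cheap path for easy inputs
    return probs.argmax(), len(stages) - 1        # full depth for hard inputs

def make_stage(sharpness):
    def stage(x):
        logits = np.array([x, -x]) * sharpness     # toy 2-way exit head
        p = np.exp(logits - logits.max()); p /= p.sum()
        return x, p
    return stage

stages = [make_stage(s) for s in (0.5, 1.5, 4.0)]  # deeper = more confident
print(early_exit_predict(3.0, stages))  # easy input: exits at depth 0
print(early_exit_predict(0.2, stages))  # hard input: uses the full depth
```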
[AI-56] Enhanced Diffusion Sampling: Efficient Rare Event Sampling and Free Energy Calculation with Diffusion Models
Quick Read: This paper addresses the long-standing rare-event sampling problem in molecular dynamics, particularly the bottleneck in computing observables that depend on states rare at equilibrium (e.g., folding free energies). Although diffusion-model equilibrium samplers such as BioEmu efficiently generate independent samples and remove the cost of sampling transition events, estimating observables tied to rare states remains inefficient. The key to the solution is the proposed "enhanced diffusion sampling" framework: quantitatively accurate steering protocols generate biased ensembles, and rigorous reweighting recovers equilibrium statistics, enabling efficient exploration of rare-event regions with unbiased thermodynamic estimators.
Link: https://arxiv.org/abs/2602.16634
Authors: Yu Xie, Ludwig Winkler, Lixin Sun, Sarah Lewis, Adam E. Foster, José Jiménez Luna, Tim Hempel, Michael Gastegger, Yaoyi Chen, Iryna Zaporozhets, Cecilia Clementi, Christopher M. Bishop, Frank Noé
Institutions: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph)
Comments:
Abstract:The rare-event sampling problem has long been the central limiting factor in molecular dynamics (MD), especially in biomolecular simulation. Recently, diffusion models such as BioEmu have emerged as powerful equilibrium samplers that generate independent samples from complex molecular distributions, eliminating the cost of sampling rare transition events. However, a sampling problem remains when computing observables that rely on states which are rare in equilibrium, for example folding free energies. Here, we introduce enhanced diffusion sampling, enabling efficient exploration of rare-event regions while preserving unbiased thermodynamic estimators. The key idea is to perform quantitatively accurate steering protocols to generate biased ensembles and subsequently recover equilibrium statistics via exact reweighting. We instantiate our framework in three algorithms: UmbrellaDiff (umbrella sampling with diffusion models), ΔG-Diff (free-energy differences via tilted ensembles), and MetaDiff (a batchwise analogue for metadynamics). Across toy systems, protein folding landscapes and folding free energies, our methods achieve fast, accurate, and scalable estimation of equilibrium properties within GPU-minutes to hours per system – closing the rare-event sampling gap that remained after the advent of diffusion-model equilibrium samplers.
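The reweighting step can be illustrated with plain importance sampling: draw from a biased ensemble and reweight to the Boltzmann target in log space. The potential, the Gaussian stand-in for the steered sampler, and the observable are all our illustrative choices (requires SciPy):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
kT = 1.0

def U(x):                              # toy double-well; one well is rare
    return (x**2 - 1.0)**2 + 0.8 * x

# A broad Gaussian stands in for the steered/biased sampler: it reaches
# regions that are rare under the Boltzmann target exp(-U/kT).
mu, sigma = 0.0, 1.5
x = rng.normal(mu, sigma, 100_000)

# Exact reweighting: w = target / proposal, computed in log space.
logw = -U(x) / kT - norm.logpdf(x, mu, sigma)
w = np.exp(logw - logw.max())

print("unbiased equilibrium <x>:", (w * x).sum() / w.sum())
```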
[AI-57] AI-Driven Structure Refinement of X-ray Diffraction
Quick Read: This paper addresses the problem that candidate phases and structures proposed by generative AI from X-ray diffraction (XRD) data often fail in downstream refinement, because peak intensities cannot be stably assigned under challenging experimental conditions such as severe peak overlap, mixed radiation, or multiple coexisting phases. The key to the solution is WPEM, a physics-constrained whole-pattern decomposition and refinement workflow that embeds Bragg's law explicitly in a batch expectation-maximization framework: the full profile is modeled as a probabilistic mixture density, component-resolved intensities are inferred iteratively while peak centres remain Bragg-consistent, and the result is a continuous, physically admissible intensity representation that stays stable and accurate in heavily overlapped regions and multiphase systems.
Link: https://arxiv.org/abs/2602.16372
Authors: Bin Cao, Qian Zhang, Zhenjie Feng, Taolue Zhang, Jiaqiang Huang, Lu-Tao Weng, Tong-Yi Zhang
Institutions: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments:
Abstract:Artificial intelligence can rapidly propose candidate phases and structures from X-ray diffraction (XRD), but these hypotheses often fail in downstream refinement because peak intensities cannot be stably assigned under severe overlap and diffraction consistency is enforced only weakly. Here we introduce WPEM, a physics-constrained whole-pattern decomposition and refinement workflow that turns Bragg’s law into an explicit constraint within a batch expectation–maximization framework. WPEM models the full profile as a probabilistic mixture density and iteratively infers component-resolved intensities while keeping peak centres Bragg-consistent, producing a continuous, physically admissible intensity representation that remains stable in heavily overlapped regions and in the presence of mixed radiation or multiple phases. We benchmark WPEM on standard reference patterns (PbSO4 and Tb2BaCoO5), where it yields lower R_p/R_wp than widely used packages (FullProf and TOPAS) under matched refinement conditions. We further demonstrate generality across realistic experimental scenarios, including phase-resolved decomposition of a multiphase Ti–15Nb thin film, quantitative recovery of NaCl–Li2CO3 mixture compositions, separation of crystalline peaks from amorphous halos in semicrystalline polymers, high-throughput operando lattice tracking in layered cathodes, automated refinement of a compositionally disordered Ru–Mn oxide solid solution (CCDC 2530452), and quantitative phase-resolved deciphering of an ancient Egyptian make-up sample from synchrotron powder XRD. By providing Bragg-consistent, uncertainty-aware intensity partitioning as a refinement-ready interface, WPEM closes the gap between AI-generated hypotheses and diffraction-admissible structure refinement on challenging XRD data.
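The Bragg constraint at the core of WPEM fixes where peak centres may sit; a minimal sketch computing Bragg-consistent 2θ positions for a cubic lattice (the lattice constant and Cu K-alpha wavelength below are illustrative choices, not values from the paper):

```python
import math

def two_theta_deg(d_spacing_A, wavelength_A=1.5406, order=1):
    """Bragg's law: n*lambda = 2*d*sin(theta). Returns the diffraction
    angle 2*theta in degrees, or None if the reflection is unobservable."""
    s = order * wavelength_A / (2.0 * d_spacing_A)
    if s > 1.0:
        return None
    return 2.0 * math.degrees(math.asin(s))

# d-spacings (angstroms) for a few cubic-lattice reflections, a = 4.0 A:
a = 4.0
for h, k, l in [(1, 1, 1), (2, 0, 0), (2, 2, 0)]:
    d = a / math.sqrt(h * h + k * k + l * l)
    print((h, k, l), f"2theta = {two_theta_deg(d):.2f} deg")
```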
[AI-58] Color-based Emotion Representation for Speech Emotion Recognition
Quick Read: This paper addresses the limited expressiveness and interpretability of traditional speech emotion recognition (SER), where categorical or dimensional labels struggle to capture the diversity of emotional states. The key to the solution is to introduce color attributes (hue, saturation, value) as a continuous, interpretable representation: an emotional speech corpus is annotated with color attributes via crowdsourcing, regression models for the color attributes are built with machine learning and deep learning, and multitask learning of color-attribute regression and emotion classification is explored. Experiments reveal the relationship between color attributes and emotion in speech and show that multitask learning improves the performance of each task.
Link: https://arxiv.org/abs/2602.16256
Authors: Ryotaro Nagase, Ryoichi Takashima, Yoichi Yamashita
Institutions: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Submitted to EUSIPCO 2026
Abstract:Speech emotion recognition (SER) has traditionally relied on categorical or dimensional labels. However, this technique is limited in representing both the diversity and interpretability of emotions. To overcome this limitation, we focus on color attributes, such as hue, saturation, and value, to represent emotions as continuous and interpretable scores. We annotated an emotional speech corpus with color attributes via crowdsourcing and analyzed them. Moreover, we built regression models for color attributes in SER using machine learning and deep learning, and explored the multitask learning of color attribute regression and emotion classification. As a result, we demonstrated the relationship between color attributes and emotions in speech, and successfully developed color attribute regression models for SER. We also showed that multitask learning improved the performance of each task.
[AI-59] Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在有限样本条件下的可学习性(practical learnability)问题,以及非凸优化与泛化性能之间的理论联系。其核心挑战在于如何从理论上刻画训练过程能否收敛至全局最优,并解释模型结构、批量大小等因素对优化和泛化的影响。解决方案的关键在于构建了一个基于凸共轭对偶理论(convex conjugate duality)的共轭学习理论框架,通过联合控制结构矩阵的极端特征值与梯度能量,证明了使用小批量随机梯度下降(mini-batch SGD)训练DNN可实现经验风险的全局最优解;同时,该框架进一步推导出适用于任意模型的泛化误差下界,明确量化了信息损失(由不可逆变换引起)、最大可达损失值及特征-标签间的广义条件熵对泛化行为的作用机制,从而为理解正则化、不可逆变换与网络深度等关键因素提供了统一的理论视角。
链接: https://arxiv.org/abs/2602.16177
作者: Binchuan Qi
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In this work, we propose a notion of practical learnability grounded in finite sample settings, and develop a conjugate learning theoretical framework based on convex conjugate duality to characterize this learnability property. Building on this foundation, we demonstrate that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) achieves global optima of empirical risk by jointly controlling the extreme eigenvalues of a structure matrix and the gradient energy, and we establish a corresponding convergence theorem. We further elucidate the impact of batch size and model architecture (including depth, parameter count, sparsity, skip connections, and other characteristics) on non-convex optimization. Additionally, we derive a model-agnostic lower bound for the achievable empirical risk, theoretically demonstrating that data determines the fundamental limit of trainability. On the generalization front, we derive deterministic and probabilistic bounds on generalization error based on generalized conditional entropy measures. The former explicitly delineates the range of generalization error, while the latter characterizes the distribution of generalization error relative to the deterministic bounds under independent and identically distributed (i.i.d.) sampling conditions. Furthermore, these bounds explicitly quantify the influence of three key factors: (i) information loss induced by irreversibility in the model, (ii) the maximum attainable loss value, and (iii) the generalized conditional entropy of features with respect to labels. Moreover, they offer a unified theoretical lens for understanding the roles of regularization, irreversible transformations, and network depth in shaping the generalization behavior of deep neural networks. Extensive experiments validate all theoretical predictions, confirming the framework’s correctness and consistency.
[AI-60] Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing
【速读】:该论文旨在解决在线媒体平台在A/B实验中评估用户对特定内容属性暴露频率时面临的高成本与低效率问题。传统方法需为每个实验组和细分群体重复进行高质量内容标注,导致计算资源消耗大、响应延迟高,难以规模化应用。其解决方案的关键在于提出一种基于代理信号(surrogate-based)的流行度测量框架:通过离线校准一个低成本的代理指标(如模型得分分桶,score bucketing),将高成本标签与仅依赖曝光日志(impression logs)的估计过程解耦;具体而言,先在离线样本上估算各分桶的流行度,再结合各实验组中分桶的曝光分布,快速生成任意实验臂和细分群体的流行度估计,从而实现低延迟、可扩展的测量能力,且经多轮大规模A/B测试验证,代理估计结果与参考基准高度一致。
链接: https://arxiv.org/abs/2602.16111
作者: Zehao Xu,Tony Paek,Kevin O’Sullivan,Attila Dobi
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
备注:
Abstract:Online media platforms often need to measure how frequently users are exposed to specific content attributes in order to evaluate trade-offs in A/B experiments. A direct approach is to sample content, label it using a high-quality rubric (e.g., an expert-reviewed LLM prompt), and estimate impression-weighted prevalence. However, repeatedly running such labeling for every experiment arm and segment is too costly and slow to serve as a default measurement at scale. We present a scalable \emph{surrogate-based} prevalence measurement framework that decouples expensive labeling from per-experiment evaluation. The framework calibrates a surrogate signal to reference labels offline and then uses only impression logs to estimate prevalence for arbitrary experiment arms and segments. We instantiate this framework using \emph{score bucketing} as the surrogate: we discretize a model score into buckets, estimate bucket-level prevalences from an offline labeled sample, and combine these calibrated bucket-level prevalences with the bucket distribution of impressions in each arm to obtain fast, log-based estimates. Across multiple large-scale A/B tests, we validate that the surrogate estimates closely match the reference estimates for both arm-level prevalence and treatment–control deltas. This enables scalable, low-latency prevalence measurement in experimentation without requiring per-experiment labeling jobs.
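其中"离线校准分桶流行度、在线只用曝光日志合成估计"的两步流程,可用如下草图体会(桶数与数据生成方式均为笔者假设,仅演示解耦的思路):

```python
import numpy as np

# 离线校准:在带参考标签的样本上估计各分数桶的(曝光加权)流行度
def calibrate_buckets(scores, labels, weights, edges):
    idx = np.digitize(scores, edges) - 1
    prev = np.zeros(len(edges) - 1)
    for b in range(len(prev)):
        m = idx == b
        if m.any():
            prev[b] = np.average(labels[m], weights=weights[m])
    return prev

# 在线估计:任意实验臂/细分群体只需其曝光日志中各桶的曝光占比
def arm_prevalence(arm_scores, arm_weights, edges, bucket_prev):
    idx = np.digitize(arm_scores, edges) - 1
    share = np.array([arm_weights[idx == b].sum() for b in range(len(bucket_prev))])
    return float(share / share.sum() @ bucket_prev)

rng = np.random.default_rng(0)
edges = np.linspace(0, 1, 11)                       # 10 个等宽分数桶(桶数为假设)
s = rng.uniform(size=20000)                         # 离线样本的模型分数
y = (rng.uniform(size=20000) < s).astype(float)     # 参考标签:分数越高越可能含该属性
bucket_prev = calibrate_buckets(s, y, np.ones_like(s), edges)
# 某实验臂的曝光分数分布整体偏低(s**2),其流行度估计应明显低于 0.5
print(arm_prevalence(rng.uniform(size=5000) ** 2, np.ones(5000), edges, bucket_prev))
```

离线标注只做一次;之后任何实验臂的估计都退化为一次按桶计数,这就是"低延迟、可扩展"的来源。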
[AI-61] A fully differentiable framework for training proxy Exchange Correlation Functionals for periodic systems
【速读】:该论文旨在解决密度泛函理论(Density Functional Theory, DFT)在模拟大体系时计算成本高昂的问题。其核心解决方案是提出一个可微分框架,将机器学习模型(特别是神经网络)作为交换-关联(exchange-correlation, XC)泛函的替代品集成到DFT中,支持梯度在完整的自洽场DFT流程中传递,从而实现端到端的可微分建模。该框架基于PyTorch实现,提供清晰的API接口,并与DeepChem库集成,便于复用现有模型并降低实验门槛。初步测试表明,该方法在与GPAW和PySCF等主流电子结构软件对比时,能量误差控制在5–10%以内。
链接: https://arxiv.org/abs/2602.15923
作者: Rakshit Kumar Singh,Aryan Amit Barsainyan,Bharath Ramsundar
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Density Functional Theory (DFT) is widely used for first-principles simulations in chemistry and materials science, but its computational cost remains a key limitation for large systems. Motivated by recent advances in ML-based exchange-correlation (XC) functionals, this paper introduces a differentiable framework that integrates machine learning models into density functional theory (DFT) for solids and other periodic systems. The framework defines a clean API for neural network models that can act as drop-in replacements for conventional exchange-correlation (XC) functionals and enables gradients to flow through the full self-consistent DFT workflow. The framework is implemented in Python using a PyTorch backend, making it fully differentiable and easy to use with standard deep learning tools. We integrate the implementation with the DeepChem library to promote the reuse of established models and to lower the barrier for experimentation. In initial benchmarks against established electronic structure packages (GPAW and PySCF), our models achieve relative errors on the order of 5-10%.
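所谓"可微分的 XC 泛函替身",核心是让 XC 能量对密度保持可微,从而能用自动微分直接取得 XC 势。下面是一个 LDA 型的极简示意(接口与网络结构均为笔者假设,并非该框架或 DeepChem 的实际 API):

```python
import torch
import torch.nn as nn

class NeuralLDA(nn.Module):
    """以神经网络参数化 ε_xc(n) 的 LDA 型泛函示意。"""
    def __init__(self):
        super().__init__()
        self.eps_xc = nn.Sequential(nn.Linear(1, 32), nn.SiLU(), nn.Linear(32, 1))

    def forward(self, density, dvol):
        # density: (N_grid,) 网格上的电子密度;dvol: 体积元
        eps = self.eps_xc(density.unsqueeze(-1)).squeeze(-1)
        return (density * eps).sum() * dvol   # E_xc = ∫ n(r) ε_xc(n(r)) dr 的网格近似

xc = NeuralLDA()
n = torch.rand(1000, requires_grad=True)
E_xc = xc(n, dvol=1e-3)
# 自动微分直接给出网格上的 XC 势:v_xc(r_i) ≈ (∂E_xc/∂n_i) / dvol
v_xc = torch.autograd.grad(E_xc, n, create_graph=True)[0] / 1e-3
print(E_xc.item(), v_xc.shape)
```

由于 v_xc 本身还带着计算图(create_graph=True),梯度可以继续穿过自洽场迭代,这是端到端训练 XC 替身的前提。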
[AI-62] Generalized Leverage Score for Scalable Assessment of Privacy Vulnerability
【速读】:该论文旨在解决个体数据点在机器学习模型中的隐私风险评估问题,即如何在不重新训练模型或显式模拟攻击的情况下量化其遭受成员推断攻击(Membership Inference Attack, MIA)的可能性。解决方案的关键在于识别出数据点对模型的影响与其MIA风险之间存在理论上的对应关系:在线性设定下,作者证明了个体MIA风险与杠杆得分(leverage score)之间存在一一映射,从而将隐私风险建模为一个可计算的、基于数据点影响的指标。在此基础上,进一步提出了一种适用于深度学习场景的杠杆得分推广形式,实验证明该指标与MIA成功率高度相关,可作为高效且实用的个体隐私风险代理度量。
链接: https://arxiv.org/abs/2602.15919
作者: Valentin Dorseuil(DI-ENS),Jamal Atif(CMAP),Olivier Cappé(DI-ENS)
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Can the privacy vulnerability of individual data points be assessed without retraining models or explicitly simulating attacks? We answer affirmatively by showing that exposure to membership inference attack (MIA) is fundamentally governed by a data point’s influence on the learned model. We formalize this in the linear setting by establishing a theoretical correspondence between individual MIA risk and the leverage score, identifying it as a principled metric for vulnerability. This characterization explains how data-dependent sensitivity translates into exposure, without the computational burden of training shadow models. Building on this, we propose a computationally efficient generalization of the leverage score for deep learning. Empirical evaluations confirm a strong correlation between the proposed score and MIA success, validating this metric as a practical surrogate for individual privacy risk assessment.
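线性情形下的杠杆得分就是帽子矩阵的对角元,可以几行代码算出(示意;论文面向深度学习的推广形式以原文为准):

```python
import numpy as np

def leverage_scores(X):
    """帽子矩阵对角元:h_i = x_i^T (X^T X)^{-1} x_i,刻画样本 i 对线性模型的影响力。"""
    G_inv = np.linalg.pinv(X.T @ X)
    return np.einsum('ij,jk,ik->i', X, G_inv, X)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[0] *= 5.0                                   # 人为放大一个样本的影响力
h = leverage_scores(X)
print("杠杆得分最高(按论文对应 MIA 风险最大)的样本:", int(np.argmax(h)))  # 预期为 0
```

整个计算不需要训练任何影子模型,这正是摘要中"无需重训练或模拟攻击即可评估个体隐私风险"的含义。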
[AI-63] Surrogate Modeling for Neutron Transport: A Neural Operator Approach
【速读】:该论文旨在解决传统中子输运计算中计算成本高、效率低的问题,特别是在需要多次迭代求解(如k-eigenvalue问题)时难以满足实时性或大规模参数空间评估的需求。解决方案的关键在于引入基于神经算子(neural operator)的代理建模框架,具体采用Deep Operator Network (DeepONet) 和 Fourier Neural Operator (FNO) 两种架构,学习从各向异性中子源 $ Q(x,\mu) $ 到角通量 $ \psi(x,\mu) $ 的非线性映射关系。通过在不同散射比(c = 0.1, 0.5, 1.0)下训练模型以覆盖吸收主导、中等和散射主导的输运状态(regimes),验证了其泛化能力;进一步将模型嵌入S_N k-eigenvalue求解器中替代耗时的输运扫掠循环,在较细网格上将运行时间降至传统S_N求解器的约0.1%(即约99.9%的运行时间缩减),并保持与参考解偏差不超过135 pcm的精度,显著提升了中子输运模拟的效率与实用性,为数字孪生和设计优化等场景提供了可行的高性能替代方案。
链接: https://arxiv.org/abs/2602.15890
作者: Md Hossain Sahadath,Qiyun Cheng,Shaowu Pan,Wei Ji
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This work introduces a neural operator based surrogate modeling framework for neutron transport computation. Two architectures, the Deep Operator Network (DeepONet) and the Fourier Neural Operator (FNO), were trained for fixed source problems to learn the mapping from anisotropic neutron sources, Q(x,\mu), to the corresponding angular fluxes, \psi(x,\mu), in a one-dimensional slab geometry. Three distinct models were trained for each neural operator, corresponding to different scattering ratios (c = 0.1, 0.5, 1.0), providing insight into their performance across distinct transport regimes (absorption-dominated, moderate, and scattering-dominated). The models were subsequently evaluated on a wide range of previously unseen source configurations, demonstrating that FNO generally achieves higher predictive accuracy, while DeepONet offers greater computational efficiency. Both models offered significant speedups that become increasingly pronounced as the scattering ratio increases, requiring 0.3% of the runtime of a conventional S_N solver. The surrogate models were further incorporated into the S_N k-eigenvalue solver, replacing the computationally intensive transport sweep loop with a single forward pass. Across varying fission cross sections and spatial-angular grids, both neural operator solvers reproduced reference eigenvalues with deviations up to 135 pcm for DeepONet and 112 pcm for FNO, while reducing runtime to 0.1% of that of the S_N solver on relatively fine grids. These results demonstrate the strong potential of neural operator frameworks as accurate, efficient, and generalizable surrogates for neutron transport, paving the way for real-time digital twin applications and repeated evaluations, such as in design optimization.
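DeepONet 的 branch/trunk 结构如何把源项 Q(x,\mu) 映射到角通量 \psi(x,\mu),可用如下极简 PyTorch 草图体会(传感点数、网络宽度等均为笔者假设):

```python
import torch
import torch.nn as nn

class DeepONet1D(nn.Module):
    """branch 网编码源项在固定传感点的采样,trunk 网编码查询坐标 (x, mu)。"""
    def __init__(self, n_sensors=64, p=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, 128), nn.Tanh(), nn.Linear(128, p))
        self.trunk = nn.Sequential(nn.Linear(2, 128), nn.Tanh(), nn.Linear(128, p))

    def forward(self, q_samples, coords):
        # q_samples: (B, n_sensors) 源项 Q 在传感点的取值
        # coords:    (B, n_points, 2) 查询点 (x, mu)
        b = self.branch(q_samples)                 # (B, p)
        t = self.trunk(coords)                     # (B, n_points, p)
        return torch.einsum('bp,bnp->bn', b, t)    # 预测角通量 psi(x, mu)

model = DeepONet1D()
psi = model(torch.randn(4, 64), torch.rand(4, 100, 2))
print(psi.shape)   # torch.Size([4, 100]):一次前向即可替代一轮输运扫掠
```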
[AI-64] NeuroSleep: Neuromorphic Event-Driven Single-Channel EEG Sleep Staging for Edge-Efficient Sensing
【速读】:该论文旨在解决可穿戴边缘平台上基于脑电图(EEG)的睡眠分期任务中,高频率密集计算在严苛能效预算下难以持续运行的问题。其解决方案的关键在于提出一个事件驱动的感知与推理系统 NeuroSleep,通过两个核心机制实现能效优化:一是采用残差自适应多尺度 delta 调制(R-AMSDM)将原始 EEG 转换为互补的多尺度双极事件流,从而在传感前端显式控制保真度与稀疏性之间的权衡;二是设计分层推理架构,包括事件自适应多尺度响应模块(EAMR)、局部时序注意力模块(LTAM)和 epoch-泄漏积分发放模块(ELIF),以高效提取局部特征、聚合上下文信息并捕捉长期状态持久性。实验表明,该方案在保持高精度的同时显著降低计算负载,为资源受限场景下的持续睡眠分析提供了可扩展的解决方案。
链接: https://arxiv.org/abs/2602.15888
作者: Boyu Li,Xingchun Zhu,Yonghui Wu
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 5 figures, under review at Journal of Neural Engineering
Abstract:Reliable, continuous neural sensing on wearable edge platforms is fundamental to long-term health monitoring; however, for electroencephalography (EEG)-based sleep monitoring, dense high-frequency processing is often computationally prohibitive under tight energy budgets. To address this bottleneck, this paper proposes NeuroSleep, an integrated event-driven sensing and inference system for energy-efficient sleep staging. NeuroSleep first converts raw EEG into complementary multi-scale bipolar event streams using Residual Adaptive Multi-Scale Delta Modulation (R-AMSDM), enabling an explicit fidelity-sparsity trade-off at the sensing front end. Furthermore, NeuroSleep adopts a hierarchical inference architecture that comprises an Event-based Adaptive Multi-scale Response (EAMR) module for local feature extraction, a Local Temporal-Attention Module (LTAM) for context aggregation, and an Epoch-Leaky Integrate-and-Fire (ELIF) module to capture long-term state persistence. Experimental results using subject-independent 5-fold cross-validation on the Sleep-EDF Expanded dataset demonstrate that NeuroSleep achieves a mean accuracy of 74.2% with only 0.932 M parameters while reducing sparsity-adjusted effective operations by approximately 53.6% relative to dense processing. Compared with the representative dense Transformer baseline, NeuroSleep improves accuracy by 7.5% with a 45.8% reduction in computational load. By bridging neuromorphic encoding with state-aware modeling, NeuroSleep provides a scalable solution for always-on sleep analysis in resource-constrained wearable scenarios.
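delta 调制把连续信号转成稀疏双极事件流的基本机制如下(固定阈值的极简示意;R-AMSDM 的残差项与自适应阈值细节以论文为准):

```python
import numpy as np

def delta_modulation(signal, threshold):
    """幅值相对上次事件电平变化超过阈值才发放 ±1 事件,其余时刻保持沉默。"""
    events, last = [], signal[0]
    for t, v in enumerate(signal):
        while v - last >= threshold:
            last += threshold
            events.append((t, +1))
        while last - v >= threshold:
            last -= threshold
            events.append((t, -1))
    return events

# 多尺度:用不同阈值并行编码,粗尺度稀疏、细尺度保真,正对应"保真-稀疏"权衡
eeg = np.sin(np.linspace(0, 20, 3000)) + 0.1 * np.random.randn(3000)
streams = {th: delta_modulation(eeg, th) for th in (0.05, 0.2, 0.8)}
print({th: len(ev) for th, ev in streams.items()})   # 阈值越大,事件越稀疏
```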
机器学习
[LG-0] Knowledge-Embedded Latent Projection for Robust Representation Learning
链接: https://arxiv.org/abs/2602.16709
作者: Weijing Tang,Ming Yuan,Zongqi Xia,Tianxi Cai
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:Latent space models are widely used for analyzing high-dimensional discrete data matrices, such as patient-feature matrices in electronic health records (EHRs), by capturing complex dependence structures through low-dimensional embeddings. However, estimation becomes challenging in the imbalanced regime, where one matrix dimension is much larger than the other. In EHR applications, cohort sizes are often limited by disease prevalence or data availability, whereas the feature space remains extremely large due to the breadth of medical coding systems. Motivated by the increasing availability of external semantic embeddings, such as pre-trained embeddings of clinical concepts in EHRs, we propose a knowledge-embedded latent projection model that leverages semantic side information to regularize representation learning. Specifically, we model column embeddings as smooth functions of semantic embeddings via a mapping in a reproducing kernel Hilbert space. We develop a computationally efficient two-step estimation procedure that combines semantically guided subspace construction via kernel principal component analysis with scalable projected gradient descent. We establish estimation error bounds that characterize the trade-off between statistical error and approximation error induced by the kernel projection. Furthermore, we provide local convergence guarantees for our non-convex optimization procedure. Extensive simulation studies and a real-world EHR application demonstrate the effectiveness of the proposed method.
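两步估计的大意是:先用外部语义嵌入构造列嵌入所在的低维子空间,再在该子空间约束下做(投影)梯度下降。下面是一个自包含的数值草图(为省去核 PCA 细节,这里直接用随机正交基代替语义子空间;维度均为笔者假设):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r, d = 300, 1000, 8, 20                    # 行少列多的不平衡矩阵:如患者 × 医学编码
Phi, _ = np.linalg.qr(rng.normal(size=(p, d)))   # 假设:由语义嵌入(经核 PCA)得到的正交基
U_true = rng.normal(size=(n, r))
W_true = rng.normal(size=(d, r))
X = U_true @ (Phi @ W_true).T + 0.1 * rng.normal(size=(n, p))

# 第二步(示意):列嵌入约束为 V = Phi @ W,对 (U, W) 做梯度下降
U = rng.normal(size=(n, r)) * 0.1
W = rng.normal(size=(d, r)) * 0.1
for _ in range(500):
    V = Phi @ W
    R = X - U @ V.T                              # 残差
    U += 1e-3 * R @ V                            # 负梯度方向更新行嵌入
    W += 1e-3 * Phi.T @ (R.T @ U)                # 更新系数,列嵌入自动落在语义子空间内
print("相对重构误差:", np.linalg.norm(X - U @ (Phi @ W).T) / np.linalg.norm(X))
```

由于列嵌入只能在 d 维语义子空间内移动,即使 p 远大于 n,需要估计的自由度也被压到 O((n+d)r),这正是语义侧信息的正则化作用。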
[LG-1] Causality is Key for Interpretability Claims to Generalise
链接: https://arxiv.org/abs/2602.16698
作者: Shruti Joshi,Aaron Mueller,David Klindt,Wieland Brendel,Patrik Reizinger,Dhanya Sridhar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Interpretability research on large language models (LLMs) has yielded important insights into model behaviour, yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl’s causal hierarchy clarifies what an interpretability study can justify. Observations establish associations between model behaviour and internal components. Interventions (e.g., ablations or activation patching) support claims about how these edits affect a behavioural metric (e.g., average change in token probabilities) over a set of prompts. However, counterfactual claims – i.e., asking what the model output would have been for the same prompt under an unobserved intervention – remain largely unverifiable without controlled supervision. We show how causal representation learning (CRL) operationalises this hierarchy, specifying which variables are recoverable from activations and under what assumptions. Together, these motivate a diagnostic framework that helps practitioners select methods and evaluations matching claims to evidence such that findings generalise.
[LG-2] Protecting the Undeleted in Machine Unlearning
链接: https://arxiv.org/abs/2602.16697
作者: Aloni Cohen,Refael Kohen,Kobbi Nissim,Uri Stemmer
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:Machine unlearning aims to remove specific data points from a trained model, often striving to emulate “perfect retraining”, i.e., producing the model that would have been obtained had the deleted data never been included. We demonstrate that this approach, and security definitions that enable it, carry significant privacy risks for the remaining (undeleted) data points. We present a reconstruction attack showing that for certain tasks, which can be computed securely without deletions, a mechanism adhering to perfect retraining allows an adversary controlling merely \omega(1) data points to reconstruct almost the entire dataset merely by issuing deletion requests. We survey existing definitions for machine unlearning, showing they are either susceptible to such attacks or too restrictive to support basic functionalities like exact summation. To address this problem, we propose a new security definition that specifically safeguards undeleted data against leakage caused by the deletion of other points. We show that our definition permits several essential functionalities, such as bulletin boards, summations, and statistical learning.
[LG-3] On the Hardness of Approximation of the Fair k-Center Problem
链接: https://arxiv.org/abs/2602.16688
作者: Suhas Thejaswi
类目: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:In this work, we study the hardness of approximation of the fair k-center problem. Here the data points are partitioned into groups and the task is to choose a prescribed number of data points from each group, called centers, while minimizing the maximum distance from any point to its closest center. Although a polynomial-time 3-approximation is known for this problem in general metrics, it has remained open whether this approximation guarantee is tight or could be further improved, especially since the unconstrained k-center problem admits a polynomial-time factor-2 approximation. We resolve this open question by proving that, for every \epsilon > 0, achieving a (3-\epsilon)-approximation is NP-hard, assuming P \neq NP. Our inapproximability results hold even when only two disjoint groups are present and at least one center must be chosen from each group. Further, they extend to the canonical one-per-group setting with k groups (for arbitrary k), where exactly one center must be selected from each group. Consequently, the factor-3 barrier for fair k-center in general metric spaces is inherent, and existing 3-approximation algorithms are optimal up to lower-order terms even in these restricted regimes. This result stands in sharp contrast to the k-supplier formulation, where both the unconstrained and fair variants admit factor-3 approximation in polynomial time.
[LG-4] Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition
链接: https://arxiv.org/abs/2602.16684
作者: Bo Pan,Peter Zhiping Zhang,Hao-Wei Pang,Alex Zhu,Xiang Yu,Liying Zhang,Liang Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Matched molecular pairs (MMPs) capture the local chemical edits that medicinal chemists routinely use to design analogs, but existing ML approaches either operate at the whole-molecule level with limited edit controllability or learn MMP-style edits from restricted settings and small models. We propose a variable-to-variable formulation of analog generation and train a foundation model on large-scale MMP transformations (MMPTs) to generate diverse variables conditioned on an input variable. To enable practical control, we develop prompting mechanisms that let the users specify preferred transformation patterns during generation. We further introduce MMPT-RAG, a retrieval-augmented framework that uses external reference analogs as contextual guidance to steer generation and generalize from project-specific series. Experiments on general chemical corpora and patent-specific datasets demonstrate improved diversity, novelty, and controllability, and show that our method recovers realistic analog structures in practical discovery scenarios.
[LG-5] Factorization Machine with Quadratic-Optimization Annealing for RNA Inverse Folding and Evaluation of Binary-Integer Encoding and Nucleotide Assignment
链接: https://arxiv.org/abs/2602.16643
作者: Shuta Kikuchi,Shu Tanaka
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)
*备注: 17 pages, 10 figures
Abstract:The RNA inverse folding problem aims to identify nucleotide sequences that preferentially adopt a given target secondary structure. While various heuristic and machine learning-based approaches have been proposed, many require a large number of sequence evaluations, which limits their applicability when experimental validation is costly. We propose a method to solve the problem using a factorization machine with quadratic-optimization annealing (FMQA). FMQA is a discrete black-box optimization method reported to obtain high-quality solutions with a limited number of evaluations. Applying FMQA to the problem requires converting nucleotides into binary variables. However, the influence of integer-to-nucleotide assignments and binary-integer encoding on the performance of FMQA has not been thoroughly investigated, even though such choices determine the structure of the surrogate model and the search landscape, and thus can directly affect solution quality. Therefore, this study aims both to establish a novel FMQA framework for RNA inverse folding and to analyze the effects of these assignments and encoding methods. We evaluated all 24 possible assignments of the four nucleotides to the ordered integers (0-3), in combination with four binary-integer encoding methods. Our results demonstrated that one-hot and domain-wall encodings outperform binary and unary encodings in terms of the normalized ensemble defect value. In domain-wall encoding, nucleotides assigned to the boundary integers (0 and 3) appeared with higher frequency. In the RNA inverse folding problem, assigning guanine and cytosine to these boundary integers promoted their enrichment in stem regions, which led to more thermodynamically stable secondary structures than those obtained with one-hot encoding.
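文中比较的 one-hot 与 domain-wall 两种二值-整数编码,区别在于比特数与"边界整数"的地位(示意;把 G、C 指派到边界整数 0 和 3 只是 24 种指派中的一种,此处作为假设给出):

```python
import numpy as np

# 四种核苷酸映射到有序整数 0-3(假设的指派:G、C 位于边界)
assignment = {'G': 0, 'A': 1, 'U': 2, 'C': 3}

def one_hot(i, n=4):
    v = np.zeros(n, dtype=int)
    v[i] = 1
    return v                                       # 例:2 -> [0, 0, 1, 0],每值 n 比特

def domain_wall(i, n=4):
    # n 个取值只需 n-1 个比特:前 i 位为 1,其余为 0,"畴壁"的位置即取值
    return np.array([1] * i + [0] * (n - 1 - i))   # 例:2 -> [1, 1, 0]

for base, i in assignment.items():
    print(base, one_hot(i), domain_wall(i))
```

domain-wall 编码中,0 和 3 分别对应全 0 与全 1 的比特串,因而这两个"边界整数"在搜索动力学上的地位特殊,这与文中"边界整数对应的核苷酸出现频率更高"的观察相呼应。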
[LG-6] Optimizer choice matters for the emergence of Neural Collapse ICLR2026
链接: https://arxiv.org/abs/2602.16642
作者: Jim Zhao,Tin Sum Cheng,Wojciech Masarczyk,Aurelien Lucchi
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2026
Abstract:Neural Collapse (NC) refers to the emergence of highly symmetric geometric structures in the representations of deep neural networks during the terminal phase of training. Despite its prevalence, the theoretical understanding of NC remains limited. Existing analyses largely ignore the role of the optimizer, thereby suggesting that NC is universal across optimization methods. In this work, we challenge this assumption and demonstrate that the choice of optimizer plays a critical role in the emergence of NC. The phenomenon is typically quantified through NC metrics, which, however, are difficult to track and analyze theoretically. To overcome this limitation, we introduce a novel diagnostic metric, NC0, whose convergence to zero is a necessary condition for NC. Using NC0, we provide theoretical evidence that NC cannot emerge under decoupled weight decay in adaptive optimizers, as implemented in AdamW. Concretely, we prove that SGD, SignGD with coupled weight decay (a special case of Adam), and SignGD with decoupled weight decay (a special case of AdamW) exhibit qualitatively different NC0 dynamics. Also, we show the accelerating effect of momentum on NC (beyond convergence of train loss) when trained with SGD, being the first result concerning momentum in the context of NC. Finally, we conduct extensive empirical experiments consisting of 3,900 training runs across various datasets, architectures, optimizers, and hyperparameters, confirming our theoretical results. This work provides the first theoretical explanation for optimizer-dependent emergence of NC and highlights the overlooked role of weight-decay coupling in shaping the implicit biases of optimizers.
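论文提出的 NC0 指标未在摘要中给出定义,此处不作臆测;作为背景,下面演示文献中经典的 NC1 度量(类内变差相对类间结构)如何计算,便于理解"跟踪 NC 指标"的含义(示意代码,数据为人造):

```python
import numpy as np

def nc1_metric(features, labels):
    """经典 NC1 度量:Tr(Σ_W Σ_B^+)/C,越小表示表示越接近坍缩。"""
    classes = np.unique(labels)
    mu_g = features.mean(axis=0)
    d = features.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Fc = features[labels == c]
        mu_c = Fc.mean(axis=0)
        Sw += (Fc - mu_c).T @ (Fc - mu_c) / len(features)   # 类内散度
        Sb += np.outer(mu_c - mu_g, mu_c - mu_g) / len(classes)  # 类间散度
    return np.trace(Sw @ np.linalg.pinv(Sb)) / len(classes)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 100)
means = rng.normal(size=(4, 16)) * 3
collapsed = means[labels] + 0.01 * rng.normal(size=(400, 16))   # 接近坍缩
spread = means[labels] + 1.0 * rng.normal(size=(400, 16))       # 远离坍缩
print(nc1_metric(collapsed, labels), nc1_metric(spread, labels))
```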
[LG-7] Predicting The Cop Number Using Machine Learning
链接: https://arxiv.org/abs/2602.16600
作者: Meagan Mann,Christian Muise,Erin Meger
类目: Machine Learning (cs.LG)
*备注: 8 pages
Abstract:Cops and Robbers is a pursuit evasion game played on a graph, first introduced independently by Quilliot \cite{quilliot1978jeux} and Nowakowski and Winkler \cite{NOWAKOWSKI1983235} over four decades ago. A main interest in the recent literature is identifying the cop number of graph families. The cop number of a graph, c(G), is defined as the minimum number of cops required to guarantee capture of the robber. Determining the cop number is computationally difficult and exact algorithms for this are typically restricted to small graph families. This paper investigates whether classical machine learning methods and graph neural networks can accurately predict a graph’s cop number from its structural properties and identify which properties most strongly influence this prediction. Of the classical machine learning models, tree-based models achieve high accuracy in prediction despite class imbalance, whereas graph neural networks achieve comparable results without explicit feature engineering. The interpretability analysis shows that the most predictive features are related to node connectivity, clustering, clique structure, and width parameters, which aligns with known theoretical results. Our findings suggest that machine learning approaches can be used in complement with existing cop number algorithms by offering scalable approximations where computation is infeasible.
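其思路大致是:从图中抽取连通度、聚类、团结构等结构特征,再用树模型预测 cop number。下面是一个 networkx + scikit-learn 的玩具示例(特征集与训练集均为笔者假设;标签取自已知结论,如彼得森图 c(G)=3、n≥4 的圈 c(G)=2、完全图与路是 cop-win):

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 与文中可解释性结论对应的结构特征:连通度、聚类、团结构等(具体特征集为假设)
def graph_features(G):
    return [
        G.number_of_nodes(),
        G.number_of_edges(),
        nx.average_clustering(G),
        nx.node_connectivity(G),
        max(len(c) for c in nx.find_cliques(G)),   # 最大团大小
        min(d for _, d in G.degree()),             # 最小度
    ]

graphs = [nx.petersen_graph(), nx.complete_graph(6), nx.cycle_graph(8), nx.path_graph(9)]
cop_numbers = [3, 1, 2, 1]                         # 已知的 cop number 标签
X = np.array([graph_features(g) for g in graphs])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, cop_numbers)
print(clf.predict([graph_features(nx.cycle_graph(12))]))   # 期望预测为 2
```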
[LG-8] Sequential Membership Inference Attacks
链接: https://arxiv.org/abs/2602.16596
作者: Thomas Michel,Debabrota Basu,Emilie Kaufmann
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 27 pages, 10 figures
Abstract:Modern AI models are not static. They go through multiple updates in their lifecycles. Thus, exploiting the model dynamics to create stronger Membership Inference (MI) attacks and tighter privacy audits are timely questions. Though the literature empirically shows that using a sequence of model updates can increase the power of MI attacks, rigorous analysis of the 'optimal' MI attacks is limited to static models with infinite samples. Hence, we develop an 'optimal' MI attack, SeMI*, that uses the sequence of model updates to identify the presence of a target inserted at a certain update step. For the empirical mean computation, we derive the optimal power of SeMI*, while accessing a finite number of samples with or without privacy. Our results retrieve the existing asymptotic analysis. We observe that having access to the model sequence avoids the dilution of MI signals unlike the existing attacks on the final model, where the MI signal vanishes as training data accumulates. Furthermore, an adversary can use SeMI* to tune both the insertion time and the canary to yield tighter privacy audits. Finally, we conduct experiments across data distributions and models trained or fine-tuned with DP-SGD demonstrating that practical variants of SeMI* lead to tighter privacy audits than the baselines.
[LG-9] MoDE-Boost: Boosting Shared Mobility Demand with Edge-Ready Prediction Models
链接: https://arxiv.org/abs/2602.16573
作者: Antonios Tziorvas,George S. Theodoropoulos,Yannis Theodoridis
类目: Machine Learning (cs.LG)
*备注: 25 pages
Abstract:Urban demand forecasting plays a critical role in optimizing routing, dispatching, and congestion management within Intelligent Transportation Systems. By leveraging data fusion and analytics techniques, traffic demand forecasting serves as a key intermediate measure for identifying emerging spatial and temporal demand patterns. In this paper, we tackle this challenge by proposing two gradient boosting model variations, one for classification and one for regression, both capable of generating demand forecasts at various temporal horizons, from 5 minutes up to one hour. Our overall approach effectively integrates temporal and contextual features, enabling accurate predictions that are essential for improving the efficiency of shared (micro-) mobility services. To evaluate its effectiveness, we utilize open shared mobility data derived from e-scooter and e-bike networks in five metropolitan areas. These real-world datasets allow us to compare our approach with state-of-the-art methods as well as a Generative AI-based model, demonstrating its effectiveness in capturing the complexities of modern urban mobility. Ultimately, our methodology offers novel insights on urban micro-mobility management, helping to tackle the challenges arising from rapid urbanization and thus, contributing to more sustainable, efficient, and livable cities.
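摘要所述"时间 + 上下文特征 + 梯度提升"的回归流程,可用如下草图体会(数据是人造的日内周期泊松需求,聚合窗口与滞后阶数均为笔者假设):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
ts = pd.date_range('2026-01-01', periods=20000, freq='5min')
lam = 1.5 + 1.2 * np.sin(2 * np.pi * ts.hour / 24)        # 人造的日内需求周期
df = pd.DataFrame({'ts': ts, 'rides': rng.poisson(lam)})

agg = df.set_index('ts')['rides'].resample('15min').sum().to_frame('demand')
agg['hour'] = agg.index.hour                               # 时间特征
agg['dow'] = agg.index.dayofweek
for lag in (1, 2, 4, 96):                                  # 过去 15/30/60 分钟与前一天同时段
    agg[f'lag_{lag}'] = agg['demand'].shift(lag)
agg = agg.dropna()

X, y = agg.drop(columns='demand'), agg['demand']
model = GradientBoostingRegressor().fit(X[:-500], y[:-500])
print('留出集 R^2:', round(model.score(X[-500:], y[-500:]), 3))
```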
[LG-10] Steering diffusion models with quadratic rewards: a fine-grained analysis
链接: https://arxiv.org/abs/2602.16570
作者: Ankur Moitra,Andrej Risteski,Dhruv Rohatgi
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:Inference-time algorithms are an emerging paradigm in which pre-trained models are used as subroutines to solve downstream tasks. Such algorithms have been proposed for tasks ranging from inverse problems and guided image generation to reasoning. However, the methods currently deployed in practice are heuristics with a variety of failure modes – and we have very little understanding of when these heuristics can be efficiently improved. In this paper, we consider the task of sampling from a reward-tilted diffusion model – that is, sampling from p^\star(x) \propto p(x) \exp(r(x)) – given a reward function r and pre-trained diffusion oracle for p . We provide a fine-grained analysis of the computational tractability of this task for quadratic rewards r(x) = x^\top A x + b^\top x . We show that linear-reward tilts are always efficiently sampleable – a simple result that seems to have gone unnoticed in the literature. We use this as a building block, along with a conceptually new ingredient – the Hubbard-Stratonovich transform – to provide an efficient algorithm for sampling from low-rank positive-definite quadratic tilts, i.e. r(x) = x^\top A x where A is positive-definite and of rank O(1) . For negative-definite tilts, i.e. r(x) = - x^\top A x where A is positive-definite, we prove that the problem is intractable even if A is of rank 1 (albeit with exponentially-large entries).
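"线性倾斜总是易采样"可以用高斯情形直观理解(笔者补充的一行推导,非论文原文):若 p = \mathcal{N}(\mu, \Sigma),则

p(x)\,e^{b^\top x} \propto \exp\big(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu) + b^\top x\big) \propto \mathcal{N}(x;\ \mu + \Sigma b,\ \Sigma),

即线性倾斜只平移均值、不改变协方差;扩散采样的反向条件分布近似高斯,这为线性倾斜的高效采样提供了直觉。论文中的一般性论证以及低秩正定二次倾斜的算法(基于 Hubbard-Stratonovich 变换)以原文为准。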
[LG-11] A Scalable Approach to Solving Simulation-Based Network Security Games
链接: https://arxiv.org/abs/2602.16564
作者: Michael Lanier,Yevgeniy Vorobeychik
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:We introduce MetaDOAR, a lightweight meta-controller that augments the Double Oracle / PSRO paradigm with a learned, partition-aware filtering layer and Q-value caching to enable scalable multi-agent reinforcement learning on very large cyber-network environments. MetaDOAR learns a compact state projection from per-node structural embeddings to rapidly score and select a small subset of devices (a top-k partition) on which a conventional low-level actor performs focused beam search utilizing a critic agent. Selected candidate actions are evaluated with batched critic forwards and stored in an LRU cache keyed by a quantized state projection and local action identifiers, dramatically reducing redundant critic computation while preserving decision quality via conservative k-hop cache invalidation. Empirically, MetaDOAR attains higher player payoffs than SOTA baselines on large network topologies, without significant scaling issues in terms of memory usage or training time. This contribution provides a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems.
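"以量化后的状态投影为键缓存 Q 值"的机制可以这样实现(示意,非 MetaDOAR 原实现;量化步长与容量均为假设,也未包含论文中保守的 k-hop 缓存失效逻辑):

```python
from collections import OrderedDict
import numpy as np

class QValueCache:
    """键 = (量化后的状态投影, 动作 id) 的 LRU 缓存,避免重复的 critic 前向。"""
    def __init__(self, capacity=10000, quant=0.1):
        self.cache, self.capacity, self.quant = OrderedDict(), capacity, quant

    def _key(self, state_proj, action_id):
        q = tuple(np.round(state_proj / self.quant).astype(int))  # 量化使相近状态共用条目
        return (q, action_id)

    def get(self, state_proj, action_id):
        k = self._key(state_proj, action_id)
        if k in self.cache:
            self.cache.move_to_end(k)            # 命中后移到队尾,维持 LRU 顺序
            return self.cache[k]
        return None

    def put(self, state_proj, action_id, q_value):
        k = self._key(state_proj, action_id)
        self.cache[k] = q_value
        self.cache.move_to_end(k)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)       # 淘汰最久未用的条目
```

量化步长控制着"命中率 vs 陈旧值风险"的权衡:步长越大,缓存命中越多,但返回的 Q 值与真实状态的偏差也越大。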
[LG-12] Illustration of Barren Plateaus in Quantum Computing
链接: https://arxiv.org/abs/2602.16558
作者: Gerhard Stenzel,Tobias Rohe,Michael Kölle,Leo Sünkel,Jonas Stein,Claudia Linnhoff-Popien
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: Extended version of a short paper to be published at ICAART-QAIO 2026
Abstract:Variational Quantum Circuits (VQCs) have emerged as a promising paradigm for quantum machine learning in the NISQ era. While parameter sharing in VQCs can reduce the parameter space dimensionality and potentially mitigate the barren plateau phenomenon, it introduces a complex trade-off that has been largely overlooked. This paper investigates how parameter sharing, despite creating better global optima with fewer parameters, fundamentally alters the optimization landscape through deceptive gradients – regions where gradient information exists but systematically misleads optimizers away from global optima. Through systematic experimental analysis, we demonstrate that increasing degrees of parameter sharing generate more complex solution landscapes with heightened gradient magnitudes and measurably higher deceptiveness ratios. Our findings reveal that traditional gradient-based optimizers (Adam, SGD) show progressively degraded convergence as parameter sharing increases, with performance heavily dependent on hyperparameter selection. We introduce a novel gradient deceptiveness detection algorithm and a quantitative framework for measuring optimization difficulty in quantum circuits, establishing that while parameter sharing can improve circuit expressivity by orders of magnitude, this comes at the cost of significantly increased landscape deceptiveness. These insights provide important considerations for quantum circuit design in practical applications, highlighting the fundamental mismatch between classical optimization strategies and quantum parameter landscapes shaped by parameter sharing.
[LG-13] RIDER: 3D RNA Inverse Design with Reinforcement Learning-Guided Diffusion ICLR2026
链接: https://arxiv.org/abs/2602.16548
作者: Tianmeng Hu,Yongzheng Cui,Biao Luo,Ke Li
类目: Machine Learning (cs.LG)
*备注: Accepted as a conference paper at ICLR 2026
Abstract:The inverse design of RNA three-dimensional (3D) structures is crucial for engineering functional RNAs in synthetic biology and therapeutics. While recent deep learning approaches have advanced this field, they are typically optimized and evaluated using native sequence recovery, which is a limited surrogate for structural fidelity, since different sequences can fold into similar 3D structures and high recovery does not necessarily indicate correct folding. To address this limitation, we propose RIDER, an RNA Inverse DEsign framework with Reinforcement learning that directly optimizes for 3D structural similarity. First, we develop and pre-train a GNN-based generative diffusion model conditioned on the target 3D structure, achieving a 9% improvement in native sequence recovery over state-of-the-art methods. Then, we fine-tune the model with an improved policy gradient algorithm using four task-specific reward functions based on 3D self-consistency metrics. Experimental results show that RIDER improves structural similarity by over 100% across all metrics and discovers designs that are distinct from native sequences.
[LG-14] Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning
链接: https://arxiv.org/abs/2602.16543
作者: Jialiang Fan,Shixiong Jiang,Mengyu Liu,Fanxin Kong
类目: Machine Learning (cs.LG)
*备注: 12 pages, 6 figures, supplementary material included
Abstract:Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy’s gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy’s internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
[LG-15] ransfer Learning of Linear Regression with Multiple Pretrained Models: Benefiting from More Pretrained Models via Overparameterization Debiasing
链接: https://arxiv.org/abs/2602.16531
作者: Daniel Boharon,Yehuda Dar
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study transfer learning for a linear regression task using several least-squares pretrained models that can be overparameterized. We formulate the target learning task as optimization that minimizes squared errors on the target dataset with a penalty on the distance of the learned model from the pretrained models. We analytically formulate the test error of the learned target model and provide the corresponding empirical evaluations. Our results elucidate when using more pretrained models can improve transfer learning. Specifically, if the pretrained models are overparameterized, using sufficiently many of them is important for beneficial transfer learning. However, the learning may be compromised by the overparameterization bias of pretrained models, i.e., the minimum \ell_2-norm solution’s restriction to a small subspace spanned by the training examples in the high-dimensional parameter space. We propose a simple debiasing via a multiplicative correction factor that can reduce the overparameterization bias and leverage more pretrained models to learn a target predictor.
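文中的目标任务优化问题(对到各预训练模型距离加罚的最小二乘)有闭式解,可直接写出(示意;论文的乘性去偏校正因子此处未实现,以原文为准):

```python
import numpy as np

def transfer_ridge(X, y, pretrained, lam=1.0):
    """闭式求解 min_w ||y - Xw||^2 + lam * Σ_m ||w - w_m||^2。"""
    W = np.stack(pretrained)                      # (M, d):M 个预训练模型
    M, d = W.shape
    A = X.T @ X + lam * M * np.eye(d)
    b = X.T @ y + lam * W.sum(axis=0)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
d, n = 50, 20                                     # 过参数化:维度大于目标样本数
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d)); y = X @ w_star
# 假设的预训练模型:真实参数附近的噪声版本;模型越多,各自的偏差越容易被平均掉
pretrained = [w_star + rng.normal(scale=0.5, size=d) for _ in range(10)]
w_hat = transfer_ridge(X, y, pretrained, lam=0.5)
print("相对参数误差:", np.linalg.norm(w_hat - w_star) / np.linalg.norm(w_star))
```

对一阶条件 2X^T(Xw - y) + 2λΣ_m(w - w_m) = 0 整理即得上面的线性方程组,可见多个预训练模型是通过其和(等价地,均值)进入解的。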
[LG-16] FEKAN: Feature-Enriched Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2602.16530
作者: Sidharth S. Menon,Ameya D. Jagtap
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注: 45 pages, 45 figures
Abstract:Kolmogorov-Arnold Networks (KANs) have recently emerged as a compelling alternative to multilayer perceptrons, offering enhanced interpretability via functional decomposition. However, existing KAN architectures, including spline-, wavelet-, radial-basis variants, etc., suffer from high computational cost and slow convergence, limiting scalability and practical applicability. Here, we introduce Feature-Enriched Kolmogorov-Arnold Networks (FEKAN), a simple yet effective extension that preserves all the advantages of KAN while improving computational efficiency and predictive accuracy through feature enrichment, without increasing the number of trainable parameters. By incorporating these additional features, FEKAN accelerates convergence, increases representation capacity, and substantially mitigates the computational overhead characteristic of state-of-the-art KAN architectures. We investigate FEKAN across a comprehensive set of benchmarks, including function-approximation tasks, physics-informed formulations for diverse partial differential equations (PDEs), and neural operator settings that map between input and output function spaces. For function approximation, we systematically compare FEKAN against a broad family of KAN variants, FastKAN, WavKAN, ReLUKAN, HRKAN, ChebyshevKAN, RBFKAN, and the original SplineKAN. Across all tasks, FEKAN demonstrates substantially faster convergence and consistently higher approximation accuracy than the underlying baseline architectures. We also establish the theoretical foundations for FEKAN, showing its superior representation capacity compared to KAN, which contributes to improved accuracy and efficiency.
[LG-17] Capacity-constrained demand response in smart grids using deep reinforcement learning
链接: https://arxiv.org/abs/2602.16525
作者: Shafagh Abband Pashaki,Sepehr Maleki,Amir Badiee
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper presents a capacity-constrained incentive-based demand response approach for residential smart grids. It aims to maintain electricity grid capacity limits and prevent congestion by financially incentivising end users to reduce or shift their energy consumption. The proposed framework adopts a hierarchical architecture in which a service provider adjusts hourly incentive rates based on wholesale electricity prices and aggregated residential load. The financial interests of both the service provider and end users are explicitly considered. A deep reinforcement learning approach is employed to learn optimal real-time incentive rates under explicit capacity constraints. Heterogeneous user preferences are modelled through appliance-level home energy management systems and dissatisfaction costs. Using real-world residential electricity consumption and price data from three households, simulation results show that the proposed approach effectively reduces peak demand and smooths the aggregated load profile. This leads to an approximately 22.82% reduction in the peak-to-average ratio compared to the no-demand-response case.
[LG-18] Reinforcement Learning for Parameterized Quantum State Preparation: A Comparative Study
链接: https://arxiv.org/abs/2602.16523
作者: Gerhard Stenzel,Isabella Debelic,Michael Kölle,Tobias Rohe,Leo Sünkel,Julian Hager,Claudia Linnhoff-Popien
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: Extended version of a short paper to be published at ICAART 2026
Abstract:We extend directed quantum circuit synthesis (DQCS) with reinforcement learning from purely discrete gate selection to parameterized quantum state preparation with continuous single-qubit rotations R_x, R_y, and R_z. We compare two training regimes: a one-stage agent that jointly selects the gate type, the affected qubit(s), and the rotation angle; and a two-stage variant that first proposes a discrete circuit and subsequently optimizes the rotation angles with Adam using parameter-shift gradients. Using Gymnasium and PennyLane, we evaluate Proximal Policy Optimization (PPO) and Advantage Actor–Critic (A2C) on systems comprising two to ten qubits and on targets of increasing complexity with \lambda ranging from one to five. Whereas A2C does not learn effective policies in this setting, PPO succeeds under stable hyperparameters (one-stage: learning rate approximately 5\times10^{-4} with a self-fidelity-error threshold of 0.01; two-stage: learning rate approximately 10^{-4}). Both approaches reliably reconstruct computational basis states (between 83% and 99% success) and Bell states (between 61% and 77% success). However, scalability saturates for \lambda of approximately three to four and does not extend to ten-qubit targets even at \lambda=2. The two-stage method offers only marginal accuracy gains while requiring around three times the runtime. For practicality under a fixed compute budget, we therefore recommend the one-stage PPO policy, provide explicit synthesized circuits, and contrast with a classical variational baseline to outline avenues for improved scalability.
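两阶段方案中"先离散结构、后角度优化"的分工可以用 PennyLane 这样草绘(电路结构为笔者假设;论文第二阶段用 parameter-shift 梯度 + Adam,这里为简洁起见用随机搜索代替):

```python
import numpy as np
import pennylane as qml

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def circuit(params):
    # 假设智能体已选出的离散结构:两个单比特旋转 + 一个 CNOT
    qml.RY(params[0], wires=0)
    qml.RZ(params[1], wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.state()

target = np.array([1, 0, 0, 1]) / np.sqrt(2)         # Bell 态 |Φ+>
def infidelity(params):
    return 1 - abs(np.vdot(target, circuit(params))) ** 2

candidates = np.random.uniform(0, 2 * np.pi, (500, 2))
best = min(candidates, key=infidelity)                # 第二阶段:只优化连续角度
print("自保真度误差:", round(float(infidelity(best)), 4))  # 最优角度为 (π/2, 0)
```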
[LG-19] Small molecule retrieval from tandem mass spectrometry: what are we optimizing for?
链接: https://arxiv.org/abs/2602.16507
作者: Gaetan De Waele,Marek Wydmuch,Krzysztof Dembczyński,Wojciech Kotłowski,Willem Waegeman
类目: Machine Learning (cs.LG)
*备注:
Abstract:One of the central challenges in the computational analysis of liquid chromatography-tandem mass spectrometry (LC-MS/MS) data is to identify the compounds underlying the output spectra. In recent years, this problem has increasingly been tackled using deep learning methods. A common strategy involves predicting a molecular fingerprint vector from an input mass spectrum, which is then used to search for matches in a chemical compound database. While various loss functions are employed in training these predictive models, their impact on model performance remains poorly understood. In this study, we investigate commonly used loss functions, deriving novel regret bounds that characterize when Bayes-optimal decisions for these objectives must diverge. Our results reveal a fundamental trade-off between the two objectives of (1) fingerprint similarity and (2) molecular retrieval. Optimizing for more accurate fingerprint predictions typically worsens retrieval results, and vice versa. Our theoretical analysis shows this trade-off depends on the similarity structure of candidate sets, providing guidance for loss function and fingerprint selection.
[LG-20] Synthesis and Verification of Transformer Programs
链接: https://arxiv.org/abs/2602.16473
作者: Hongjian Jiang,Matthew Hague,Philipp Rümmer,Anthony Widjaja Lin
类目: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
*备注:
Abstract:C-RASP is a simple programming language that was recently shown to capture concepts expressible by transformers. In this paper, we develop new algorithmic techniques for automatically verifying C-RASPs. To this end, we establish a connection to the verification of synchronous dataflow programs in Lustre, which enables us to exploit state-of-the-art model checkers utilizing highly optimized SMT-solvers. Our second contribution addresses learning a C-RASP program in the first place. To this end, we provide a new algorithm for learning a C-RASP from examples using local search. We demonstrate the efficacy of our implementation on benchmarks of C-RASPs from the literature, in particular in connection with the following applications: (1) transformer program optimization, and (2) constrained learning of transformer programs (based on a partial specification).
[LG-21] HPMixer: Hierarchical Patching for Multivariate Time Series Forecasting PAKDD2026
链接: https://arxiv.org/abs/2602.16468
作者: Jung Min Choi,Vijaya Krishna Yalavarthi,Lars Schmidt-Thieme
类目: Machine Learning (cs.LG)
*备注: 18 pages, 5 figures, 5 tables, PAKDD 2026
Abstract:In long-term multivariate time series forecasting, effectively capturing both periodic patterns and residual dynamics is essential. To address this within standard deep learning benchmark settings, we propose the Hierarchical Patching Mixer (HPMixer), which models periodicity and residuals in a decoupled yet complementary manner. The periodic component utilizes a learnable cycle module [7] enhanced with a nonlinear channel-wise MLP for greater expressiveness. The residual component is processed through a Learnable Stationary Wavelet Transform (LSWT) to extract stable, shift-invariant frequency-domain representations. Subsequently, a channel-mixing encoder models explicit inter-channel dependencies, while a two-level non-overlapping hierarchical patching mechanism captures coarse- and fine-scale residual variations. By integrating decoupled periodicity modeling with structured, multi-scale residual learning, HPMixer provides an effective framework. Extensive experiments on standard multivariate benchmarks demonstrate that HPMixer achieves competitive or state-of-the-art performance compared to recent baselines.
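两级不重叠 patch 划分本质上是两次整形(reshape)加池化,示意如下(序列长度与各级 patch 大小均为笔者假设,仅展示张量形状的变换):

```python
import torch

x = torch.randn(32, 96)                        # (batch, 去周期后的残差序列)
coarse = x.reshape(32, 8, 12)                  # 一级:8 个不重叠粗 patch,各长 12
fine = coarse.reshape(32, 8, 4, 3)             # 二级:每个粗 patch 再分 4 个细 patch,各长 3
coarse_tokens = coarse.mean(dim=-1)            # 粗尺度表示,捕捉大尺度残差变化
fine_tokens = fine.mean(dim=-1).flatten(1)     # 细尺度表示,捕捉局部细节
print(coarse_tokens.shape, fine_tokens.shape)  # (32, 8) (32, 32),交给后续 MLP 混合
```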
[LG-22] Beyond SGD Without SVD: Proximal Subspace Iteration LoRA with Diagonal Fractional K-FAC
链接: https://arxiv.org/abs/2602.16456
作者: Abdulla Jasem Almansoori,Maria Ivanova,Andrey Veprikov,Aleksandr Beznosikov,Samuel Horváth,Martin Takáč
类目: Machine Learning (cs.LG)
*备注: 20 pages, 5 figures, 4 tables
Abstract:Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights, dramatically reducing trainable parameters and memory. In this work, we address the gap between training with full steps with low-rank projections (SVDLoRA) and LoRA fine-tuning. We propose LoRSum, a memory-efficient subroutine that closes this gap for gradient descent by casting LoRA optimization as a proximal sub-problem and solving it efficiently with alternating least squares updates, which we prove to be an implicit block power method. We recover several recently proposed preconditioning methods for LoRA as special cases, and show that LoRSum can also be used for updating a low-rank momentum. In order to address full steps with preconditioned gradient descent, we propose a scaled variant of LoRSum that uses structured metrics such as K-FAC and Shampoo, and we show that storing the diagonal of these metrics still allows them to perform well while remaining memory-efficient. Experiments on a synthetic task, CIFAR-100, and language-model fine-tuning on GLUE, SQuAD v2, and WikiText-103, show that our method can match or improve LoRA baselines given modest compute overhead, while avoiding full-matrix SVD projections and retaining LoRA-style parameter efficiency.
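LoRSum 的核心子程序是用交替最小二乘(ALS)逼近低秩分解,摘要指出这等价于一种隐式块幂法,从而无需显式 SVD。下面对一个随机"梯度矩阵"演示 ALS 与截断 SVD 给出接近的逼近误差(示意,非论文原实现):

```python
import numpy as np

def als_lowrank(G, r, n_iter=20, seed=0):
    """交替最小二乘求 min_{B,A} ||G - B A||_F^2。"""
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(G.shape[0], r))
    for _ in range(n_iter):
        A = np.linalg.lstsq(B, G, rcond=None)[0]          # 固定 B,最小二乘解 A
        B = np.linalg.lstsq(A.T, G.T, rcond=None)[0].T    # 固定 A,最小二乘解 B
    return B, A

G = np.random.default_rng(1).normal(size=(256, 128))      # 假设:待投影的全参数梯度步
B, A = als_lowrank(G, r=8)
U, s, Vt = np.linalg.svd(G, full_matrices=False)
print("ALS 逼近误差:", np.linalg.norm(G - B @ A))
print("SVD 截断误差:", np.linalg.norm(G - (U[:, :8] * s[:8]) @ Vt[:8]))
```

每步只解小规模最小二乘,代价远低于对大矩阵做 SVD,这正是"Beyond SGD Without SVD"的含义所在。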
[LG-23] Learning with Locally Private Examples by Inverse Weierstrass Private Stochastic Gradient Descent
链接: https://arxiv.org/abs/2602.16436
作者: Jean Dufraiche,Paul Mangold,Michaël Perrot,Marc Tommasi
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: 30 pages, 8 figures
Abstract:Releasing data once and for all under noninteractive Local Differential Privacy (LDP) enables complete data reusability, but the resulting noise may create bias in subsequent analyses. In this work, we leverage the Weierstrass transform to characterize this bias in binary classification. We prove that inverting this transform leads to a bias-correction method to compute unbiased estimates of nonlinear functions on examples released under LDP. We then build a novel stochastic gradient descent algorithm called Inverse Weierstrass Private SGD (IWP-SGD). It converges to the true population risk minimizer at a rate of \mathcal{O}(1/n), with n the number of examples. We empirically validate IWP-SGD on binary classification tasks using synthetic and real-world datasets.
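作为背景,Weierstrass 变换的标准定义是与固定宽度高斯核的卷积(论文实际采用的噪声尺度可能不同):

W[f](x) = \frac{1}{\sqrt{4\pi}} \int_{-\infty}^{\infty} f(y)\, e^{-(x-y)^2/4}\, dy.

直观地说,LDP 机制给样本加高斯噪声后,非线性函数在噪声样本上的期望正是其原值的一次 Weierstrass 型平滑;按摘要所述,反演该变换便能在噪声样本上构造非线性函数的无偏估计,IWP-SGD 即建立在这一校正之上。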
[LG-24] Easy Data Unlearning Bench ICML2025
链接: https://arxiv.org/abs/2602.16400
作者: Roy Rinberg,Pol Puigdemont,Martin Pawelczyk,Volkan Cevher
类目: Machine Learning (cs.LG)
*备注: ICML 2025 Workshop on Machine Unlearning for Generative AI
Abstract:Evaluating machine unlearning methods remains technically challenging, with recent benchmarks requiring complex setups and significant engineering overhead. We introduce a unified and extensible benchmarking suite that simplifies the evaluation of unlearning algorithms using the KLoM (KL divergence of Margins) metric. Our framework provides precomputed model ensembles, oracle outputs, and streamlined infrastructure for running evaluations out of the box. By standardizing setup and metrics, it enables reproducible, scalable, and fair comparison across unlearning methods. We aim for this benchmark to serve as a practical foundation for accelerating research and promoting best practices in machine unlearning. Our code and data are publicly available.
[LG-25] Improved Bounds for Reward-Agnostic and Reward-Free Exploration
链接: https://arxiv.org/abs/2602.16363
作者: Oran Ridel,Alon Cohen
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable \epsilon-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets \epsilon-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter \epsilon. We propose a new algorithm that significantly relaxes the requirement on \epsilon. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an \epsilon-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.
[LG-26] Optical Inversion and Spectral Unmixing of Spectroscopic Photoacoustic Images with Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2602.16357
作者: Sarkis Ter Martirosyan,Xinyue Huang,David Qin,Anthony Yu,Stanislav Emelianov
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Accurate estimation of the relative concentrations of chromophores in a spectroscopic photoacoustic (sPA) image can reveal immense structural, functional, and molecular information about physiological processes. However, due to nonlinearities and ill-posedness inherent to sPA imaging, concentration estimation is intractable. The Spectroscopic Photoacoustic Optical Inversion Autoencoder (SPOI-AE) aims to address the sPA optical inversion and spectral unmixing problems without assuming linearity. Herein, SPOI-AE was trained and tested on \textit{in vivo} mouse lymph node sPA images with unknown ground truth chromophore concentrations. SPOI-AE better reconstructs input sPA pixels than conventional algorithms while providing biologically coherent estimates for optical parameters, chromophore concentrations, and the percent oxygen saturation of tissue. SPOI-AE’s unmixing accuracy was validated using a simulated mouse lymph node phantom ground truth.
[LG-27] How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection ICASSP2026
链接: https://arxiv.org/abs/2602.16343
作者: Yixuan Xiao,Florian Lux,Alejandro Pérez-González-de-Martos,Ngoc Thang Vu
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted to ICASSP 2026
Abstract:Since Text-to-Speech systems typically don’t produce waveforms directly, recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker. Unlike vocoders, which are specifically designed for speech synthesis, neural audio codecs were originally developed for compressing audio for storage and transmission. However, their ability to discretize speech also sparked interest in language-modeling-based speech synthesis. Owing to this dual functionality, codec resynthesized data may be labeled as either bonafide or spoof. So far, very little research has addressed this issue. In this study, we present a challenging extension of the ASVspoof 5 dataset constructed for this purpose. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.
[LG-28] Explainability for Fault Detection System in Chemical Processes
链接: https://arxiv.org/abs/2602.16341
作者: Georgios Gravanis,Dimitrios Kyriakou,Spyros Voutetakis,Simira Papadopoulou,Konstantinos Diamantaras
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this work, we apply and compare two state-of-the-art eXplainable Artificial Intelligence (XAI) methods, Integrated Gradients (IG) and SHapley Additive exPlanations (SHAP), which explain the fault diagnosis decisions of a highly accurate Long Short-Term Memory (LSTM) classifier. The classifier is trained to detect faults in a benchmark non-linear chemical process, the Tennessee Eastman Process (TEP). It is highlighted how XAI methods can help identify the subsystem of the process where the fault occurred. Using our knowledge of the process, we note that in most cases the same features are indicated as the most important for the decision, while in some cases the SHAP method seems to be more informative and closer to the root cause of the fault. Finally, since the used XAI methods are model-agnostic, the proposed approach is not limited to the specific process and can also be used in similar problems.
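Integrated Gradients 的计算本身很短:沿基线到输入的直线路径对梯度取平均,再乘以输入与基线之差。下面给出一个针对 LSTM 分类器的极简实现(模型结构与窗口尺寸 (T, F)=(100, 52) 均为笔者假设,52 对应 TEP 常用的过程变量数):

```python
import torch

def integrated_gradients(model, x, baseline, steps=64):
    """IG 归因:沿基线→输入的直线路径平均梯度,再乘以 (x - baseline)。"""
    alphas = torch.linspace(0, 1, steps).view(-1, 1, 1)
    path = (baseline + alphas * (x - baseline)).requires_grad_(True)  # (steps, T, F)
    model(path).sum().backward()
    return (x - baseline) * path.grad.mean(dim=0)        # 各变量×时间步的归因

# 假设的 LSTM 故障分类器(玩具版,仅为演示归因流程)
class TinyLSTM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(52, 32, batch_first=True)
        self.head = torch.nn.Linear(32, 1)
    def forward(self, x):
        return self.head(self.lstm(x)[0][:, -1])

model = TinyLSTM()
x = torch.randn(100, 52)
attr = integrated_gradients(model, x, torch.zeros_like(x))
print("对该决策贡献最大的过程变量:", int(attr.abs().sum(dim=0).argmax()))
```

把逐变量归因按子系统聚合,就得到了文中"定位故障所在子系统"的用法。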
[LG-29] The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks ICML2026
链接: https://arxiv.org/abs/2602.16340
作者: Eitan Gronich,Gal Vardi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 11 pages, 1 figure (with appendix: 48 pages, 2 figures), under review for ICML 2026
Abstract:We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ( \ell_2 norm), and Signum ( \ell_\infty norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the \ell_\infty margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
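The three optimizer families the abstract discusses correspond to steepest descent under different norms. Below is a minimal sketch of the three normalized update directions; these are standard constructions from the steepest-descent literature, not the paper's code.

```python
import torch

def steepest_descent_update(grad, norm="l2", lr=0.1):
    """Normalized steepest descent direction for a matrix-valued gradient.

    Steepest descent w.r.t. a norm ||.|| moves along argmax_{||d||<=1} <grad, d>;
    the choice of norm determines which margin is implicitly maximized.
    """
    if norm == "l2":          # MomentumGD-style: normalized gradient
        d = grad / (grad.norm() + 1e-12)
    elif norm == "linf":      # Signum/Adam-style: sign direction
        d = grad.sign()
    elif norm == "spectral":  # Muon-style: orthogonalized direction U V^T
        U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
        d = U @ Vh
    else:
        raise ValueError(norm)
    return -lr * d
```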
[LG-30] BAT: Better Audio Transformer Guided by Convex Gated Probing
链接: https://arxiv.org/abs/2602.16305
作者: Houtan Ghaffari,Lukas Rauch,Christoph Scholz,Paul Devos
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential and alters their rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pre-training recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.
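The abstract does not spell out CGP's architecture; the following is a hedged sketch of one natural reading, in which a softmax gate convexly mixes frozen per-layer embeddings before a linear head, so the learned gate exposes where task-relevant information lives. All module names and shapes here are assumptions.

```python
import torch
import torch.nn as nn

class ConvexGatedProbe(nn.Module):
    """Minimal sketch of a probe over frozen SSL layers (assumed design).

    Each of L frozen layers yields a pooled d-dim embedding; a softmax gate
    produces convex weights over layers and a linear head classifies the mix.
    """
    def __init__(self, num_layers: int, dim: int, num_classes: int):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, layer_embs):  # layer_embs: (batch, L, d), detached/frozen
        w = torch.softmax(self.gate_logits, dim=0)        # convex weights over layers
        mixed = torch.einsum("l,bld->bd", w, layer_embs)  # convex combination
        return self.head(mixed)
```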
[LG-31] Fast KV Compaction via Attention Matching
链接: https://arxiv.org/abs/2602.16284
作者: Adam Zweiger,Xinghong Fu,Han Guo,Yoon Kim
类目: Machine Learning (cs.LG)
*备注:
Abstract:Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.
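One way to make "constructing compact keys and values to reproduce attention outputs" concrete is the least-squares value-fitting subproblem below, stated for fixed compact keys. This is an assumed formulation in the spirit of attention matching, not necessarily the paper's exact decomposition.

```python
import torch

def compact_values_by_attention_matching(Q, K, V, K_c):
    """Minimal sketch: given fixed compact keys K_c (m, d), solve for compact
    values V_c that best reproduce original attention outputs over probe
    queries Q (q, d):

        min_{V_c} || softmax(Q K_c^T / sqrt(d)) V_c
                     - softmax(Q K^T / sqrt(d)) V ||_F^2
    """
    d = Q.shape[-1]
    A  = torch.softmax(Q @ K.T   / d**0.5, dim=-1)  # (q, n) original attention
    Ac = torch.softmax(Q @ K_c.T / d**0.5, dim=-1)  # (q, m) compact attention
    target = A @ V                                  # original attention outputs
    return torch.linalg.lstsq(Ac, target).solution  # closed-form value fit (m, d_v)
```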
[LG-32] Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains
链接: https://arxiv.org/abs/2602.16274
作者: Rahul Singh,Siddharth Chandak,Eric Moulines,Vivek S. Borkar,Nicholas Bambos
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed \epsilon_n -Greedy exploration scheme that combines \epsilon_n -greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-\tilde{O}(N^{9/10}). To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our bound is governed by the mixing time and is allowed to converge to one asymptotically.
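A hedged sketch of what a Smoothed epsilon_n-greedy action rule could look like: with a decaying probability, explore via a Boltzmann draw with decaying temperature; otherwise act greedily. The exact schedules below are illustrative, not taken from the paper.

```python
import numpy as np

def smoothed_eps_greedy_action(q_values, n, eps_scale=1.0, temp_scale=1.0, rng=None):
    """Minimal sketch of a Smoothed epsilon_n-Greedy rule (assumed schedules).

    Mixing epsilon-greedy with Boltzmann exploration is what makes the regret
    robust to small suboptimality gaps in the paper's analysis.
    """
    rng = rng or np.random.default_rng()
    eps_n = min(1.0, eps_scale / np.sqrt(n + 1))  # decaying exploration rate
    tau_n = temp_scale / np.log(n + 2)            # decaying temperature
    if rng.random() < eps_n:
        logits = np.asarray(q_values) / tau_n
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return int(rng.choice(len(q_values), p=p))  # Boltzmann exploration step
    return int(np.argmax(q_values))                 # greedy step
```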
[LG-33] Prediction of Major Solar Flares Using Interpretable Class-dependent Reward Framework with Active Region Magnetograms and Domain Knowledge
链接: https://arxiv.org/abs/2602.16264
作者: Zixian Wu,Xuebao Li,Yanfang Zheng,Rui Wang,Shunhuang Zhang,Jinfang Wei,Yongshang Lv,Liang Dong,Zamri Zainal Abidin,Noraisyah Mohamed Shah,Hongwei Ye,Pengchao Yan,Xuefeng Li,Xiaojia Ji,Xusheng Huang,Xiaotian Wang,Honglei Jin
类目: Machine Learning (cs.LG); Solar and Stellar Astrophysics (astro-ph.SR)
*备注: 24 pages, 12 figures
Abstract:In this work, we develop, for the first time, a supervised classification framework with class-dependent rewards (CDR) to predict \geq M flares within 24 hr. We construct multiple datasets, covering knowledge-informed features and line-of-sight (LOS) magnetograms. We also apply three deep learning models (CNN, CNN-BiLSTM, and Transformer) and three CDR counterparts (CDR-CNN, CDR-CNN-BiLSTM, and CDR-Transformer). First, we analyze the importance of LOS magnetic field parameters with the Transformer, then compare its performance using LOS-only, vector-only, and combined magnetic field parameters. Second, we compare flare prediction performance based on CDR models versus deep learning counterparts. Third, we perform sensitivity analysis on reward engineering for CDR models. Fourth, we use the SHAP method for model interpretability. Finally, we conduct a performance comparison between our models and NASA/CCMC. The main findings are: (1) Among LOS feature combinations, R_VALUE and AREA_ACR consistently yield the best results. (2) Transformer achieves better performance with combined LOS and vector magnetic field data than with either alone. (3) Models using knowledge-informed features outperform those using magnetograms. (4) While CNN and CNN-BiLSTM outperform their CDR counterparts on magnetograms, CDR-Transformer is slightly superior to its deep learning counterpart when using knowledge-informed features. Among all models, CDR-Transformer achieves the best performance. (5) The predictive performance of the CDR models is not overly sensitive to the reward choices. (6) Through SHAP analysis, the CDR model tends to regard TOTUSJH as more important, while the Transformer tends to prioritize R_VALUE more. (7) Under identical prediction time and active region (AR) number, the CDR-Transformer shows superior predictive capabilities compared to NASA/CCMC.
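The abstract does not define the CDR objective precisely; one simple reading is a per-class reward-weighted cross-entropy, sketched below with illustrative weights. The paper's actual reward engineering may differ.

```python
import torch
import torch.nn.functional as F

def class_dependent_reward_loss(logits, labels, reward):
    """Minimal sketch of a class-dependent reward (CDR) objective (assumed form):
    cross-entropy reweighted per true class, so that missing a rare >=M flare
    is penalized more than a false alarm.

    reward: per-class weight tensor, e.g. torch.tensor([1.0, 5.0]) for
            [no-flare, flare] (illustrative values only).
    """
    ce = F.cross_entropy(logits, labels, reduction="none")  # per-sample loss
    return (reward[labels] * ce).mean()                     # class-dependent reweighting
```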
[LG-34] Online Prediction of Stochastic Sequences with High Probability Regret Bounds ICLR2026
链接: https://arxiv.org/abs/2602.16236
作者: Matthias Frey,Jonathan H. Manton,Jingge Zhu
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: Accepted for publication at The Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract:We revisit the classical problem of universal prediction of stochastic sequences with a finite time horizon T known to the learner. The question we investigate is whether it is possible to derive vanishing regret bounds that hold with high probability, complementing existing bounds from the literature that hold in expectation. We propose such high-probability bounds, which have a very similar form to the prior expectation bounds. For the case of universal prediction of a stochastic process over a countable alphabet, our bound states a convergence rate of \mathcal{O}(T^{-1/2} \delta^{-1/2}) with probability at least 1-\delta , compared to prior known in-expectation bounds of the order \mathcal{O}(T^{-1/2}) . We also propose an impossibility result which proves that it is not possible to improve the exponent of \delta in a bound of the same form without making additional assumptions.
[LG-35] DistributedEstimator: Distributed Training of Quantum Neural Networks via Circuit Cutting
链接: https://arxiv.org/abs/2602.16233
作者: Prabhjot Singh,Adel N. Toosi,Rajkumar Buyya
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Circuit cutting decomposes a large quantum circuit into a collection of smaller subcircuits. The outputs of these subcircuits are then classically reconstructed to recover the original expectation values. While prior work characterises cutting overhead largely in terms of subcircuit counts and sampling complexity, its end-to-end impact on iterative, estimator-driven training pipelines remains insufficiently measured from a systems perspective. In this paper, we propose a cut-aware estimator execution pipeline that treats circuit cutting as a staged distributed workload and instruments each estimator query into partitioning, subexperiment generation, parallel execution, and classical reconstruction phases. Using logged runtime traces and learning outcomes on two binary classification workloads (Iris and MNIST), we quantify cutting overheads, scaling limits, and sensitivity to injected stragglers, and we evaluate whether accuracy and robustness are preserved under matched training budgets. Our measurements show that cutting introduces substantial end-to-end overheads that grow with the number of cuts, and that reconstruction constitutes a dominant fraction of per-query time, bounding achievable speed-up under increased parallelism. Despite these systems costs, test accuracy and robustness are preserved in the measured regimes, with configuration-dependent improvements observed in some cut settings. These results indicate that practical scaling of circuit cutting for learning workloads hinges on reducing and overlapping reconstruction and on scheduling policies that account for barrier-dominated critical paths.
[LG-36] Factored Latent Action World Models
链接: https://arxiv.org/abs/2602.16229
作者: Zizhao Wang,Chang Shi,Jiaheng Hu,Kevin Rohling,Roberto Martín-Martín,Amy Zhang,Peter Stone
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.
[LG-37] Amortized Predictability-aware Training Framework for Time Series Forecasting and Classification WWW2026
链接: https://arxiv.org/abs/2602.16224
作者: Xu Zhang,Peng Wang,Yichen Li,Wei Wang
类目: Machine Learning (cs.LG)
*备注: Accepted to the proceedings of the ACM Web Conference 2026 (WWW 2026). The code is available at this https URL
Abstract:Time series data are prone to noise in various domains, and training samples may contain low-predictability patterns that deviate from the normal data distribution, leading to training instability or convergence to poor local minima. Therefore, mitigating the adverse effects of low-predictability samples is crucial for time series analysis tasks such as time series forecasting (TSF) and time series classification (TSC). While many deep learning models have achieved promising performance, few consider how to identify and penalize low-predictability samples to improve model performance from the training perspective. To fill this gap, we propose a general Amortized Predictability-aware Training Framework (APTF) for both TSF and TSC. APTF introduces two key designs that enable the model to focus on high-predictability samples while still learning appropriately from low-predictability ones: (i) a Hierarchical Predictability-aware Loss (HPL) that dynamically identifies low-predictability samples and progressively expands their loss penalty as training evolves, and (ii) an amortization model that mitigates predictability estimation errors caused by model bias, further enhancing HPL’s effectiveness. The code is available at this https URL.
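A hedged sketch of a predictability-aware training loss in the spirit of HPL: down-weight samples flagged as low-predictability, with the penalty ramped up as training progresses. The schedule and weighting form below are assumptions; APTF's actual hierarchical loss and amortization model are more involved.

```python
import torch

def predictability_aware_loss(per_sample_loss, predictability, epoch, max_epochs):
    """Minimal sketch (assumed form, not APTF's exact HPL).

    per_sample_loss: (batch,) unreduced losses
    predictability:  (batch,) scores in [0, 1], e.g. from an auxiliary estimator
    """
    ramp = epoch / max_epochs                     # progressive penalty schedule
    weights = (1 - ramp) + ramp * predictability  # early: uniform; late: predictability-driven
    return (weights * per_sample_loss).sum() / weights.sum()
```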
[LG-38] SEMixer: Semantics Enhanced MLP-Mixer for Multiscale Mixing and Long-term Time Series Forecasting WWW2026
链接: https://arxiv.org/abs/2602.16220
作者: Xu Zhang,Qitong Wang,Peng Wang,Wei Wang
类目: Machine Learning (cs.LG)
*备注: Accepted to the proceedings of the ACM Web Conference 2026 (WWW 2026). The code is available at this https URL
Abstract:Modeling multiscale patterns is crucial for long-term time series forecasting (TSF). However, redundancy and noise in time series, together with semantic gaps between non-adjacent scales, make the efficient alignment and integration of multi-scale temporal dependencies challenging. To address this, we propose SEMixer, a lightweight multiscale model designed for long-term TSF. SEMixer features two key components: a Random Attention Mechanism (RAM) and a Multiscale Progressive Mixing Chain (MPMC). RAM captures diverse time-patch interactions during training and aggregates them via dropout ensemble at inference, enhancing patch-level semantics and enabling MLP-Mixer to better model multi-scale dependencies. MPMC further stacks RAM and MLP-Mixer in a memory-efficient manner, achieving more effective temporal mixing. It addresses semantic gaps across scales and facilitates better multiscale modeling and forecasting performance. We not only validate the effectiveness of SEMixer on 10 public datasets, but also on the \textit{2025 CCF AIOps Challenge} based on 21GB of real wireless network data, where SEMixer achieves third place. The code is available at the link this https URL.
[LG-39] Bayesian Quadrature: Gaussian Processes for Integration
链接: https://arxiv.org/abs/2602.16218
作者: Maren Mahsereci,Toni Karvonen
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
Abstract:Bayesian quadrature is a probabilistic, model-based approach to numerical integration, the estimation of intractable integrals, or expectations. Although Bayesian quadrature was popularised already in the 1980s, no systematic and comprehensive treatment has been published. The purpose of this survey is to fill this gap. We review the mathematical foundations of Bayesian quadrature from different points of view; present a systematic taxonomy for classifying different Bayesian quadrature methods along the three axes of modelling, inference, and sampling; collect general theoretical guarantees; and provide a controlled numerical study that explores and illustrates the effect of different choices along the axes of the taxonomy. We also provide a realistic assessment of practical challenges and limitations to application of Bayesian quadrature methods and include an up-to-date and nearly exhaustive bibliography that covers not only machine learning and statistics literature but all areas of mathematics and engineering in which Bayesian quadrature or equivalent methods have seen use.
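For readers new to the topic, vanilla Bayesian quadrature with a GP admits a short closed-form estimator. The sketch below uses an RBF kernel and Monte Carlo kernel means; it illustrates the basic method the survey treats, not any specific branch of its taxonomy, and the hyperparameters are illustrative.

```python
import numpy as np

def bayesian_quadrature(f, sampler, nodes, lengthscale=1.0, jitter=1e-8, n_mc=1000):
    """Minimal sketch of vanilla Bayesian quadrature with an RBF-kernel GP.

    The posterior mean of I = integral f dpi is z^T K^{-1} f(X), where the
    kernel means z_i = integral k(x, x_i) dpi(x) are estimated by Monte Carlo
    from `sampler` (for Gaussian pi and an RBF kernel they also have a
    closed form).
    """
    X = np.atleast_2d(nodes)                      # (n, d) quadrature nodes
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)
    K = k(X, X) + jitter * np.eye(len(X))
    S1, S2 = sampler(n_mc), sampler(n_mc)         # independent draws from pi
    z = k(S1, X).mean(axis=0)                     # MC kernel-mean embedding
    w = np.linalg.solve(K, z)                     # BQ weights
    mean = w @ np.array([f(x) for x in X])
    var = k(S1, S2).mean() - z @ w                # posterior variance of I
    return mean, var
```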
[LG-40] Multi-Class Boundary Extraction from Implicit Representations
链接: https://arxiv.org/abs/2602.16217
作者: Jash Vira,Andrew Myers,Simon Ratcliffe
类目: Machine Learning (cs.LG)
*备注:
Abstract:Surface extraction from implicit neural representations modelling a single-class surface is a well-known task. However, there exist no surface extraction methods for an implicit representation of multiple classes that guarantee topological correctness and no holes. In this work, we lay the groundwork by introducing a 2D boundary extraction algorithm for the multi-class case focusing on topological consistency and water-tightness, which also allows a minimum-detail constraint to be set on the approximation. Finally, we evaluate our algorithm using geological modelling data, showcasing its adaptiveness and ability to honour complex topology.
[LG-41] Linked Data Classification using Neurochaos Learning
链接: https://arxiv.org/abs/2602.16204
作者: Pooja Honna,Ayush Patravali,Nithin Nagaraj,Nanjangud C. Narendra
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neurochaos Learning (NL) has recently shown promise over traditional deep learning due to its two key features: the ability to learn from small-sized training samples, and low compute requirements. In prior work, NL has been implemented and extensively tested on separable and time series data, and has demonstrated superior performance on both classification and regression tasks. In this paper, we investigate the next step for NL, viz., applying NL to linked data, in particular, data that is represented in the form of knowledge graphs. We integrate linked data into NL by implementing node aggregation on knowledge graphs, and then feeding the aggregated node features to the simplest NL architecture: ChaosNet. We demonstrate the results of our implementation on homophilic graph datasets as well as heterophilic graph datasets of varying heterophily. We show better efficacy of our approach on homophilic graphs than on heterophilic graphs. While doing so, we also present our analysis of the results, as well as suggestions for future work.
[LG-42] Training-Free Adaptation of Diffusion Models via Doob’s h-Transform
链接: https://arxiv.org/abs/2602.16198
作者: Qijie Zhu,Zeqi Ye,Han Liu,Zhaoran Wang,Minshuo Chen
类目: Machine Learning (cs.LG)
*备注: 36 pages, 3 figures
Abstract:Adaptation methods have been a workhorse for unlocking the transformative power of pre-trained diffusion models in diverse applications. Existing approaches often abstract adaptation objectives as a reward function and steer diffusion models to generate high-reward samples. However, these approaches can incur high computational overhead due to additional training, or rely on stringent assumptions on the reward such as differentiability. Moreover, despite their empirical success, theoretical justification and guarantees are seldom established. In this paper, we propose DOIT (Doob-Oriented Inference-time Transformation), a training-free and computationally efficient adaptation method that applies to generic, non-differentiable rewards. The key framework underlying our method is a measure transport formulation that seeks to transport the pre-trained generative distribution to a high-reward target distribution. We leverage Doob’s h -transform to realize this transport, which induces a dynamic correction to the diffusion sampling process and enables efficient simulation-based computation without modifying the pre-trained model. Theoretically, we establish a high probability convergence guarantee to the target high-reward distribution via characterizing the approximation error in the dynamic Doob’s correction. Empirically, on D4RL offline RL benchmarks, our method consistently outperforms state-of-the-art baselines while preserving sampling efficiency.
[LG-43] Deep TPC: Temporal-Prior Conditioning for Time Series Forecasting ICASSP2026
链接: https://arxiv.org/abs/2602.16188
作者: Filippos Bellos,NaveenJohn Premkumar,Yannis Avrithis,Nam H. Nguyen,Jason J. Corso
类目: Machine Learning (cs.LG)
*备注: Accepted to ICASSP 2026
Abstract:LLM-for-time series (TS) methods typically treat time shallowly, injecting positional or prompt-based cues once at the input of a largely frozen decoder, which limits temporal reasoning as this information degrades through the layers. We introduce Temporal-Prior Conditioning (TPC), which elevates time to a first-class modality that conditions the model at multiple depths. TPC attaches a small set of learnable time series tokens to the patch stream; at selected layers these tokens cross-attend to temporal embeddings derived from compact, human-readable temporal descriptors encoded by the same frozen LLM, then feed temporal context back via self-attention. This disentangles time series signal and temporal information while maintaining a low parameter budget. We show that by training only the cross-attention modules and explicitly disentangling time series signal and temporal information, TPC consistently outperforms both full fine-tuning and shallow conditioning strategies, achieving state-of-the-art performance in long-term forecasting across diverse datasets. Code available at: this https URL
[LG-44] Multi-Agent Combinatorial-Multi-Armed-Bandit framework for the Submodular Welfare Problem under Bandit Feedback
链接: https://arxiv.org/abs/2602.16183
作者: Subham Pokhriyal,Shweta Jain,Vaneet Aggarwal
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study the \emph{Submodular Welfare Problem} (SWP), where items are partitioned among agents with monotone submodular utilities to maximize the total welfare under \emph{bandit feedback}. Classical SWP assumes full value-oracle access, achieving (1-1/e) approximations via continuous-greedy algorithms. We extend this to a \emph{multi-agent combinatorial bandit framework} (\textsc{MA-CMAB}), where actions are partitions under full-bandit feedback with non-communicating agents. Unlike prior single-agent or separable multi-agent CMAB models, our setting couples agents through shared allocation constraints. We propose an explore-then-commit strategy with randomized assignments, achieving \tilde{\mathcal{O}}(T^{2/3}) regret against a (1-1/e) benchmark, the first such guarantee for the partition-based submodular welfare problem under bandit feedback.
[LG-45] Towards Secure and Scalable Energy Theft Detection: A Federated Learning Approach for Resource-Constrained Smart Meters
链接: https://arxiv.org/abs/2602.16181
作者: Diego Labate,Dipanwita Thakur,Giancarlo Fortino
类目: Machine Learning (cs.LG)
*备注:
Abstract:Energy theft poses a significant threat to the stability and efficiency of smart grids, leading to substantial economic losses and operational challenges. Traditional centralized machine learning approaches for theft detection require aggregating user data, raising serious concerns about privacy and data security. These issues are further exacerbated in smart meter environments, where devices are often resource-constrained and lack the capacity to run heavy models. In this work, we propose a privacy-preserving federated learning framework for energy theft detection that addresses both privacy and computational constraints. Our approach leverages a lightweight multilayer perceptron (MLP) model, suitable for deployment on low-power smart meters, and integrates basic differential privacy (DP) by injecting Gaussian noise into local model updates before aggregation. This ensures formal privacy guarantees without compromising learning performance. We evaluate our framework on a real-world smart meter dataset under both IID and non-IID data distributions. Experimental results demonstrate that our method achieves competitive accuracy, precision, recall, and AUC scores while maintaining privacy and efficiency. This makes the proposed solution practical and scalable for secure energy theft detection in next-generation smart grid infrastructures.
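The core aggregation step the abstract describes (clip each local update, inject Gaussian noise before aggregation, then average) is sketched below with illustrative hyperparameters; the paper's exact clipping rule and privacy accounting may differ.

```python
import numpy as np

def dp_fedavg_round(global_w, client_updates, clip_norm=1.0, noise_mult=1.0, rng=None):
    """Minimal sketch of one DP federated-averaging round (assumed setup,
    flattened weight vectors): each client's delta is norm-clipped and
    perturbed with Gaussian noise *before* aggregation, then the noisy
    updates are averaged into the global MLP weights.
    """
    rng = rng or np.random.default_rng()
    sigma = noise_mult * clip_norm
    noisy = []
    for delta in client_updates:  # delta = local_w - global_w
        scale = min(1.0, clip_norm / (np.linalg.norm(delta) + 1e-12))
        noisy.append(delta * scale + rng.normal(0.0, sigma, size=delta.shape))
    return global_w + np.mean(noisy, axis=0)
```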
[LG-46] Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning
链接: https://arxiv.org/abs/2602.16167
作者: Binghang Lu,Jiahao Zhang,Guang Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Physics-informed neural networks and neural operators often suffer from severe optimization difficulties caused by ill-conditioned gradients, multi-scale spectral behavior, and stiffness induced by physical constraints. Recently, the Muon optimizer has shown promise by performing orthogonalized updates in the singular-vector basis of the gradient, thereby improving geometric conditioning. However, its unit-singular-value updates may lead to overly aggressive steps and lack explicit stability guarantees when applied to physics-informed learning. In this work, we propose SpecMuon, a spectral-aware optimizer that integrates Muon’s orthogonalized geometry with a mode-wise relaxed scalar auxiliary variable (RSAV) mechanism. By decomposing matrix-valued gradients into singular modes and applying RSAV updates individually along dominant spectral directions, SpecMuon adaptively regulates step sizes according to the global loss energy while preserving Muon’s scale-balancing properties. This formulation interprets optimization as a multi-mode gradient flow and enables principled control of stiff spectral components. We establish rigorous theoretical properties of SpecMuon, including a modified energy dissipation law, positivity and boundedness of auxiliary variables, and global convergence with a linear rate under the Polyak-Lojasiewicz condition. Numerical experiments on physics-informed neural networks, DeepONets, and fractional PINN-DeepONets demonstrate that SpecMuon achieves faster convergence and improved stability compared with Adam, AdamW, and the original Muon optimizer on benchmark problems such as the one-dimensional Burgers equation and fractional partial differential equations.
[LG-47] Differentially Private Non-convex Distributionally Robust Optimization
链接: https://arxiv.org/abs/2602.16155
作者: Difei Xu,Meng Ding,Zebin Ma,Huanyi Xie,Youming Tao,Aicha Slaitane,Di Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Real-world deployments routinely face distribution shifts, group imbalances, and adversarial perturbations, under which the traditional Empirical Risk Minimization (ERM) framework can degrade severely. Distributionally Robust Optimization (DRO) addresses this issue by optimizing the worst-case expected loss over an uncertainty set of distributions, offering a principled approach to robustness. Meanwhile, as training data in DRO always involves sensitive information, safeguarding it against leakage under Differential Privacy (DP) is essential. In contrast to classical DP-ERM, DP-DRO has received much less attention due to its minimax optimization structure with uncertainty constraint. To bridge the gap, we provide a comprehensive study of DP-(finite-sum)-DRO with \psi -divergence and non-convex loss. First, we study DRO with general \psi -divergence by reformulating it as a minimization problem, and develop a novel (\varepsilon, \delta) -DP optimization method, called DP Double-Spider, tailored to this structure. Under mild assumptions, we show that it achieves a utility bound of \mathcal{O}(\frac{1}{\sqrt{n}} + (\frac{\sqrt{d \log (1/\delta)}}{n \varepsilon})^{2/3}) in terms of the gradient norm, where n denotes the data size and d denotes the model dimension. We further improve the utility rate for specific divergences. In particular, for DP-DRO with KL-divergence, by transforming the problem into a compositional finite-sum optimization problem, we develop a DP Recursive-Spider method and show that it achieves a utility bound of \mathcal{O}((\frac{\sqrt{d \log(1/\delta)}}{n \varepsilon})^{2/3}) , matching the best-known result for non-convex DP-ERM. Experimentally, we demonstrate that our proposed methods outperform existing approaches for DP minimax optimization. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2602.16155 [cs.LG] (or arXiv:2602.16155v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.16155 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-48] Investigating GNN Convergence on Large Randomly Generated Graphs with Realistic Node Feature Correlations
链接: https://arxiv.org/abs/2602.16145
作者: Mohammed Zain Ali Ahmed
类目: Machine Learning (cs.LG)
*备注: 8 pages, 1 figure
Abstract:There are a number of existing studies analysing the convergence behaviour of graph neural networks on large random graphs. Unfortunately, the majority of these studies do not model correlations between node features, which would naturally exist in a variety of real-life networks. Consequently, the derived limitations of GNNs, resulting from such convergence behaviour, is not truly reflective of the expressive power of GNNs when applied to realistic graphs. In this paper, we will introduce a novel method to generate random graphs that have correlated node features. The node features will be sampled in such a manner to ensure correlation between neighbouring nodes. As motivation for our choice of sampling scheme, we will appeal to properties exhibited by real-life graphs, particularly properties that are captured by the Barabási-Albert model. A theoretical analysis will strongly indicate that convergence can be avoided in some cases, which we will empirically validate on large random graphs generated using our novel method. The observed divergent behaviour provides evidence that GNNs may be more expressive than initial studies would suggest, especially on realistic graphs.
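A hedged sketch of one way to grow a Barabási-Albert graph with neighbor-correlated node features, in the spirit of (but not necessarily identical to) the paper's sampler: each new node's features blend the mean feature of the neighbors it attaches to with fresh Gaussian noise. The blending coefficient rho is an assumption.

```python
import networkx as nx
import numpy as np

def ba_graph_with_correlated_features(n=1000, m=3, dim=8, rho=0.7, seed=0):
    """Minimal sketch (assumed scheme): BA graph whose node features are
    positively correlated between neighbors by construction.
    """
    rng = np.random.default_rng(seed)
    G = nx.barabasi_albert_graph(n, m, seed=seed)
    feats = np.zeros((n, dim))
    for v in range(n):  # networkx BA node ids follow attachment order
        earlier_nbrs = [u for u in G.neighbors(v) if u < v]
        noise = rng.standard_normal(dim)
        if earlier_nbrs:
            # convex blend of attached neighbors' mean feature and noise
            feats[v] = rho * feats[earlier_nbrs].mean(axis=0) + (1 - rho) * noise
        else:
            feats[v] = noise  # seed nodes get pure noise
    return G, feats
```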
[LG-49] On the Power of Source Screening for Learning Shared Feature Extractors
链接: https://arxiv.org/abs/2602.16125
作者: Leo (Muxing) Wang,Connor Mclaughlin,Lili Su
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning with shared representation is widely recognized as an effective way to separate commonalities from heterogeneity across various heterogeneous sources. Most existing work includes all related data sources via simultaneously training a common feature extractor and source-specific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we further dive into the question of which data sources should be learned jointly by focusing on the traditionally deemed ``good’’ collection of sources, in which individual sources have similar relevance and qualities with respect to the true underlying common structure. Towards tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of data is discarded. We formalize the notion of an informative subpopulation, develop algorithms and practical heuristics for identifying such subsets, and validate their effectiveness through both theoretical analysis and empirical evaluations on synthetic and real-world datasets.
[LG-50] Feature-based morphological analysis of shape graph data
链接: https://arxiv.org/abs/2602.16120
作者: Murad Hossen,Demetrio Labate,Nicolas Charon
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:This paper introduces and demonstrates a computational pipeline for the statistical analysis of shape graph datasets, namely geometric networks embedded in 2D or 3D spaces. Unlike traditional abstract graphs, our purpose is not only to retrieve and distinguish variations in the connectivity structure of the data but also geometric differences of the network branches. Our proposed approach relies on the extraction of a specifically curated and explicit set of topological, geometric and directional features, designed to satisfy key invariance properties. We leverage the resulting feature representation for tasks such as group comparison, clustering and classification on cohorts of shape graphs. The effectiveness of this representation is evaluated on several real-world datasets including urban road/street networks, neuronal traces and astrocyte imaging. These results are benchmarked against several alternative methods, both feature-based and not.
[LG-51] Evolutionary Context Search for Automated Skill Acquisition
链接: https://arxiv.org/abs/2602.16113
作者: Qi Sun,Stefan Nielsen,Rio Yokota,Yujin Tang
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models cannot reliably acquire new knowledge post-deployment – even when relevant text resources exist, models fail to transform them into actionable knowledge without retraining. Retrieval-Augmented Generation attempts to bridge this gap by surfacing relevant documents at inference time, yet similarity-based retrieval often fails to identify context that actually improves task performance. We introduce Evolutionary Context Search (ECS), an evolutionary method that searches context combinations using accuracy on a small development set, requiring only inference calls without weight updates. ECS moves beyond semantic similarity to discover non-obvious context pairings that significantly boost performance. Our empirical results show that ECS improves BackendBench by 27% and \tau -bench airline by 7%. The evolved contexts are model-agnostic, as those evolved with Gemini-3-Flash transfer effectively to Claude Sonnet and DeepSeek. This suggests that ECS opens a path toward automated context discovery for skill acquisition – an efficient alternative to manual prompt engineering or costly fine-tuning.
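A minimal sketch of an evolutionary loop over context subsets scored by dev-set accuracy, needing only inference calls and no weight updates. The selection and mutation operators below are generic assumptions; the paper's ECS operators may differ.

```python
import random

def evolutionary_context_search(contexts, score_fn, k=3, pop_size=16,
                                generations=20, mut_rate=0.5, seed=0):
    """Minimal sketch (assumed operators): individuals are k-subsets of
    candidate contexts; fitness is dev-set accuracy via `score_fn`; search
    uses binary-tournament selection with single-element mutation.
    """
    rng = random.Random(seed)
    pop = [tuple(sorted(rng.sample(range(len(contexts)), k))) for _ in range(pop_size)]
    fit = {ind: score_fn([contexts[i] for i in ind]) for ind in pop}
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(pop, 2)                 # binary tournament
            child = list(a if fit[a] >= fit[b] else b)
            if rng.random() < mut_rate:               # swap in one unused context
                new = rng.randrange(len(contexts))
                if new not in child:
                    child[rng.randrange(k)] = new
            children.append(tuple(sorted(child)))
        for c in children:
            if c not in fit:
                fit[c] = score_fn([contexts[i] for i in c])
        pop = sorted(set(pop + children), key=fit.get, reverse=True)[:pop_size]
    best = max(pop, key=fit.get)
    return [contexts[i] for i in best], fit[best]
```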
[LG-52] Axle Sensor Fusion for Online Continual Wheel Fault Detection in Wayside Railway Monitoring
链接: https://arxiv.org/abs/2602.16101
作者: Afonso Lourenço,Francisca Osório,Diogo Risca,Goreti Marreiros
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reliable and cost-effective maintenance is essential for railway safety, particularly at the wheel-rail interface, which is prone to wear and failure. Predictive maintenance frameworks increasingly leverage sensor-generated time-series data, yet traditional methods require manual feature engineering, and deep learning models often degrade in online settings with evolving operational patterns. This work presents a semantic-aware, label-efficient continual learning framework for railway fault diagnostics. Accelerometer signals are encoded via a Variational AutoEncoder into latent representations capturing the normal operational structure in a fully unsupervised manner. Importantly, semantic metadata, including axle counts, wheel indexes, and strain-based deformations, is extracted via AI-driven peak detection on fiber Bragg grating sensors (resistant to electromagnetic interference) and fused with the VAE embeddings, enhancing anomaly detection under unknown operational conditions. A lightweight gradient boosting supervised classifier stabilizes anomaly scoring with minimal labels, while a replay-based continual learning strategy enables adaptation to evolving domains without catastrophic forgetting. Experiments show the model detects minor imperfections due to flats and polygonization, while adapting to evolving operational conditions, such as changes in train type, speed, load, and track profiles, captured using a single accelerometer and strain gauge in wayside monitoring.
[LG-53] Collaborative Zone-Adaptive Zero-Day Intrusion Detection for IoBT
链接: https://arxiv.org/abs/2602.16098
作者: Amirmohammad Pasdar,Shabnam Kasra Kermanshahi,Nour Moustafa,Van-Thuan Pham
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The Internet of Battlefield Things (IoBT) relies on heterogeneous, bandwidth-constrained, and intermittently connected tactical networks that face rapidly evolving cyber threats. In this setting, intrusion detection cannot depend on continuous central collection of raw traffic due to disrupted links, latency, operational security limits, and non-IID traffic across zones. We present Zone-Adaptive Intrusion Detection (ZAID), a collaborative detection and model-improvement framework for unseen attack types, where “zero-day” refers to previously unobserved attack families and behaviours (not vulnerability disclosure timing). ZAID combines a universal convolutional model for generalisable traffic representations, an autoencoder-based reconstruction signal as an auxiliary anomaly score, and lightweight adapter modules for parameter-efficient zone adaptation. To support cross-zone generalisation under constrained connectivity, ZAID uses federated aggregation and pseudo-labelling to leverage locally observed, weakly labelled behaviours. We evaluate ZAID on ToN_IoT using a zero-day protocol that excludes MITM, DDoS, and DoS from supervised training and introduces them during zone-level deployment and adaptation. ZAID achieves up to 83.16% accuracy on unseen attack traffic and transfers to UNSW-NB15 under the same procedure, with a best accuracy of 71.64%. These results indicate that parameter-efficient, zone-personalised collaboration can improve the detection of previously unseen attacks in contested IoBT environments.
[LG-54] The Limits of Long-Context Reasoning in Automated Bug Fixing
链接: https://arxiv.org/abs/2602.16069
作者: Ravi Raju,Mengmeng Ji,Shubhangi Upasani,Bo Li,Urmish Thakker
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 4 pages, under review
Abstract:Rapidly increasing context lengths have led to the assumption that large language models (LLMs) can directly reason over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled experimental setting, we first evaluate state-of-the-art models within an agentic harness (mini-SWE-agent), where performance improves substantially: GPT-5-nano achieves up to a 31% resolve rate on 100 samples, and open-source models such as Deepseek-R1-0528 obtain competitive results. However, token-level analysis shows that successful agentic trajectories typically remain under 20k tokens, and that longer accumulated contexts correlate with lower success rates, indicating that agentic success primarily arises from task decomposition into short-context steps rather than effective long-context reasoning. To directly test long-context capability, we construct a data pipeline where we artificially inflate the context length of the input by placing the relevant files into the context (ensuring perfect retrieval recall); we then study single-shot patch generation under genuinely long contexts (64k-128k tokens). Despite this setup, performance degrades sharply: Qwen3-Coder-30B-A3B achieves only a 7% resolve rate at 64k context, while GPT-5-nano solves none of the tasks. Qualitative analysis reveals systematic failure modes, including hallucinated diffs, incorrect file targets, and malformed patch headers. Overall, our findings highlight a significant gap between nominal context length and usable context capacity in current LLMs, and suggest that existing agentic coding benchmarks do not meaningfully evaluate long-context reasoning.
[LG-55] MARLEM: A Multi-Agent Reinforcement Learning Simulation Framework for Implicit Cooperation in Decentralized Local Energy Markets
链接: https://arxiv.org/abs/2602.16063
作者: Nelson Salazar-Pena,Alejandra Tabares,Andres Gonzalez-Mancera
类目: ystems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 32 pages, 7 figures, 1 table, 1 algorithm
Abstract:This paper introduces a novel, open-source MARL simulation framework for studying implicit cooperation in LEMs, modeled as a decentralized partially observable Markov decision process and implemented as a Gymnasium environment for MARL. Our framework features a modular market platform with plug-and-play clearing mechanisms, physically constrained agent models (including battery storage), a realistic grid network, and a comprehensive analytics suite to evaluate emergent coordination. The main contribution is a novel method to foster implicit cooperation, where agents’ observations and rewards are enhanced with system-level key performance indicators to enable them to independently learn strategies that benefit the entire system and aim for collectively beneficial outcomes without explicit communication. Through representative case studies (available in a dedicated GitHub repository at this https URL), we show the framework’s ability to analyze how different market configurations (such as varying storage deployment) impact system performance. This illustrates its potential to facilitate emergent coordination, improve market efficiency, and strengthen grid stability. The proposed simulation framework is a flexible, extensible, and reproducible tool for researchers and practitioners to design, test, and validate strategies for future intelligent, decentralized energy systems.
[LG-56] Multi-Objective Alignment of Language Models for Personalized Psychotherapy
链接: https://arxiv.org/abs/2602.16053
作者: Mehrab Beikzadeh,Yasaman Asadollah Salmanpour,Ashima Suvarna,Sriram Sankararaman,Matteo Malgaroli,Majid Sarrafzadeh,Saadia Gabriel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mental health disorders affect over 1 billion people worldwide, yet access to care remains limited by workforce shortages and cost constraints. While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety. We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization. We train reward models for six criteria – empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy – and systematically compare multi-objective approaches against single-objective optimization, supervised fine-tuning, and parameter merging. Multi-objective DPO (MODPO) achieves superior balance (77.6% empathy, 62.6% safety) compared to single-objective optimization (93.6% empathy, 47.8% safety), and therapeutic criteria outperform general communication principles by 17.2%. Blinded clinician evaluation confirms MODPO is consistently preferred, with LLM-evaluator agreement comparable to inter-clinician reliability.
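A hedged sketch of a multi-objective DPO loss in which auxiliary criteria enter the preference margin; this follows the general MODPO pattern from the literature, and the exact weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def modpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
               aux_rewards_w, aux_rewards_l, weights, beta=0.1):
    """Minimal sketch of a multi-objective DPO objective (assumed form): the
    usual DPO preference margin for the criterion being optimized is offset
    by a weighted margin of the remaining criteria's reward-model scores,
    steering the policy toward a chosen trade-off (empathy vs. safety, etc.).

    logp_* / ref_logp_*: policy and reference log-probs of chosen (w) and
        rejected (l) responses, shape (batch,)
    aux_rewards_*: (batch, n_aux) scores from the other criteria's reward models
    weights: (n_aux,) trade-off weights for the auxiliary criteria
    """
    dpo_margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    aux_margin = (aux_rewards_w - aux_rewards_l) @ weights
    return -F.logsigmoid(dpo_margin - aux_margin).mean()
```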
[LG-57] MoE-Spec: Expert Budgeting for Efficient Speculative Decoding
链接: https://arxiv.org/abs/2602.16052
作者: Bradley McDanel,Steven Li,Sruthikesh Surineni,Harshit Khaitan
类目: Machine Learning (cs.LG)
*备注: 12 pages, 10 figures
Abstract:Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce speculation depth when MoE verification becomes expensive. We propose MoE-Spec, a training-free verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the experts that contribute most to verification and dropping the long tail of rarely used experts that drive bandwidth overhead. Experiments across multiple model scales and datasets show that this method yields 10–30% higher throughput than state-of-the-art speculative decoding baselines (EAGLE-3) at comparable quality, with flexibility to trade accuracy for further latency reductions through tighter budgets.
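A minimal sketch of verification-time expert budgeting at one MoE layer: keep only the experts with the largest total routing mass across the draft tokens and renormalize each token's routing over the kept set. The selection policy is an assumed illustration of the idea, not the paper's exact rule.

```python
import torch

def budgeted_expert_selection(router_probs, capacity, topk=2):
    """Minimal sketch (assumed policy): enforce a fixed per-layer expert
    capacity during verification, dropping the long tail of rarely used
    experts that dominates memory traffic.

    router_probs: (tokens, n_experts) softmax routing weights for draft tokens
    """
    mass = router_probs.sum(dim=0)                 # per-expert total routing mass
    keep = torch.topk(mass, capacity).indices      # experts worth loading
    mask = torch.zeros_like(mass, dtype=torch.bool)
    mask[keep] = True
    pruned = router_probs * mask                   # zero out dropped experts
    vals, idx = torch.topk(pruned, topk, dim=-1)   # per-token top-k over kept set
    vals = vals / vals.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    return idx, vals                               # expert ids and renormalized weights
```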
[LG-58] Heuristic Search as Language-Guided Program Optimization
链接: https://arxiv.org/abs/2602.16038
作者: Mingxin Yu,Ruixiao Yang,Chuchu Fan
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, under review
Abstract:Large Language Models (LLMs) have advanced Automated Heuristic Design (AHD) in combinatorial optimization (CO) in the past few years. However, existing discovery pipelines often require extensive manual trial-and-error or reliance on domain expertise to adapt to new or complex problems. This stems from tightly coupled internal mechanisms that limit systematic improvement of the LLM-driven design process. To address this challenge, we propose a structured framework for LLM-driven AHD that explicitly decomposes the heuristic discovery process into modular stages: a forward pass for evaluation, a backward pass for analytical feedback, and an update step for program refinement. This separation provides a clear abstraction for iterative refinement and enables principled improvements of individual components. We validate our framework across four diverse real-world CO domains, where it consistently outperforms baselines, achieving up to 0.17 improvement in QYI on unseen test sets. Finally, we show that several popular AHD methods are restricted instantiations of our framework. By integrating them in our structured pipeline, we can upgrade the components modularly and significantly improve their performance.
[LG-59] MolCrystalFlow: Molecular Crystal Structure Prediction via Flow Matching
链接: https://arxiv.org/abs/2602.16020
作者: Cheng Zeng,Harry W. Sullivan,Thomas Egg,Maya M. Martirossyan,Philipp Höllmer,Jirui Jin,Richard G. Hennig,Adrian Roitberg,Stefano Martiniani,Ellad B. Tadmor,Mingjie Liu
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 20 pages, 4 figures
Abstract:Molecular crystal structure prediction represents a grand challenge in computational chemistry due to large sizes of constituent molecules and complex intra- and intermolecular interactions. While generative modeling has revolutionized structure discovery for molecules, inorganic solids, and metal-organic frameworks, extending such approaches to fully periodic molecular crystals is still elusive. Here, we present MolCrystalFlow, a flow-based generative model for molecular crystal structure prediction. The framework disentangles intramolecular complexity from intermolecular packing by embedding molecules as rigid bodies and jointly learning the lattice matrix, molecular orientations, and centroid positions. Centroids and orientations are represented on their native Riemannian manifolds, allowing geodesic flow construction and graph neural network operations that respects geometric symmetries. We benchmark our model against state-of-the-art generative models for large-size periodic crystals and rule-based structure generation methods on two open-source molecular crystal datasets. We demonstrate an integration of MolCrystalFlow model with universal machine learning potential to accelerate molecular crystal structure prediction, paving the way for data-driven generative discovery of molecular crystals.
[LG-60] Geometry-Aware Uncertainty Quantification via Conformal Prediction on Manifolds
链接: https://arxiv.org/abs/2602.16015
作者: Marzieh Amiri Shahbazi,Ali Baheri
类目: Machine Learning (cs.LG)
*备注:
Abstract:Conformal prediction provides distribution-free coverage guarantees for regression; yet existing methods assume Euclidean output spaces and produce prediction regions that are poorly calibrated when responses lie on Riemannian manifolds. We propose \emph{adaptive geodesic conformal prediction}, a framework that replaces Euclidean residuals with geodesic nonconformity scores and normalizes them by a cross-validated difficulty estimator to handle heteroscedastic noise. The resulting prediction regions, geodesic caps on the sphere, have position-independent area and adapt their size to local prediction difficulty, yielding substantially more uniform conditional coverage than non-adaptive alternatives. In a synthetic sphere experiment with strong heteroscedasticity and a real-world geomagnetic field forecasting task derived from IGRF-14 satellite data, the adaptive method markedly reduces conditional coverage variability and raises worst-case coverage much closer to the nominal level, while coordinate-based baselines waste a large fraction of coverage area due to chart distortion.
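A hedged sketch of the calibration and prediction steps for difficulty-normalized geodesic caps on the unit sphere; the difficulty estimator itself (cross-validated in the paper) is taken here as given, and the normalization form is an assumption.

```python
import numpy as np

def geodesic_conformal_caps(y_cal, yhat_cal, difficulty_cal,
                            yhat_test, difficulty_test, alpha=0.1):
    """Minimal sketch of adaptive geodesic conformal prediction on S^2
    (unit vectors throughout): calibration scores are geodesic distances
    arccos(<y, yhat>) divided by a difficulty estimate; each test point gets
    a geodesic cap whose radius is the conformal quantile rescaled by its
    own difficulty.
    """
    cosines = np.clip((y_cal * yhat_cal).sum(-1), -1.0, 1.0)
    scores = np.arccos(cosines) / difficulty_cal           # normalized residuals
    n = len(scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q = np.quantile(scores, q_level, method="higher")
    radii = q * difficulty_test                              # per-point cap radius (radians)
    return yhat_test, radii  # caps {y : d_geo(y, yhat) <= radius}
```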
[LG-61] Verifier-Constrained Flow Expansion for Discovery Beyond the Data ICLR2026
链接: https://arxiv.org/abs/2602.15984
作者: Riccardo De Santi,Kimon Protopapas,Ya-Ping Hsieh,Andreas Krause
类目: Machine Learning (cs.LG)
*备注: ICLR 2026
Abstract:Flow and diffusion models are typically pre-trained on limited available data (e.g., molecular samples), covering only a fraction of the valid design space (e.g., the full molecular space). As a consequence, they tend to generate samples from only a narrow portion of the feasible domain. This is a fundamental limitation for scientific discovery applications, where one typically aims to sample valid designs beyond the available data distribution. To this end, we address the challenge of leveraging access to a verifier (e.g., an atomic bonds checker), to adapt a pre-trained flow model so that its induced density expands beyond regions of high data availability, while preserving samples validity. We introduce formal notions of strong and weak verifiers and propose algorithmic frameworks for global and local flow expansion via probability-space optimization. Then, we present Flow Expander (FE), a scalable mirror descent scheme that provably tackles both problems by verifier-constrained entropy maximization over the flow process noised state space. Next, we provide a thorough theoretical analysis of the proposed method, and state convergence guarantees under both idealized and general assumptions. Ultimately, we empirically evaluate our method on both illustrative, yet visually interpretable settings, and on a molecular design task showcasing the ability of FE to expand a pre-trained flow model increasing conformer diversity while preserving validity.
[LG-62] Fast Online Learning with Gaussian Prior-Driven Hierarchical Unimodal Thompson Sampling
链接: https://arxiv.org/abs/2602.15972
作者: Tianchi Zhao,He Liu,Hongyin Shi,Jinliang Li
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study a type of Multi-Armed Bandit (MAB) problem in which arms with Gaussian reward feedback are clustered. Such an arm setting finds applications in many real-world problems, for example, mmWave communications and portfolio management with risky assets, as a result of the universality of the Gaussian distribution. Building on the Thompson Sampling with Gaussian prior (TSG) algorithm for the selection of the optimal arm, we propose our Thompson Sampling with Clustered arms under Gaussian prior (TSCG), specific to the 2-level hierarchical structure. We prove that by utilizing the 2-level structure, we can achieve a lower regret bound than we do with ordinary TSG. In addition, when the reward is unimodal, we can reach an even lower bound on the regret with our Unimodal Thompson Sampling algorithm with Clustered Arms under Gaussian prior (UTSCG). Each of our proposed algorithms is accompanied by a theoretical evaluation of the upper regret bound, and our numerical experiments confirm the advantage of the proposed algorithms.
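A minimal sketch of a two-level Thompson-sampling step over clustered Gaussian arms: draw posterior samples per cluster, commit to the most promising cluster, then run ordinary Gaussian TS among its arms. The cluster-level proxy used here is an assumption; TSCG's exact hierarchical posterior may differ.

```python
import numpy as np

def ts_clustered_gaussian_step(means_post, vars_post, clusters, rng=None):
    """Minimal sketch (assumed hierarchy).

    means_post, vars_post: (n_arms,) Gaussian posterior parameters per arm
    clusters: dict mapping cluster id -> list of arm indices
    """
    rng = rng or np.random.default_rng()
    # cluster-level draw: use the best posterior sample within each cluster
    cluster_draws = {}
    for c, arms in clusters.items():
        draws = rng.normal(means_post[arms], np.sqrt(vars_post[arms]))
        cluster_draws[c] = draws.max()
    best_c = max(cluster_draws, key=cluster_draws.get)
    # arm-level draw: ordinary Gaussian TS inside the chosen cluster
    arms = clusters[best_c]
    draws = rng.normal(means_post[arms], np.sqrt(vars_post[arms]))
    return arms[int(np.argmax(draws))]
```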
[LG-63] R2Energy: A Large-Scale Benchmark for Robust Renewable Energy Forecasting under Diverse and Extreme Conditions
链接: https://arxiv.org/abs/2602.15961
作者: Zhi Sheng,Yuan Yuan,Guozhen Zhang,Yong Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapid expansion of renewable energy, particularly wind and solar power, has made reliable forecasting critical for power system operations. While recent deep learning models have achieved strong average accuracy, the increasing frequency and intensity of climate-driven extreme weather events pose severe threats to grid stability and operational security. Consequently, developing robust forecasting models that can withstand volatile conditions has become a paramount challenge. In this paper, we present R^2 Energy, a large-scale benchmark for NWP-assisted renewable energy forecasting. It comprises over 10.7 million high-fidelity hourly records from 902 wind and solar stations across four provinces in China, providing the diverse meteorological conditions necessary to capture the wide-ranging variability of renewable generation. We further establish a standardized, leakage-free forecasting paradigm that grants all models identical access to future Numerical Weather Prediction (NWP) signals, enabling fair and reproducible comparison across state-of-the-art representative forecasting architectures. Beyond aggregate accuracy, we incorporate regime-wise evaluation with expert-aligned extreme weather annotations, uncovering a critical “robustness gap” typically obscured by average metrics. This gap reveals a stark robustness-complexity trade-off: under extreme conditions, a model’s reliability is driven by its meteorological integration strategy rather than its architectural complexity. R^2 Energy provides a principled foundation for evaluating and developing forecasting models for safety-critical power system applications.
[LG-64] Adaptive Semi-Supervised Training of P300 ERP-BCI Speller System with Minimum Calibration Effort
链接: https://arxiv.org/abs/2602.15955
作者: Shumeng Chen,Jane E. Huggins,Tianwen Ma
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 8 pages, 8 figures
Abstract:A P300 ERP-based Brain-Computer Interface (BCI) speller is an assistive communication tool. It searches for the P300 event-related potential (ERP) elicited by target stimuli, distinguishing it from the neural responses to non-target stimuli embedded in electroencephalogram (EEG) signals. Conventional methods require a lengthy calibration procedure to construct the binary classifier, which reduces overall efficiency. We therefore propose a unified framework with minimal calibration effort: given a small amount of labeled calibration data, we employ an adaptive semi-supervised EM-GMM algorithm to update the binary classifier. We evaluated our method based on character-level prediction accuracy, information transfer rate (ITR), and BCI utility. We applied calibration on training data and report results on testing data. Our results indicate that 9 of 15 participants exceeded the minimum character-level accuracy of 0.7 using either our adaptive method or the benchmark, and for 7 of these 9 participants our adaptive method performed better than the benchmark. The proposed semi-supervised learning framework provides a practical and efficient alternative for improving overall spelling efficiency in real-time BCI speller systems, particularly in contexts with limited labeled data.
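A rough sketch of the adaptive semi-supervised idea, reduced to a two-class Gaussian mixture over 1-D classifier scores: a small labeled calibration set initializes the components, and EM then refines them with unlabeled data. The data here are synthetic stand-ins for real EEG-derived scores, and the exact update schedule is an assumption.

```python
import numpy as np

# Semi-supervised EM for a two-component 1-D Gaussian mixture:
# labeled calibration points keep hard labels, unlabeled points get
# soft responsibilities, and both are pooled in the M-step.
rng = np.random.default_rng(1)
x_lab = np.r_[rng.normal(0, 1, 20), rng.normal(2, 1, 20)]    # small calibration set
y_lab = np.r_[np.zeros(20), np.ones(20)]
x_unl = np.r_[rng.normal(0, 1, 400), rng.normal(2, 1, 100)]  # unlabeled online data

mu = np.array([x_lab[y_lab == 0].mean(), x_lab[y_lab == 1].mean()])
var = np.array([x_lab[y_lab == 0].var(), x_lab[y_lab == 1].var()])
pi = np.array([0.8, 0.2])  # non-targets dominate in P300 paradigms

def normal_pdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: soft responsibilities for unlabeled data only.
    lik = np.stack([pi[k] * normal_pdf(x_unl, mu[k], var[k]) for k in (0, 1)], axis=1)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: pool hard (labeled) and soft (unlabeled) assignments.
    x = np.r_[x_lab, x_unl]
    for k in (0, 1):
        w = np.r_[(y_lab == k).astype(float), resp[:, k]]
        mu[k] = (w * x).sum() / w.sum()
        var[k] = (w * (x - mu[k]) ** 2).sum() / w.sum()
        pi[k] = w.sum() / len(x)
print("class means:", mu, "mixing weights:", pi)
```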
[LG-65] Statistical-Geometric Degeneracy in UAV Search: A Physics-Aware Asymmetric Filtering Approach
链接: https://arxiv.org/abs/2602.15893
作者: Zhiyuan Ren,Yudong Fang,Tao Zhang,Wenchi Cheng,Ben Lan
类目: Robotics (cs.RO); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Post-disaster survivor localization using Unmanned Aerial Vehicles (UAVs) faces a fundamental physical challenge: the prevalence of Non-Line-of-Sight (NLOS) propagation in collapsed structures. Unlike standard Gaussian noise, signal reflection from debris introduces strictly non-negative ranging biases. Existing robust estimators, typically designed with symmetric loss functions (e.g., Huber or Tukey), implicitly rely on the assumption of error symmetry. Consequently, they experience a theoretical mismatch in this regime, leading to a phenomenon we formally identify as Statistical-Geometric Degeneracy (SGD), a state where the estimator stagnates due to the coupling of persistent asymmetric bias and limited observation geometry. While emerging data-driven approaches offer alternatives, they often struggle with the scarcity of training data and the sim-to-real gap inherent in unstructured disaster zones. In this work, we propose a physically-grounded solution, the AsymmetricHuberEKF, which explicitly incorporates the non-negative physical prior of NLOS biases via a derived asymmetric loss function. Theoretically, we show that standard symmetric filters correspond to a degenerate case of our framework where the physical constraint is relaxed. Furthermore, we demonstrate that resolving SGD requires not just a robust filter, but specific bilateral information, which we achieve through a co-designed active sensing strategy. Validated in a 2D nadir-view scanning scenario, our approach significantly accelerates convergence compared to symmetric baselines, offering a resilient building block for search operations where data is scarce and geometry is constrained.
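The key ingredient is the asymmetric loss; below is a minimal sketch of one plausible form, quadratic for negative residuals (plain noise) and Huber-like linear for large positive residuals (NLOS-inflated ranges). The threshold `delta` and this exact functional form are illustrative assumptions, not the paper's derived loss.

```python
import numpy as np

# Asymmetric Huber-style loss for non-negative NLOS ranging bias.
# Residual r = measured - predicted range: large positive residuals are
# likely NLOS-inflated, so they get a robust linear tail; negative
# residuals stay quadratic since they reflect ordinary noise.
def asymmetric_huber(r, delta=1.0):
    r = np.asarray(r, dtype=float)
    loss = 0.5 * r ** 2                          # quadratic core for both signs
    pos = r > delta
    loss[pos] = delta * (r[pos] - 0.5 * delta)   # linear tail only for large positive r
    return loss

print(asymmetric_huber(np.array([-2.0, -0.5, 0.5, 3.0])))
```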
[LG-66] Distributed physics-informed neural networks via domain decomposition for fast flow reconstruction
链接: https://arxiv.org/abs/2602.15883
作者: Yixiao Qian,Jiaxu Liu,Zewei Xia,Song Chen,Chao Xu,Shengze Cai
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Physics-Informed Neural Networks (PINNs) offer a powerful paradigm for flow reconstruction, seamlessly integrating sparse velocity measurements with the governing Navier-Stokes equations to recover complete velocity and latent pressure fields. However, scaling such models to large spatiotemporal domains is hindered by computational bottlenecks and optimization instabilities. In this work, we propose a robust distributed PINNs framework designed for efficient flow reconstruction via spatiotemporal domain decomposition. A critical challenge in such distributed solvers is pressure indeterminacy, where independent sub-networks drift into inconsistent local pressure baselines. We address this issue through a reference anchor normalization strategy coupled with decoupled asymmetric weighting. By enforcing a unidirectional information flow from designated master ranks where the anchor point lies to neighboring ranks, our approach eliminates gauge freedom and guarantees global pressure uniqueness while preserving temporal continuity. Furthermore, to mitigate the Python interpreter overhead associated with computing high-order physics residuals, we implement a high-performance training pipeline accelerated by CUDA graphs and JIT compilation. Extensive validation on complex flow benchmarks demonstrates that our method achieves near-linear strong scaling and high-fidelity reconstruction, establishing a scalable and physically rigorous pathway for flow reconstruction and understanding of complex hydrodynamics.
[LG-67] BamaER: A Behavior-Aware Memory-Augmented Model for Exercise Recommendation
链接: https://arxiv.org/abs/2602.15879
作者: Qing Yang,Yuhao Jiang,Rui Wang,Jipeng Guo,Yejiang Wang,Xinghe Cheng,Zezheng Wu,Jiapu Wang,Jingwei Zhang
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Exercise recommendation focuses on personalized exercise selection conditioned on students’ learning history, personal interests, and other individualized characteristics. Despite notable progress, most existing methods represent student learning solely as exercise sequences, overlooking rich behavioral interaction information. This limited representation often leads to biased and unreliable estimates of learning progress. Moreover, fixed-length sequence segmentation limits the incorporation of early learning experiences, thereby hindering the modeling of long-term dependencies and the accurate estimation of knowledge mastery. To address these limitations, we propose BamaER, a Behavior-aware memory-augmented Exercise Recommendation framework that comprises three core modules: (i) the learning progress prediction module that captures heterogeneous student interaction behaviors via a tri-directional hybrid encoding scheme; (ii) the memory-augmented knowledge tracing module that maintains a dynamic memory matrix to jointly model historical and current knowledge states for robust mastery estimation; and (iii) the exercise filtering module that formulates candidate selection as a diversity-aware optimization problem, solved via the Hippopotamus Optimization Algorithm to reduce redundancy and improve recommendation coverage. Experiments on five real-world educational datasets show that BamaER consistently outperforms state-of-the-art baselines across a range of evaluation metrics.
[LG-68] Parameter-free representations outperform single-cell foundation models on downstream benchmarks
链接: https://arxiv.org/abs/2602.16696
作者: Huan Souza,Pankaj Mehta
类目: Genomics (q-bio.GN); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models, such as TranscriptFormer, that use transformer-based architectures to learn a generative model for gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without utilizing computationally intensive deep learning-based representations. Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, including outperforming foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single cell gene expression data.
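As a sketch of the kind of simple pipeline the paper advocates: per-cell depth normalization, log transform, PCA, and a linear classifier. The random counts and labels below are placeholders for a real cells-by-genes matrix, so the printed accuracy is chance level; only the pipeline shape is the point.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Parameter-free-style linear baseline for cell-type classification:
# depth-normalize, log1p, PCA to 50 components, logistic regression.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(500, 2000)).astype(float)  # cells x genes placeholder
labels = rng.integers(0, 5, size=500)                      # hypothetical cell types

cpm = counts / counts.sum(axis=1, keepdims=True) * 1e4     # per-cell depth normalization
X = np.log1p(cpm)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = make_pipeline(PCA(n_components=50), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print("held-out accuracy (chance level on random data):", clf.score(X_te, y_te))
```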
[LG-69] Synthetic-Powered Multiple Testing with FDR Control
链接: https://arxiv.org/abs/2602.16690
作者: Yonghoon Lee,Meshi Bashari,Edgar Dobriban,Yaniv Romano
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Multiple hypothesis testing with false discovery rate (FDR) control is a fundamental problem in statistical inference, with broad applications in genomics, drug screening, and outlier detection. In many such settings, researchers may have access not only to real experimental observations but also to auxiliary or synthetic data – from past, related experiments or generated by generative models – that can provide additional evidence about the hypotheses of interest. We introduce SynthBH, a synthetic-powered multiple testing procedure that safely leverages such synthetic data. We prove that SynthBH guarantees finite-sample, distribution-free FDR control under a mild PRDS-type positive dependence condition, without requiring the pooled-data p-values to be valid under the null. The proposed method adapts to the (unknown) quality of the synthetic data: it enhances the sample efficiency and may boost the power when synthetic data are of high quality, while controlling the FDR at a user-specified level regardless of their quality. We demonstrate the empirical performance of SynthBH on tabular outlier detection benchmarks and on genomic analyses of drug-cancer sensitivity associations, and further study its properties through controlled experiments on simulated data.
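For orientation, here is the plain Benjamini-Hochberg step-up procedure that SynthBH extends; the synthetic-data-powered p-values of the paper are replaced by ordinary ones, so this is the baseline, not SynthBH itself.

```python
import numpy as np

# Benjamini-Hochberg step-up: reject the k smallest p-values, where k is
# the largest index i with p_(i) <= alpha * i / m.
def benjamini_hochberg(pvals, alpha=0.1):
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

rng = np.random.default_rng(0)
pvals = np.r_[rng.uniform(0, 1, 90), rng.uniform(0, 0.01, 10)]  # 10 true signals
print("discoveries:", benjamini_hochberg(pvals).sum())
```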
[LG-70] Investigating Nonlinear Quenching Effects on Polar Field Buildup in the Sun Using Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2602.16656
作者: Jithu J. Athalathil,Mohammed H. Talafha,Bhargav Vaidya
类目: Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注: Accepted for publication in The Astrophysical Journal
Abstract:The solar dynamo relies on the regeneration of the poloidal magnetic field through processes strongly modulated by nonlinear feedbacks such as tilt quenching (TQ) and latitude quenching (LQ). These mechanisms play a decisive role in regulating the buildup of the Sun's polar field and, in turn, the amplitude of future solar cycles. In this work, we employ Physics-Informed Neural Networks (PINN) to solve the surface flux transport (SFT) equation, embedding physical constraints directly into the neural network framework. By systematically varying transport parameters, we isolate the relative contributions of TQ and LQ to polar dipole buildup. We use the residual dipole moment as a diagnostic for cycle-to-cycle amplification and show that TQ suppression strengthens with increasing diffusivity, while LQ dominates in advection-dominated regimes. The ratio $\Delta D_{\mathrm{LQ}}/\Delta D_{\mathrm{TQ}}$ exhibits a smooth inverse-square dependence on the dynamo effectivity range, refining previous empirical fits with improved accuracy and reduced scatter. The results further reveal that the need for a decay term is not essential for PINN set-up due to the training process. Compared with the traditional 1D SFT model, the PINN framework achieves significantly lower error metrics and more robust recovery of nonlinear trends. Our results suggest that the nonlinear interplay between LQ and TQ can naturally produce alternations between weak and strong cycles, providing a physical explanation for the observed even-odd cycle modulation. These findings demonstrate the potential of PINN as an accurate, efficient, and physically consistent tool for solar cycle prediction.
[LG-71] Error Propagation and Model Collapse in Diffusion Models: A Theoretical Study
链接: https://arxiv.org/abs/2602.16601
作者: Nail B. Khelifa,Richard E. Turner,Ramji Venkataramanan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning models are increasingly trained or fine-tuned on synthetic data. Recursively training on such data has been observed to significantly degrade performance in a wide range of tasks, often characterized by a progressive drift away from the target distribution. In this work, we theoretically analyze this phenomenon in the setting of score-based diffusion models. For a realistic pipeline where each training round uses a combination of synthetic data and fresh samples from the target distribution, we obtain upper and lower bounds on the accumulated divergence between the generated and target distributions. This allows us to characterize different regimes of drift, depending on the score estimation error and the proportion of fresh data used in each generation. We also provide empirical results on synthetic data and images to illustrate the theory.
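The recursive pipeline is easy to simulate in a toy Gaussian setting: each generation fits a Gaussian to a mix of fresh target samples and samples from the previous fit, and the fitted parameters drift over generations. The fresh-data fraction below is an arbitrary choice.

```python
import numpy as np

# Toy recursive-training loop: each round fits (mu, sigma) to a blend of
# fresh N(0, 1) samples and synthetic samples from the previous fit,
# then we watch the fitted parameters drift across generations.
rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0          # current model parameters
n, frac_fresh = 1000, 0.2     # fraction of fresh real data per round
for gen in range(20):
    n_fresh = int(frac_fresh * n)
    data = np.r_[rng.normal(0.0, 1.0, n_fresh),        # fresh target samples
                 rng.normal(mu, sigma, n - n_fresh)]   # synthetic samples
    mu, sigma = data.mean(), data.std()
    if gen % 5 == 4:
        print(f"generation {gen + 1}: mu={mu:+.3f}, sigma={sigma:.3f}")
```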
[LG-72] Separating Oblivious and Adaptive Models of Variable Selection
链接: https://arxiv.org/abs/2602.16568
作者: Ziyun Chen,Jerry Li,Kevin Tian,Yusong Zhu
类目: atistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 40 pages
Abstract:Sparse recovery is among the most well-studied problems in learning theory and high-dimensional statistics. In this work, we investigate the statistical and computational landscapes of sparse recovery with $\ell_\infty$ error guarantees. This variant of the problem is motivated by variable selection tasks, where the goal is to estimate the support of a $k$-sparse signal in $\mathbb{R}^d$. Our main contribution is a provable separation between the oblivious ("for each") and adaptive ("for all") models of $\ell_\infty$ sparse recovery. We show that under an oblivious model, the optimal $\ell_\infty$ error is attainable in near-linear time with $\approx k\log d$ samples, whereas in an adaptive model, $\gtrsim k^2$ samples are necessary for any algorithm to achieve this bound. This establishes a surprising contrast with the standard $\ell_2$ setting, where $\approx k \log d$ samples suffice even for adaptive sparse recovery. We conclude with a preliminary examination of a partially-adaptive model, where we show nontrivial variable selection guarantees are possible with $\approx k\log d$ measurements.
[LG-73] Learning Distributed Equilibria in Linear-Quadratic Stochastic Differential Games: An α-Potential Approach
链接: https://arxiv.org/abs/2602.16555
作者: Philipp Plank,Yufei Zhang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:We analyze independent policy-gradient (PG) learning in $N$-player linear-quadratic (LQ) stochastic differential games. Each player employs a distributed policy that depends only on its own state and updates the policy independently using the gradient of its own objective. We establish global linear convergence of these methods to an equilibrium by showing that the LQ game admits an $\alpha$-potential structure, with $\alpha$ determined by the degree of pairwise interaction asymmetry. For pairwise-symmetric interactions, we construct an affine distributed equilibrium by minimizing the potential function and show that independent PG methods converge globally to this equilibrium, with complexity scaling linearly in the population size and logarithmically in the desired accuracy. For asymmetric interactions, we prove that independent projected PG algorithms converge linearly to an approximate equilibrium, with suboptimality proportional to the degree of asymmetry. Numerical experiments confirm the theoretical results across both symmetric and asymmetric interaction networks.
[LG-74] Optimal training-conditional regret for online conformal prediction
链接: https://arxiv.org/abs/2602.16537
作者: Jiadong Liang,Zhimei Ren,Yuxin Chen
类目: atistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study online conformal prediction for non-stationary data streams subject to unknown distribution drift. While most prior work studied this problem under adversarial settings and/or assessed performance in terms of gaps of time-averaged marginal coverage, we instead evaluate performance through training-conditional cumulative regret. We specifically focus on independently generated data with two types of distribution shift: abrupt change points and smooth drift. When non-conformity score functions are pretrained on an independent dataset, we propose a split-conformal style algorithm that leverages drift detection to adaptively update calibration sets, which provably achieves minimax-optimal regret. When non-conformity scores are instead trained online, we develop a full-conformal style algorithm that again incorporates drift detection to handle non-stationarity; this approach relies on stability - rather than permutation symmetry - of the model-fitting algorithm, which is often better suited to online learning under evolving environments. We establish non-asymptotic regret guarantees for our online full conformal algorithm, which match the minimax lower bound under appropriate restrictions on the prediction sets. Numerical experiments corroborate our theoretical findings.
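A crude illustration of the split-conformal side of the problem: calibrate the prediction-set quantile on a sliding window of recent non-conformity scores, here with a hard-coded change point in place of the paper's drift detection.

```python
import numpy as np

# Split-conformal with a sliding calibration window: at each time t, the
# (1 - alpha) quantile of the last `window` scores defines the prediction set.
rng = np.random.default_rng(0)
alpha, window = 0.1, 200

# Hypothetical stream of non-conformity scores (absolute residuals of some
# pretrained model), with an abrupt change point at t = 1000.
scores = np.abs(rng.normal(0, 1, 2000))
scores[1000:] *= 2.0

covered = []
for t in range(window, len(scores)):
    cal = scores[t - window:t]            # recent calibration scores only
    q = np.quantile(cal, np.ceil((1 - alpha) * (window + 1)) / window)
    covered.append(scores[t] <= q)
print("empirical coverage:", np.mean(covered))
```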
[LG-75] Functional Decomposition and Shapley Interactions for Interpreting Survival Models
链接: https://arxiv.org/abs/2602.16505
作者: Sophie Hanna Langbein,Hubert Baniecki,Fabian Fumagalli,Niklas Koenen,Marvin N. Wright,Julia Herbinger
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Hazard and survival functions are natural, interpretable targets in time-to-event prediction, but their inherent non-additivity fundamentally limits standard additive explanation methods. We introduce Survival Functional Decomposition (SurvFD), a principled approach for analyzing feature interactions in machine learning survival models. By decomposing higher-order effects into time-dependent and time-independent components, SurvFD offers a previously unrecognized perspective on survival explanations, explicitly characterizing when and why additive explanations fail. Building on this theoretical decomposition, we propose SurvSHAP-IQ, which extends Shapley interactions to time-indexed functions, providing a practical estimator for higher-order, time-dependent interactions. Together, SurvFD and SurvSHAP-IQ establish an interaction- and time-aware interpretability approach for survival modeling, with broad applicability across time-to-event prediction tasks.
[LG-76] Learning Preference from Observed Rankings
链接: https://arxiv.org/abs/2602.16476
作者: Yu-Chang Chen,Chen Chian Fuh,Shang En Tsai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Estimating consumer preferences is central to many problems in economics and marketing. This paper develops a flexible framework for learning individual preferences from partial ranking information by interpreting observed rankings as collections of pairwise comparisons with logistic choice probabilities. We model latent utility as the sum of interpretable product attributes, item fixed effects, and a low-rank user-item factor structure, enabling both interpretability and information sharing across consumers and items. We further correct for selection in which comparisons are observed: a comparison is recorded only if both items enter the consumer’s consideration set, inducing exposure bias toward frequently encountered items. We model pair observability as the product of item-level observability propensities and estimate these propensities with a logistic model for the marginal probability that an item is observable. Preference parameters are then estimated by maximizing an inverse-probability-weighted (IPW), ridge-regularized log-likelihood that reweights observed comparisons toward a target comparison population. To scale computation, we propose a stochastic gradient descent (SGD) algorithm based on inverse-probability resampling, which draws comparisons in proportion to their IPW weights. In an application to transaction data from an online wine retailer, the method improves out-of-sample recommendation performance relative to a popularity-based benchmark, with particularly strong gains in predicting purchases of previously unconsumed products.
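A stripped-down sketch of the pairwise-comparison likelihood: rankings become (winner, loser) pairs, utilities enter a logistic model through differences, and inverse-probability weights enter as sample weights; the low-rank user-item factor structure is omitted, and the weights below are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Bradley-Terry-style estimation of item utilities from pairwise comparisons
# via weighted, ridge-regularized logistic regression (ridge via C).
rng = np.random.default_rng(0)
n_items, n_pairs = 10, 2000
true_u = rng.normal(0, 1, n_items)

i = rng.integers(0, n_items, n_pairs)
j = rng.integers(0, n_items, n_pairs)
keep = i != j
i, j = i[keep], j[keep]
p_win = 1 / (1 + np.exp(-(true_u[i] - true_u[j])))
y = (rng.uniform(size=len(i)) < p_win).astype(int)   # 1 if item i beats item j

# Each row of X encodes the utility difference u_i - u_j.
X = np.zeros((len(i), n_items))
X[np.arange(len(i)), i] = 1.0
X[np.arange(len(i)), j] = -1.0
w = rng.uniform(0.5, 2.0, len(i))                    # placeholder IPW weights
clf = LogisticRegression(fit_intercept=False, C=10.0).fit(X, y, sample_weight=w)
print("corr(true, estimated utilities):", np.corrcoef(true_u, clf.coef_[0])[0, 1].round(3))
```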
[LG-77] Multi-Channel Replay Speech Detection using Acoustic Maps
链接: https://arxiv.org/abs/2602.16399
作者: Michael Neri,Tuomas Virtanen
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Submitted to EUSIPCO 2026
Abstract:Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.
[LG-78] Machine Learning in Epidemiology
链接: https://arxiv.org/abs/2602.16352
作者: Marvin N. Wright,Lukas Burk,Pegah Golchian,Jan Kapar,Niklas Koenen,Sophie Hanna Langbein
类目: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
Abstract:In the age of digital epidemiology, epidemiologists are faced by an increasing amount of data of growing complexity and dimensionality. Machine learning is a set of powerful tools that can help to analyze such enormous amounts of data. This chapter lays the methodological foundations for successfully applying machine learning in epidemiology. It covers the principles of supervised and unsupervised learning and discusses the most important machine learning methods. Strategies for model evaluation and hyperparameter optimization are developed and interpretable machine learning is introduced. All these theoretical parts are accompanied by code examples in R, where an example dataset on heart disease is used throughout the chapter.
[LG-79] Structured Unitary Tensor Network Representations for Circuit-Efficient Quantum Data Encoding
链接: https://arxiv.org/abs/2602.16266
作者: Guang Lin,Toshihisa Tanaka,Qibin Zhao
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Encoding classical data into quantum states is a central bottleneck in quantum machine learning: many widely used encodings are circuit-inefficient, requiring deep circuits and substantial quantum resources, which limits scalability on quantum hardware. In this work, we propose TNQE, a circuit-efficient quantum data encoding framework built on structured unitary tensor network (TN) representations. TNQE first represents each classical input via a TN decomposition and then compiles the resulting tensor cores into an encoding circuit through two complementary core-to-circuit strategies. To make this compilation trainable while respecting the unitary nature of quantum operations, we introduce a unitary-aware constraint that parameterizes TN cores as learnable block unitaries, enabling them to be directly optimized and directly encoded as quantum operators. The proposed TNQE framework enables explicit control over circuit depth and qubit resources, allowing the construction of shallow, resource-efficient circuits. Across a range of benchmarks, TNQE achieves encoding circuits as shallow as $0.04\times$ the depth of amplitude encoding, while naturally scaling to high-resolution images ($256 \times 256$) and demonstrating practical feasibility on real quantum hardware.
[LG-80] On sparsity extremal structure and monotonicity properties of Wasserstein and Gromov-Wasserstein optimal transport plans
链接: https://arxiv.org/abs/2602.16265
作者: Titouan Vayer(COMPACT)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:This note gives a self-contained overview of some important properties of the Gromov-Wasserstein (GW) distance, compared with the standard linear optimal transport (OT) framework. More specifically, I explore the following questions: are GW optimal transport plans sparse? Under what conditions are they supported on a permutation? Do they satisfy a form of cyclical monotonicity? In particular, I present the conditionally negative semi-definite property and show that, when it holds, there are GW optimal plans that are sparse and supported on a permutation.
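These questions can be probed numerically; the sketch below, assuming the POT (Python Optimal Transport) library, computes a GW plan between two near-isometric point clouds and checks whether its support looks like a permutation.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed installed)

# Compute a Gromov-Wasserstein plan between a point cloud and a permuted,
# slightly perturbed copy, then inspect the sparsity of the plan's support.
rng = np.random.default_rng(0)
n = 8
X = rng.normal(size=(n, 2))
Y = X[rng.permutation(n)] + 0.01 * rng.normal(size=(n, 2))  # near-isometric copy

C1 = ot.dist(X, X)                  # intra-domain cost matrices
C2 = ot.dist(Y, Y)
p = q = np.ones(n) / n
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, 'square_loss')
# A single nonzero per row suggests the plan is supported on a permutation.
print("nonzeros per row:", (T > 1e-8).sum(axis=1))
```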
[LG-81] Local adapt-then-combine algorithms for distributed nonsmooth optimization: Achieving provable communication acceleration
链接: https://arxiv.org/abs/2602.16148
作者: Luyao Guo,Xinli Shi,Wenying Xu,Jinde Cao
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:This paper is concerned with the distributed composite optimization problem over networks, where agents aim to minimize a sum of local smooth components and a common nonsmooth term. Leveraging the probabilistic local updates mechanism, we propose a communication-efficient Adapt-Then-Combine (ATC) framework, FlexATC, unifying numerous ATC-based distributed algorithms. Under stepsizes independent of the network topology and the number of local updates, we establish sublinear and linear convergence rates for FlexATC in convex and strongly convex settings, respectively. Remarkably, in the strong convex setting, the linear rate is decoupled from the objective functions and network topology, and FlexATC permits communication to be skipped in most iterations without any deterioration of the linear rate. In addition, the proposed unified theory demonstrates for the first time that local updates provably lead to communication acceleration for ATC-based distributed algorithms. Numerical experiments further validate the efficacy of the proposed framework and corroborate the theoretical results.
[LG-82] Ratio Covers of Convex Sets and Optimal Mixture Density Estimation
链接: https://arxiv.org/abs/2602.16142
作者: Spencer Compton,Gábor Lugosi,Jaouad Mourtada,Jian Qian,Nikita Zhivotovskiy
类目: atistics Theory (math.ST); Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注: 45 pages
Abstract:We study density estimation in Kullback-Leibler divergence: given an i.i.d. sample from an unknown density $p$, the goal is to construct an estimator $\widehat p$ such that $\mathrm{KL}(p,\widehat p)$ is small with high probability. We consider two settings involving a finite dictionary of $M$ densities: (i) model aggregation, where $p$ belongs to the dictionary, and (ii) convex aggregation (mixture density estimation), where $p$ is a mixture of densities from the dictionary. Crucially, we make no assumption on the base densities: their ratios may be unbounded and their supports may differ. For both problems, we identify the best possible high-probability guarantees in terms of the dictionary size, sample size, and confidence level. These optimal rates are higher than those achievable when density ratios are bounded by absolute constants; for mixture density estimation, they match existing lower bounds in the special case of discrete distributions. Our analysis of the mixture case hinges on two new covering results. First, we provide a sharp, distribution-free upper bound on the local Hellinger entropy of the class of mixtures of $M$ distributions. Second, we prove an optimal ratio covering theorem for convex sets: for every convex compact set $K\subset \mathbb{R}_+^d$, there exists a subset $A\subset K$ with at most $2^{8d}$ elements such that each element of $K$ is coordinate-wise dominated by an element of $A$ up to a universal constant factor. This geometric result is of independent interest; notably, it yields new cardinality estimates for $\varepsilon$-approximate Pareto sets in multi-objective optimization when the attainable set of objective vectors is convex.
[LG-83] Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis
链接: https://arxiv.org/abs/2602.16131
作者: Chihiro Watanabe,Jingyu Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However, this process can obscure the quality and distributional characteristics of the original responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This enables a more nuanced assessment of response quality beyond exact match metrics. To analyze the response distributions across different agent configurations, we further introduce a clustering method for ECDFs using their distances and the $k$-medoids algorithm. Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structures in the responses, offering insights into the impact of temperature, persona, and question topics.
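A compact sketch of the evaluation pipeline: build ECDFs of similarity scores on a shared grid, compare them with the sup (Kolmogorov-Smirnov-style) distance, and cluster with a small hand-rolled k-medoids loop. The Beta-distributed scores are stand-ins for real response/reference cosine similarities, and the distance choice is an assumption.

```python
import numpy as np

# Five "agent configurations", each producing 200 similarity scores in [0, 1];
# two configurations are strong, two are middling, one is weak.
rng = np.random.default_rng(0)
groups = [rng.beta(a, b, 200) for a, b in [(8, 2), (8, 2), (2, 2), (2, 2), (2, 8)]]
grid = np.linspace(0, 1, 101)
ecdfs = np.array([(g[:, None] <= grid).mean(axis=0) for g in groups])

# Pairwise sup distance between ECDFs.
D = np.abs(ecdfs[:, None, :] - ecdfs[None, :, :]).max(axis=2)

def k_medoids(D, k, iters=20):
    medoids = list(range(k))
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):  # medoid = member minimizing within-cluster distance
                medoids[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=0))]
    return np.argmin(D[:, medoids], axis=1)

print("cluster labels:", k_medoids(D, k=3))
```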
[LG-84] Examining Fast Radiative Feedbacks Using Machine-Learning Weather Emulators
链接: https://arxiv.org/abs/2602.16090
作者: Ankur Mahesh,William D. Collins,Travis A. O’Brien,Paul B. Goddard,Sinclaire Zebaze,Shashank Subramanian,James P.C. Duncan,Oliver Watt-Meyer,Boris Bonev,Thorsten Kurth,Karthik Kashinath,Michael S. Pritchard,Da Yang
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:The response of the climate system to increased greenhouse gases and other radiative perturbations is governed by a combination of fast and slow feedbacks. Slow feedbacks are typically activated in response to changes in ocean temperatures on decadal timescales and manifest as changes in climatic state with no recent historical analogue. However, fast feedbacks are activated in response to rapid atmospheric physical processes on weekly timescales, and they are already operative in the present-day climate. This distinction implies that the physics of fast radiative feedbacks is present in the historical meteorological reanalyses used to train many recent successful machine-learning-based (ML) emulators of weather and climate. In addition, these feedbacks are functional under the historical boundary conditions pertaining to the top-of-atmosphere radiative balance and sea-surface temperatures. Together, these factors imply that we can use historically trained ML weather emulators to study the response of radiative-convective equilibrium (RCE), and hence the global hydrological cycle, to perturbations in carbon dioxide and other well-mixed greenhouse gases. Without retraining on prospective Earth system conditions, we use ML weather emulators to quantify the fast precipitation response to reduced and elevated carbon dioxide concentrations with no recent historical precedent. We show that the responses from historically trained emulators agree with those produced by full-physics Earth System Models (ESMs). In conclusion, we discuss the prospects for and advantages from using ESMs and ML emulators to study fast processes in global climate.
[LG-85] Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models
链接: https://arxiv.org/abs/2602.16061
作者: Hongyu Chen,David Simchi-Levi,Ruoxuan Xiong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
*备注:
Abstract:Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak shadow variables: they satisfy a conditional independence assumption with respect to missingness but need not meet the completeness conditions required by classical shadow-variable methods. When predictions are sufficiently informative, the bounds collapse to a point, recovering standard identification as a special case. In finite samples, to provide valid coverage of the identified set, we propose a set-expansion estimator that achieves slower-than-$\sqrt{n}$ convergence rate in the set-identified regime and the standard $\sqrt{n}$ rate under point identification. In simulations and semi-synthetic experiments on customer-service dialogues, we find that LLM predictions are often ill-conditioned for classical shadow-variable methods yet remain highly effective in our framework. They shrink identification intervals by 75–83% while maintaining valid coverage under realistic MNAR mechanisms.
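In the simplest discrete case the construction reduces to a pair of small linear programs, as sketched below: the pmf of the missing outcomes is the LP variable, and a (hypothetical) pretrained-model prediction of the missing mean enters as one extra linear constraint. All numbers are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

# Bound E[Y] for a {1..5}-valued outcome under MNAR missingness:
# q (pmf of Y among non-respondents) is the LP variable, and a model
# prediction that the missing mean lies in [2.0, 3.0] is an extra constraint.
support = np.arange(1, 6)
mean_obs, resp_rate = 4.2, 0.6          # observed-responder mean, response rate

A_eq, b_eq = [np.ones(5)], [1.0]        # q is a probability vector
A_ub = [support, -support]              # 2.0 <= sum_k k * q_k <= 3.0
b_ub = [3.0, -2.0]

bounds = []
for sign in (+1, -1):                   # minimize then maximize E[Y]
    c = sign * (1 - resp_rate) * support
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
    bounds.append(resp_rate * mean_obs + sign * res.fun)
print(f"identified interval for E[Y]: [{bounds[0]:.2f}, {bounds[1]:.2f}]")
```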
[LG-86] Edge-Local and Qubit-Efficient Quantum Graph Learning for the NISQ Era
链接: https://arxiv.org/abs/2602.16018
作者: Armin Ahmadkhaniha,Jake Doliskani
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:
Abstract:Graph neural networks (GNNs) are a powerful framework for learning representations from graph-structured data, but their direct implementation on near-term quantum hardware remains challenging due to circuit depth, multi-qubit interactions, and qubit scalability constraints. In this work, we introduce a fully quantum graph convolutional architecture designed explicitly for unsupervised learning in the noisy intermediate-scale quantum (NISQ) regime. Our approach combines a variational quantum feature extraction layer with an edge-local and qubit-efficient quantum message-passing mechanism inspired by the Quantum Alternating Operator Ansatz (QAOA) framework. Unlike prior models that rely on global operations or multi-controlled unitaries, our model decomposes message passing into pairwise interactions along graph edges using only hardware-native single- and two-qubit gates. This design reduces the qubit requirement from O(Nn) to O(n) for a graph with N nodes and n-qubit feature registers, enabling implementation on current quantum devices regardless of graph size. We train the model using the Deep Graph Infomax objective to perform unsupervised node representation learning. Experiments on the Cora citation network and a large-scale genomic SNP dataset demonstrate that our model remains competitive with prior quantum and hybrid approaches.
[LG-87] Imaging-Derived Coronary Fractional Flow Reserve: Advances in Physics-Based Machine-Learning and Physics-Informed Methods
链接: https://arxiv.org/abs/2602.16000
作者: Tanxin Zhu,Emran Hossen,Chen Zhao,Michele Esposito,Jiguang Sun,Weihua Zhou
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注: 26 pages 4 tables
Abstract:Purpose of Review Imaging derived fractional flow reserve (FFR) is rapidly evolving beyond conventional computational fluid dynamics (CFD) based pipelines toward machine learning (ML), deep learning (DL), and physics informed approaches that enable fast, wire free, and scalable functional assessment of coronary stenosis. This review synthesizes recent advances in CT and angiography based FFR, with particular emphasis on emerging physics informed neural networks and neural operators (PINNs and PINOs) and key considerations for their clinical translation. Recent Findings ML/DL approaches have markedly improved automation and computational speed, enabling prediction of pressure and FFR from anatomical descriptors or angiographic contrast dynamics. However, their real-world performance and generalizability can remain variable and sensitive to domain shift, due to multi-center heterogeneity, interpretability challenges, and differences in acquisition protocols and image quality. Physics informed learning introduces conservation structure and boundary condition consistency into model training, improving generalizability and reducing dependence on dense supervision while maintaining rapid inference. Recent evaluation trends increasingly highlight deployment oriented metrics, including calibration, uncertainty quantification, and quality control gatekeeping, as essential for safe clinical use. Summary The field is converging toward imaging derived FFR methods that are faster, more automated, and more reliable. While ML/DL offers substantial efficiency gains, physics informed frameworks such as PINNs and PINOs may provide a more robust balance between speed and physical consistency. Prospective multi center validation and standardized evaluation will be critical to support broad and safe clinical adoption.
[LG-88] Exploring New Frontiers in Vertical Federated Learning: the Role of Saddle Point Reformulation
链接: https://arxiv.org/abs/2602.15996
作者: Aleksandr Beznosikov,Georgiy Kormakov,Alexander Grigorievskiy,Mikhail Rudakov,Ruslan Nazykov,Alexander Rogozin,Anton Vakhrushev,Andrey Savchenko,Martin Takáč,Alexander Gasnikov
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 104 pages, 1 table, 9 figures, 10 theorems, 12 algorithms
Abstract:The objective of Vertical Federated Learning (VFL) is to collectively train a model using features available on different devices while sharing the same users. This paper focuses on the saddle point reformulation of the VFL problem via the classical Lagrangian function. We first demonstrate how this formulation can be solved using deterministic methods. More importantly, we explore various stochastic modifications to adapt to practical scenarios, such as employing compression techniques for efficient information transmission, enabling partial participation for asynchronous communication, and utilizing coordinate selection for faster local computation. We show that the saddle point reformulation plays a key role and opens up the possibility of using the aforementioned extensions, which appear impossible in the standard minimization formulation. Convergence estimates are provided for each algorithm, demonstrating their effectiveness in addressing the VFL problem. Additionally, alternative reformulations are investigated, and numerical experiments are conducted to validate the performance and effectiveness of the proposed approach.
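A toy illustration of the saddle-point reformulation (not the paper's algorithms): two parties hold feature blocks, the coupling z = X1 w1 + X2 w2 is enforced through a multiplier, and each party only needs its own block of the dual gradient. An extragradient scheme is used since the Lagrangian is convex-concave; the step size and ridge term are ad hoc choices.

```python
import numpy as np

# Saddle point of L = 0.5||z-y||^2 + 0.5*mu*(||w1||^2 + ||w2||^2)
#                    + lam^T (X1 w1 + X2 w2 - z)
# Party i updates w_i using only X_i^T lam, mirroring the VFL split.
rng = np.random.default_rng(0)
n, d1, d2, mu, eta = 30, 3, 3, 0.1, 0.05
X1, X2 = rng.normal(size=(n, d1)), rng.normal(size=(n, d2))
y = np.c_[X1, X2] @ rng.normal(size=d1 + d2)

w1, w2, z, lam = np.zeros(d1), np.zeros(d2), np.zeros(n), np.zeros(n)

def grads(w1, w2, z, lam):
    return (mu * w1 + X1.T @ lam, mu * w2 + X2.T @ lam,
            (z - y) - lam, X1 @ w1 + X2 @ w2 - z)

for _ in range(5000):
    g = grads(w1, w2, z, lam)                      # extrapolation step
    half = (w1 - eta * g[0], w2 - eta * g[1], z - eta * g[2], lam + eta * g[3])
    gh = grads(*half)                              # corrected step
    w1, w2 = w1 - eta * gh[0], w2 - eta * gh[1]
    z, lam = z - eta * gh[2], lam + eta * gh[3]

print("constraint violation:", np.linalg.norm(X1 @ w1 + X2 @ w2 - z).round(6))
```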
[LG-89] MadEvolve: Evolutionary Optimization of Cosmological Algorithms with Large Language Models
链接: https://arxiv.org/abs/2602.15951
作者: Tianyi Li,Shihui Zang,Moritz Münchmeyer
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注:
Abstract:We develop a general framework to discover scientific algorithms and apply it to three problems in computational cosmology. Our code, MadEvolve, is similar to Google’s AlphaEvolve, but places a stronger emphasis on free parameters and their optimization. Our code starts with a baseline human algorithm implementation, and then optimizes its performance metrics by making iterative changes to its code. As a further convenient feature, MadEvolve automatically generates a report that compares the input algorithm with the evolved algorithm, describes the algorithmic innovations and lists the free parameters and their function. Our code supports both auto-differentiable, gradient-based parameter optimization and gradient-free optimization methods. We apply MadEvolve to the reconstruction of cosmological initial conditions, 21cm foreground contamination reconstruction and effective baryonic physics in N-body simulations. In all cases, we find substantial improvements over the base algorithm. We make MadEvolve and our three tasks publicly available at this http URL.
[LG-90] Robust Stochastic Gradient Posterior Sampling with Lattice Based Discretisation
链接: https://arxiv.org/abs/2602.15925
作者: Zier Mensch,Lars Holdijk,Samuel Duffield,Maxwell Aifer,Patrick J. Coles,Max Welling,Miranda C. N. Cheng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Stochastic-gradient MCMC methods enable scalable Bayesian posterior sampling but often suffer from sensitivity to minibatch size and gradient noise. To address this, we propose Stochastic Gradient Lattice Random Walk (SGLRW), an extension of the Lattice Random Walk discretization. Unlike conventional Stochastic Gradient Langevin Dynamics (SGLD), SGLRW introduces stochastic noise only through the off-diagonal elements of the update covariance; this yields greater robustness to minibatch size while retaining asymptotic correctness. Furthermore, as a comparison, we analyze a natural analogue of SGLD that utilizes gradient clipping. Experimental validation on Bayesian regression and classification demonstrates that SGLRW remains stable in regimes where SGLD fails, including in the presence of heavy-tailed gradient noise, and matches or improves predictive performance.
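For contrast, here is a minimal SGLD loop with gradient clipping, the comparison analogue mentioned above; SGLRW's lattice discretization is not reproduced. On a conjugate toy problem the exact posterior is known, which makes the minibatch-induced over-dispersion visible.

```python
import numpy as np

# SGLD with gradient clipping for y_i ~ N(theta, 1) under a flat prior,
# so the exact posterior is N(ybar, 1/n). With small minibatches, gradient
# noise visibly over-disperses the chain relative to the exact posterior.
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, 1000)
n, eta, batch, clip = len(data), 1e-4, 32, 1e4

theta, samples = 0.0, []
for t in range(20000):
    mb = rng.choice(data, batch)
    grad = n * (theta - mb.mean())            # minibatch estimate of -grad log posterior
    grad = float(np.clip(grad, -clip, clip))  # clipping guards against heavy-tailed noise
    theta += -0.5 * eta * grad + np.sqrt(eta) * rng.normal()
    if t > 2000:
        samples.append(theta)

print(f"SGLD sd: {np.std(samples):.3f} vs exact posterior sd: {1 / np.sqrt(n):.4f}")
```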
[LG-91] Including Node Textual Metadata in Laplacian-constrained Gaussian Graphical Models
链接: https://arxiv.org/abs/2602.15920
作者: Jianhua Wang,Killian Cressant,Pedro Braconnot Velloso,Arnaud Breloy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Submitted to EUSIPCO 2026
Abstract:This paper addresses graph learning in Gaussian Graphical Models (GGMs). In this context, data matrices often come with auxiliary metadata (e.g., textual descriptions associated with each node) that is usually ignored in traditional graph estimation processes. To fill this gap, we propose a graph learning approach based on Laplacian-constrained GGMs that jointly leverages the node signals and such metadata. The resulting formulation yields an optimization problem, for which we develop an efficient majorization-minimization (MM) algorithm with closed-form updates at each iteration. Experimental results on a real-world financial dataset demonstrate that the proposed method significantly improves graph clustering performance compared to state-of-the-art approaches that use either signals or metadata alone, thus illustrating the interest of fusing both sources of information.
[LG-92] Steering Dynamical Regimes of Diffusion Models by Breaking Detailed Balance
链接: https://arxiv.org/abs/2602.15914
作者: Haiqi Lu,Ying Tang
类目: atistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注:
Abstract:We show that deliberately breaking detailed balance in generative diffusion processes can accelerate the reverse process without changing the stationary distribution. Considering the Ornstein–Uhlenbeck process, we decompose the dynamics into a symmetric component and a non-reversible anti-symmetric component that generates rotational probability currents. We then construct an exponentially optimal non-reversible perturbation that improves the long-time relaxation rate while preserving the stationary target. We analyze how such non-reversible control reshapes the macroscopic dynamical regimes of the phase transitions recently identified in generative diffusion models. We derive a general criterion for the speciation time and show that suitable non-reversible perturbations can accelerate speciation. In contrast, the collapse transition is governed by a trace-controlled phase-space contraction mechanism that is fixed by the symmetric component, and the corresponding collapse time remains unchanged under anti-symmetric perturbations. Numerical experiments on Gaussian mixture models support these findings.
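The acceleration mechanism can be checked on a toy 2-D Ornstein-Uhlenbeck process, as sketched below: adding an antisymmetric drift term c*J leaves the standard Gaussian stationary but shortens the integrated autocorrelation time. The strength c, lag grid, and simulation length are illustrative.

```python
import numpy as np

# 2-D OU process dx = -(I + c*J) x dt + sqrt(2) dW with antisymmetric J:
# the c*J term cancels in the Lyapunov equation, so N(0, I) stays stationary,
# while the rotational current speeds up decorrelation.
rng = np.random.default_rng(0)
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
dt, T = 0.01, 100_000

def integrated_autocorr(c):
    A = np.eye(2) + c * J
    x, xs = np.zeros(2), np.empty(T)
    for t in range(T):
        x = x - (A @ x) * dt + np.sqrt(2 * dt) * rng.normal(size=2)
        xs[t] = x[0]
    xc = xs - xs.mean()
    acf = [np.corrcoef(xc[:-k], xc[k:])[0, 1] for k in range(1, 400)]
    return dt * (1.0 + sum(acf))   # crude integrated autocorrelation time

for c in (0.0, 2.0):
    print(f"c={c}: integrated autocorrelation time ~ {integrated_autocorr(c):.2f}")
```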



