本篇博文主要内容为 2026-02-27 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。
说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。
提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。
目录
概览 (2026-02-27)
今日共更新644篇论文,其中:
- 自然语言处理共82篇(Computation and Language (cs.CL))
- 人工智能共215篇(Artificial Intelligence (cs.AI))
- 计算机视觉共139篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共206篇(Machine Learning (cs.LG))
- 多智能体系统共9篇(Multiagent Systems (cs.MA))
- 信息检索共35篇(Information Retrieval (cs.IR))
- 人机交互共32篇(Human-Computer Interaction (cs.HC))
多智能体系统
[MA-0] ParamMem: Augmenting Language Agents with Parametric Reflective Memory
【速读】:该论文旨在解决语言代理(language agents)在自我反思(self-reflection)过程中产生的重复输出问题,这一问题限制了其推理性能。研究表明,提高反思多样性(reflective diversity)是改善性能的有效途径。解决方案的关键在于提出ParamMem——一种参数化记忆模块(parametric memory module),该模块将跨样本的反思模式编码到模型参数中,并通过温度控制采样实现多样化的反思生成。在此基础上构建的ParamAgent框架进一步融合了情景记忆(episodic memory)与跨样本记忆,显著提升了代码生成、数学推理和多跳问答等任务上的表现,且具备样本高效性、模型规模间的弱到强迁移能力及无需依赖更强外部模型的自提升潜力。
链接: https://arxiv.org/abs/2602.23320
作者: Tianjun Yao,Yongqiang Chen,Yujia Zheng,Pan Li,Zhiqiang Shen,Kun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 20 pages
Abstract:Self-reflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent studies have attempted to address this limitation through various approaches, among which increasing reflective diversity has shown promise. Our empirical analysis reveals a strong positive correlation between reflective diversity and task success, further motivating the need for diverse reflection signals. We introduce ParamMem, a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Building on this module, we propose ParamAgent, a reflection-based agent framework that integrates parametric memory with episodic and cross-sample memory. Extensive experiments on code generation, mathematical reasoning, and multi-hop question answering demonstrate consistent improvements over state-of-the-art baselines. Further analysis reveals that ParamMem is sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without reliance on stronger external model, highlighting the potential of ParamMem as an effective component for enhancing language agents.
[MA-1] A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中潜在的隐写术(steganography)行为难以检测与量化的问题。当前基于经典隐写术定义的检测方法依赖于已知的非隐写信号参考分布,但在LLM场景下这一假设不成立,导致传统方法失效。论文提出了一种决策理论视角下的新解决方案,其关键在于引入广义的V-信息(generalised V-information),这是一种以效用为导向的信息度量框架,用于衡量输入中可被利用的信息量;进而定义了“隐写缺口”(steganographic gap),即通过比较能够解码隐藏内容的代理与不能解码的代理在下游任务中的效用差异来量化隐写行为。该方法无需参考分布即可有效检测、量化并缓解LLM中的隐写推理。
链接: https://arxiv.org/abs/2602.23163
作者: Usman Anwar,Julianna Piskorz,David D. Baek,David Africa,Jim Weatherall,Max Tegmark,Christian Schroeder de Witt,Mihaela van der Schaar,David Krueger
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Theory (cs.IT); Multiagent Systems (cs.MA)
备注: First two authors contributed equally
Abstract:Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, \textbfdecision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents’ observable actions. To formalise this perspective, we introduce generalised \mathcalV -information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the \textbfsteganographic gap – a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
[MA-2] ClawMobile: Rethinking Smartphone-Native Agent ic Systems
【速读】:该论文旨在解决在智能手机这一资源受限且动态变化的环境中实现生成式 AI(Generative AI)代理(agentic systems)自主性的挑战。与云端或桌面环境不同,移动设备具有执行上下文受限、控制接口碎片化以及应用状态频繁变动等特点,使得传统基于大语言模型(LLMs)的推理与控制方式难以稳定运行。为此,论文提出 ClawMobile,其核心解决方案是采用分层架构,将高层语言推理与结构化、确定性的控制路径分离,从而提升在真实设备上的执行稳定性与可复现性。关键在于通过这种架构设计,实现概率性规划与确定性系统接口之间的原则性协同,为构建可靠的智能手机原生智能代理提供可扩展的设计范式。
链接: https://arxiv.org/abs/2602.22942
作者: Hongchao Du,Shangyu Wu,Qiao Li,Riwei Pan,Jinheng Li,Youcheng Sun,Chun Jason Xue
机构: MBZUAI(穆巴达拉人工智能研究所); City University of Hong Kong(香港城市大学)
类目: Multiagent Systems (cs.MA)
备注: 7 pages, 1 figures
Abstract:Smartphones represent a uniquely challenging environment for agentic systems. Unlike cloud or desktop settings, mobile devices combine constrained execution contexts, fragmented control interfaces, and rapidly changing application states. As large language models (LLMs) evolve from conversational assistants to action-oriented agents, achieving reliable smartphone-native autonomy requires rethinking how reasoning and control are composed. We introduce ClawMobile as a concrete exploration of this design space. ClawMobile adopts a hierarchical architecture that separates high-level language reasoning from structured, deterministic control pathways, improving execution stability and reproducibility on real devices. Using ClawMobile as a case study, we distill the design principles for mobile LLM runtimes and identify key challenges in efficiency, adaptability, and stability. We argue that building robust smartphone-native agentic systems demands principled coordination between probabilistic planning and deterministic system interfaces. The implementation is open-sourced~\footnotethis https URL to facilitate future exploration. Comments: 7 pages, 1 figures Subjects: Multiagent Systems (cs.MA) Cite as: arXiv:2602.22942 [cs.MA] (or arXiv:2602.22942v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2602.22942 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-3] Robust Information Design for Multi-Agent Systems with Complementarities: Smallest-Equilibrium Threshold Policies AAMAS2026
【速读】:该论文致力于解决多智能体系统(Multi-Agent Systems, MAS)中信息设计(Information Design)的问题,尤其在具有二元行动和战略互补性(strategic complementarities)的情境下,外部设计者仅能通过信号影响代理人的行为。研究聚焦于如何设计最优信号结构,使得在代理人选择最小均衡(smallest-equilibrium)这一保守、厌恶协调的行为模式下,实现社会福利的最大化。解决方案的关键在于:当效用函数具有凸势函数(convex potential)且福利函数为凸时,最优实施结果具有极简形式——即在每个状态上实现完全协调:要么所有代理人都采取行动,要么无人行动。作者提出一种构造性的阈值规则:对每个状态计算一维评分,排序后选取单一阈值(最多一个状态使用边缘抽奖),该规则是满足可行性与顺序服从约束(sequential obedience constraints)的线性规划(LP)的一个显式最优顶点解。此方法在疫苗接种和技术创新采纳等实证场景中表现优异,计算复杂度为 O(∣Θ∣log∣Θ∣),并避免了仅基于服从约束设计所导致的福利高估问题,提供了一种通用且可扩展的鲁棒协调机制。
链接: https://arxiv.org/abs/2602.22915
作者: Farzaneh Farhadi,Maria Chli
机构: Aston University (阿斯顿大学)
类目: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注: This paper has been accepted for publication in Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026). The final published version will be available via the ACM Digital Library
Abstract:We study information design in multi-agent systems (MAS) with binary actions and strategic complementarities, where an external designer influences behavior only through signals. Agents play the smallest-equilibrium of the induced Bayesian game, reflecting conservative, coordination-averse behavior typical in distributed systems. We show that when utilities admit a convex potential and welfare is convex, the robustly implementable optimum has a remarkably simple form: perfect coordination at each state: either everyone acts or no one does. We provide a constructive threshold rule: compute a one-dimensional score for each state, sort states, and pick a single threshold (with a knife-edge lottery for at most one state). This rule is an explicit optimal vertex of a linear program (LP) characterized by feasibility and sequential obedience constraints. Empirically, in both vaccination and technology-adoption domains, our constructive policy matches LP optima, scales as O(|\Theta|\log|\Theta|) , and avoids the inflated welfare predicted by obedience-only designs that assume the designer can dictate the (best) equilibrium. The result is a general, scalable recipe for robust coordination in MAS with complementarities.
[MA-4] QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning ICAPS2026
【速读】:该论文旨在解决协作式多智能体强化学习(Cooperative Multi-Agent Reinforcement Learning, MARL)中基于值分解(Value Decomposition, VD)方法因使用最大操作符(max operator)计算时序差分(Temporal-Difference, TD)目标而导致的系统性Q值高估问题。由于联合动作空间的组合爆炸,这种高估会显著引发学习不稳定和策略次优。解决方案的关键在于提出QSIM——一种基于动作相似性的加权Q-learning框架,通过构建一个结构化的近贪婪联合动作空间,并以动作相似性为权重对Q值进行期望加权,从而在保持行为相关性的前提下平滑TD目标,有效缓解高估问题并提升学习稳定性。
链接: https://arxiv.org/abs/2602.22786
作者: Yuanjun Li,Bin Zhang,Hao Chen,Zhouyang Jiang,Dapeng Li,Zhiwei Xu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 15 figures, 7tables. Accepted to the 36th International Conference on Automated Planning and Scheduling (ICAPS 2026)
Abstract:Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at this https URL.
[MA-5] Sustainable Multi-Agent Crowdsourcing via Physics-Informed Bandits
【速读】:该论文旨在解决众包平台在任务分配中面临的四重矛盾:分配质量(allocation quality)、劳动力可持续性(workforce sustainability)、操作可行性(operational feasibility)与策略性承包商行为(strategic contractor behaviour),即所谓的“冷启动-疲劳-利用率-策略代理困境”(Cold-Start, Burnout, Utilisation, and Strategic Agency Dilemma)。现有方法仅能同时缓解其中两项矛盾,例如贪婪启发式或多准则决策(MCDM)方法虽可实现初始高质量分配,但导致承包商严重疲劳;而 bandit 算法虽避免疲劳,却需 100% 工作人员参与,不具操作可行性。为应对这一挑战,作者提出 FORGE——一个基于物理规律的 K+1 多智能体仿真器,将每个承包商建模为理性代理,根据自身疲劳状态动态设定负载接受阈值,从而将标准的被动 Restless Multi-Armed Bandit (RMAB) 转化为真正的 Stackelberg 博弈。在此框架下,设计了 Neural-Linear UCB 分配器,融合双塔嵌入网络与从离线仿真中提取的物理信息协方差先验(Physics-Informed Covariance Prior),该先验同时预热技能簇几何结构和 UCB 探索空间,从第 1 轮即提供几何感知信念状态,显著减少冷启动问题。实验表明,在 T = 200 冷启动轮次内,该方法在非 oracle 方法中获得最高奖励(LRew=0.555±0.041),且仅需 7.6% 的工作利用率,远优于传统基线,并具备对高达 50% 的员工流动率和噪声强度 σ=0.20 的鲁棒性。
链接: https://arxiv.org/abs/2602.22365
作者: Chayan Banerjee
机构: Queensland University of Technology (昆士兰理工大学)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Crowdsourcing platforms face a four-way tension between allocation quality, workforce sustainability, operational feasibility, and strategic contractor behaviour–a dilemma we formalise as the Cold-Start, Burnout, Utilisation, and Strategic Agency Dilemma. Existing methods resolve at most two of these tensions simultaneously: greedy heuristics and multi-criteria decision making (MCDM) methods achieve Day-1 quality but cause catastrophic burnout, while bandit algorithms eliminate burnout only through operationally infeasible 100% workforce this http URL address this, we introduce FORGE, a physics-grounded K+1 multi-agent simulator in which each contractor is a rational agent that declares its own load-acceptance threshold based on its fatigue state, converting the standard passive Restless Multi-Armed Bandit (RMAB) into a genuine Stackelberg game. Operating within FORGE, we propose a Neural-Linear UCB allocator that fuses a Two-Tower embedding network with a Physics-Informed Covariance Prior derived from offline simulator interactions. The prior simultaneously warm-starts skill-cluster geometry and UCB exploration landscape, providing a geometry-aware belief state from episode 1 that measurably reduces cold-start this http URL T = 200 cold-start episodes, the proposed method achieves the highest reward of all non-oracle methods ( \textLRew = 0.555 \pm 0.041 ) at only 7.6% workforce utilisation–a combination no conventional baseline achieves–while maintaining robustness to workforce turnover up to 50% and observation noise up to \sigma = 0.20 .
[MA-6] Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents
【速读】:该论文旨在解决当前自主AI代理(AI agents)在部署过程中因缺乏形式化行为规范而导致的行为漂移(drift)、治理失效及项目失败问题。传统软件依赖API、类型系统等契约机制来保证正确性,而AI代理则基于自然语言提示运行,无明确的行为约束,从而引发不确定性。解决方案的关键在于提出Agent Behavioral Contracts (ABC)——一种将“设计即契约”(Design-by-Contract)原则引入自主AI代理的形式化框架,其核心结构为C = (P, I, G, R),即预条件(Preconditions)、不变量(Invariants)、治理策略(Governance policies)与恢复机制(Recovery mechanisms),这些组件均为可运行时强制执行的第一类实体。通过定义(p, delta, k)-满足性这一概率性合规度量,并结合漂移边界定理(Drift Bounds Theorem),证明了具备恢复能力的合同可将期望行为漂移限制在D* = α/γ以内,且在随机环境下呈现高斯集中特性,从而显著提升代理系统的稳定性与可控性。
链接: https://arxiv.org/abs/2602.22302
作者: Varun Pratap Bhardwaj
机构: Accenture(埃森哲)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注: 71 pages, 7 figures, 14 tables. Patent pending. Also available on Zenodo: DOI https://doi.org/10.5281/zenodo.18775393
Abstract:Traditional software relies on contracts – APIs, type systems, assertions – to specify and enforce correct behavior. AI agents, by contrast, operate on prompts and natural language instructions with no formal behavioral specification. This gap is the root cause of drift, governance failures, and frequent project failures in agentic AI deployments. We introduce Agent Behavioral Contracts (ABC), a formal framework that brings Design-by-Contract principles to autonomous AI agents. An ABC contract C = (P, I, G, R) specifies Preconditions, Invariants, Governance policies, and Recovery mechanisms as first-class, runtime-enforceable components. We define (p, delta, k)-satisfaction – a probabilistic notion of contract compliance that accounts for LLM non-determinism and recovery – and prove a Drift Bounds Theorem showing that contracts with recovery rate gamma alpha (the natural drift rate) bound behavioral drift to D* = alpha/gamma in expectation, with Gaussian concentration in the stochastic setting. We establish sufficient conditions for safe contract composition in multi-agent chains and derive probabilistic degradation bounds. We implement ABC in AgentAssert, a runtime enforcement library, and evaluate on AgentContract-Bench, a benchmark of 200 scenarios across 7 models from 6 vendors. Results across 1,980 sessions show that contracted agents detect 5.2-6.8 soft violations per session that uncontracted baselines miss entirely (p 0.0001, Cohen’s d = 6.7-33.8), achieve 88-100% hard constraint compliance, and bound behavioral drift to D* 0.27 across extended sessions, with 100% recovery for frontier models and 17-100% across all models, at overhead 10 ms per action.
[MA-7] An Adaptive Multichain Blockchain: A Multiobjective Optimization Approach
【速读】:该论文旨在解决区块链在面对动态需求与容量变化时的可扩展性瓶颈问题,以及现有多链(multichain)设计静态配置导致资源利用率低的问题。其核心解决方案是将区块链配置建模为一个多智能体资源分配问题:应用方和运营商声明需求、容量及价格边界,优化器在每个周期(epoch)内将它们分组形成临时链,并设定链级清算价格,目标函数为治理权重加权的归一化应用、运营商与系统效用之和。该模型具备模块化特性,可支持能力兼容性、应用类型多样性及跨周期稳定性,且优化结果可在链下求解并链上验证,从而实现高效、公平且激励相容的资源配置。
链接: https://arxiv.org/abs/2602.22230
作者: Nimrod Talmon,Haim Zysberg
机构: Ben-Gurion University (本古里安大学)
类目: Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Blockchains are widely used for secure transaction processing, but their scalability remains limited, and existing multichain designs are typically static even as demand and capacity shift. We cast blockchain configuration as a multiagent resource-allocation problem: applications and operators declare demand, capacity, and price bounds; an optimizer groups them into ephemeral chains each epoch and sets a chain-level clearing price. The objective maximizes a governance-weighted combination of normalized utilities for applications, operators, and the system. The model is modular – accommodating capability compatibility, application-type diversity, and epoch-to-epoch stability – and can be solved off-chain with outcomes verifiable on-chain. We analyze fairness and incentive issues and present simulations that highlight trade-offs among throughput, decentralization, operator yield, and service stability.
[MA-8] WICE: An LLM Agent Framework for Simulating Personalized User Tweeting Behavior with Long-term Temporal Features
【速读】:该论文旨在解决现有用户模拟器在处理需要建模时间特征的任务时表现不足的问题,尤其是难以捕捉个体用户的长期行为动态。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的框架TWICE,该框架融合了个性化用户画像、事件驱动的记忆模块以及个性化风格重写的工作流,从而实现对用户推文行为的个性化模拟,并有效保留长期时间特性。
链接: https://arxiv.org/abs/2602.22222
作者: Bingrui Jin,Kunyao Lan,Mengyue Wu
机构: 未知
类目: Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注:
Abstract:User simulators are often used to generate large amounts of data for various tasks such as generation, training, and evaluation. However, existing approaches concentrate on collective behaviors or interactive systems, struggling with tasks that require modeling temporal characteristics. To address this limitation, we propose TWICE, an LLM-based framework that leverages the long-term temporal and personalized features of social media data. This framework integrates personalized user profiling, an event-driven memory module, and a workflow for personalized style rewriting, enabling simulation of personalized user tweeting behavior while capturing long-term temporal characteristics. In addition, we conduct a comprehensive evaluation with a focus on analyzing tweeting style and event-based changes in behavior. Experiment results demonstrate that our framework improves personalized user simulation by effectively incorporating temporal dynamics, providing a robust solution for long-term behavior tracking.
自然语言处理
[NLP-0] Scale Cant Overcome Prag matics: The Impact of Reporting Bias on Vision-Language Reasoning ACL2026
【速读】: 该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在空间、时间、否定和计数等推理能力上的缺失问题,其根源被归因于训练数据中存在的“报告偏差”(reporting bias),即人类在描述视觉内容时倾向于省略隐含的推理所需信息。解决方案的关键在于:通过有意识地收集包含显式标注的训练数据以获取这些被忽略的隐含信息,而非单纯依赖增加数据规模、模型规模或多语言扩展来期望推理能力的自然涌现。实验证明,针对性的数据标注能显著提升VLMs在上述推理任务上的表现。
链接: https://arxiv.org/abs/2602.23351
作者: Amita Kamath,Jack Hessel,Khyathi Chandu,Jena D. Hwang,Kai-Wei Chang,Ranjay Krishna
机构: University of Washington (华盛顿大学); University of California, Los Angeles (加州大学洛杉矶分校); Samaya AI (Samaya AI); Mistral AI (Mistral AI); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: TACL 2026
Abstract:The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., “at the game today!” is a more likely caption than “a photo of 37 people standing behind a field”. We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
[NLP-1] A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
【速读】: 该论文针对对话中情绪识别(Emotion Recognition in Conversations, ERC)任务中的两大核心挑战展开研究:一是如何建模多轮对话的时序特性并提取单模态(语音与文本)的上下文信息,二是如何有效融合多模态线索以提升情绪预测准确性。解决方案的关键在于提出一种模块化的专家混合(Mixture-of-Experts, MoE)框架MiSTER-E,其通过分离建模与融合两个阶段实现优化:首先利用分别微调过的语言模型提取语音和文本的丰富话语级嵌入,并结合卷积-循环结构增强上下文建模能力;随后引入三个专家——仅语音、仅文本及跨模态专家——并通过可学习门控机制动态加权其输出;此外,设计了监督对比损失和KL散度正则项以促进模态间对齐与预测一致性。该方法无需依赖说话人身份信息,在IEMOCAP、MELD和MOSI三个基准数据集上均取得优于现有系统的效果。
链接: https://arxiv.org/abs/2602.23300
作者: Soumya Dutta,Smruthi Balaji,Sriram Ganapathy
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to Elsevier Computer Speech and Language. 30 pages, 9 figures, 5 tables
Abstract:Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts-speech-only, text-only, and cross-modal-using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regulariza-tion across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets-IEMOCAP, MELD, and MOSI-show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
[NLP-2] Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
【速读】: 该论文旨在解决级联式语音对话系统中响应延迟高这一关键问题,传统ASR-LLM-TTS(自动语音识别-大语言模型-文本转语音)流水线采用严格串行范式,需等待完整语音转录和全部推理完成后才能开始语音合成,导致用户体验不佳。其解决方案的核心在于提出一种低延迟架构——话语感知双轨流式响应框架(Discourse-Aware Dual-Track Streaming Response, DDTSR),通过三个关键技术机制实现:(1) 连接词引导的小-大模型协同机制,使小型模型并行生成最小承诺的话语连接词,大型模型同步进行知识密集型推理;(2) 基于流式的跨模态协作机制,动态重叠ASR、LLM推理与TTS过程以提前可发声时刻;(3) 基于课程学习的话语连续性增强机制,确保早期响应与后续推理输出之间保持语义连贯性和逻辑一致性。实验表明,DDTSR在两个语音对话基准上将响应延迟降低19%-51%,同时保持高质量话语输出,并具备良好的模块化兼容性与可扩展性。
链接: https://arxiv.org/abs/2602.23266
作者: Siyuan Liu,Jiahui Xu,Feng Jiang,Kuang Wang,Zefeng Zhao,Chu-Ren Huang,Jinghang Gu,Changqing Yin,Haizhou Li
机构: Tongji University (同济大学); Hong Kong Polytechnic University (香港理工大学); Shenzhen University of Advanced Technology (深圳先进技术研究院); Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computation and Language (cs.CL)
备注:
Abstract:Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.
[NLP-3] Agent DropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
【速读】: 该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在复杂推理任务中因个体参与者产生错误信息而导致的误差传播问题。现有解决方案通常依赖于结构上的刚性设计或昂贵的微调策略,限制了系统的部署灵活性与适应能力。其核心创新在于提出AgentDropoutV2——一种无需重训练即可在测试时动态优化MAS信息流的“修正或剔除”剪枝框架。该方案的关键在于构建一个基于失败驱动指标池的检索增强型修正器(retrieval-augmented rectifier),通过蒸馏后的失败模式作为先验知识精准识别潜在错误,并对无法修复的输出进行剪枝以阻断错误扩散;同时引入回退机制保障系统完整性。实验证明,该方法在多个数学基准上平均提升准确率6.3个百分点,且具备良好的泛化能力和任务难度自适应调节能力。
链接: https://arxiv.org/abs/2602.23258
作者: Yutong Wang,Siyuan Xiong,Xuebo Liu,Wenkang Zhou,Liang Ding,Miao Zhang,Min Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval-augmented rectifier to iteratively correct errors based on a failure-driven indicator pool. This mechanism allows for the precise identification of potential errors using distilled failure patterns as prior knowledge. Irreparable outputs are subsequently pruned to prevent error propagation, while a fallback strategy preserves system integrity. Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS’s task performance, achieving an average accuracy gain of 6.3 percentage points on math benchmarks. Furthermore, the system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty while leveraging context-aware indicators to resolve a wide spectrum of error patterns. Our code and dataset are released at this https URL.
[NLP-4] Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在实际应用中常出现的“伪自回归”(autoregressive-like)解码行为问题,即尽管DLM理论上支持并行生成token,但其实际解码过程往往收敛到类似左到右自回归(autoregressive, AR)的序列模式,从而限制了对并行硬件资源的有效利用和延迟优化。解决方案的关键在于重新设计训练数据与监督信号,提出一种名为NAP(Non-Autoregressive Parallel DLMs)的数据驱动方法:通过将训练样本构造为多个独立的推理轨迹(reasoning trajectories),并配合强制并行解码策略(parallel-forced decoding),引导模型学习多token同时更新的非自回归(non-autoregressive, non-AR)生成机制。实验表明,NAP在数学推理基准上相较于传统长链式思维(chain-of-thought, CoT)监督的DLM表现出更强的并行解码性能,且随着并行度提升,优势更加显著。
链接: https://arxiv.org/abs/2602.23225
作者: Pengxiang Li,Dilxat Muhtar,Lu Yin,Tianlong Chen,Shiwei Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR’s sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at this https URL.
[NLP-5] InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在解码阶段因键值缓存(Key-Value Cache, KV cache)内存占用随序列长度线性增长而导致的硬件资源瓶颈问题。其核心解决方案是提出一种面向硬件优化的KV缓存量化方法InnerQ,关键创新在于:采用基于内维(inner dimension)分组的量化策略,使反量化操作与向量-矩阵乘法对齐,并支持尺度因子在GPU计算单元间的复用,从而显著减少内存访问并加速解码过程;同时结合混合量化、高精度窗口机制和通道级归一化等技术,在高压缩比下仍保持模型精度,实测性能相较现有方法最高提升22%,相较半精度矩阵乘法最高提升88%。
链接: https://arxiv.org/abs/2602.23200
作者: Sayed Mohammadreza Tayaranian Hosseini,Amir Ardakani,Warren J. Gross
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 16 pages, 4 figures, 4 tables, 2 algorithms
Abstract:Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is the key-value (KV) cache, whose size scales with sequence length and easily dominates the memory footprint of the model. Previous work proposed quantization methods that are focused on compressing the KV cache while maintaining its information. We introduce InnerQ, a hardware-aware KV-cache quantization scheme that lowers decode latency without sacrificing accuracy. InnerQ applies group-wise quantization while grouping the cache matrices over their inner dimension. Unlike previous work that group over the outer dimension, InnerQ aligns dequantization with the vector-matrix multiplication and enables scale factor reuse across GPU compute units. This reduces memory accesses and accelerates dequantization, yielding up to 22% speedup over previous work and up to 88% over half-precision vector-matrix multiplication. To preserve fidelity under aggressive compression, InnerQ incorporates (i) hybrid quantization, selecting symmetric or asymmetric quantization per group based on local statistics; (ii) high-precision windows for both the most recent tokens and the attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the query to avoid runtime overhead. Our evaluation experiments on Llama models shows that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
[NLP-6] Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
【速读】: 该论文旨在解决预训练大语言模型在微调(fine-tuning)后,其上下文学习能力(in-context learning) 可能退化的问题。具体而言,尽管微调可提升零样本(zero-shot)性能并降低推理成本,但现有方法常导致模型在未见过的任务上丧失利用示例进行快速适应的能力。解决方案的关键在于:通过限制微调过程中注意力参数的更新范围——仅优化值矩阵(value matrix)而非全部注意力参数——可在保持上下文学习能力的同时提升零样本性能;此外,引入辅助的少样本损失(few-shot loss)虽能增强目标任务上的上下文学习表现,但会损害模型在未见任务上的泛化能力。理论分析基于线性注意力机制,揭示了微调目标如何改变注意力结构,并为平衡零样本与上下文学习性能提供了可解释的优化路径。
链接: https://arxiv.org/abs/2602.23197
作者: Chungpa Lee,Jy-yong Sohn,Kangwook Lee
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.
[NLP-7] MTRAG -UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations
【速读】: 该论文旨在解决多轮检索增强生成(Multi-turn Retrieval Augmented Generation, MTRAG)中尚未明确的开放性挑战问题,尤其关注对话中存在UNanswerable(无法回答)、UNderspecified(信息不足)、NONstandalone(非独立问题)以及UNclear(不清晰)响应等复杂场景下的模型性能瓶颈。解决方案的关键在于构建了一个涵盖6个领域、包含666项任务和超过2800轮对话的标准化基准测试集(MTRAG-UN),并系统评估现有检索与生成模型在上述挑战性场景中的表现,从而为后续研究提供可量化、可复现的评测标准与改进方向。
链接: https://arxiv.org/abs/2602.23184
作者: Sara Rosenthal,Yannis Katsis,Vraj Shah,Lihong He,Lucian Popa,Marina Danilevsky
机构: IBM Research(IBM研究院)
类目: Computation and Language (cs.CL)
备注: 5 pages, 3 figures
Abstract:We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at this https URL
[NLP-8] Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLM s
【速读】: 该论文旨在解决多模态大语言模型(Multimodal LLMs)中存在的一种“不匹配解码器问题”(mismatched decoder problem):尽管模型在编码阶段能够保留语音中的说话者身份、情感信息以及视觉图像的纹理等非文本特征(线性探测性能显著高于随机水平),但这些信息在解码阶段无法被有效利用,导致模型对这些属性的可访问性受限。其关键在于指出,这种限制并非源于编码器或输入投影机制,而是由于解码器仅能提取与文本对齐的方向信息,其评分规则决定了可用信息的上限——即广义互信息(Generalized Mutual Information, GMI)边界,该边界受分布距离和解码器敏感度影响。解决方案的核心是通过引入任务相关的训练目标(如情绪识别目标)进行LoRA微调干预,从而引导解码器学习到对特定属性(如情感)更敏感的方向,实验证明该方法可提升相关属性的可访问性(+7.5%),且不影响其他属性,验证了训练目标决定信息可访问性的核心命题。
链接: https://arxiv.org/abs/2602.23136
作者: Jayadev Billa
机构: Unaffiliated researcher; ISI@USC; Yahoo; Nuance; BBN
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 11 tables, 2 figures. Code: this https URL
Abstract:Multimodal LLMs can process speech and images, but they cannot hear a speaker’s voice or see an object’s texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3–55 \times above chance in linear probes), yet removing 64–71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder’s scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder’s scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility ( + 7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible. Comments: 22 pages, 11 tables, 2 figures. Code: this https URL Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2602.23136 [cs.CL] (or arXiv:2602.23136v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.23136 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-9] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在文本数据中引发的无意身份识别(deanonymization)风险问题,尤其是在新闻文章等匿名化文本中,LLM具备强大的作者溯源能力,可能危及作者隐私。其解决方案的关键在于提出一种名为SALA(Stylometry-Assisted LLM Analysis)的方法,该方法通过结构化、可解释的流程将定量风格特征(stylometric features)与LLM推理相结合,实现鲁棒且透明的作者归属分析;同时,进一步设计了一种基于代理推理轨迹的引导重写策略,利用生成式AI(Generative AI)技术生成改写提示,在保持语义不变的前提下有效降低文本的作者可识别性,从而构建主动、可解释的隐私保护机制。
链接: https://arxiv.org/abs/2602.23079
作者: Boyang Zhang,Yang Zhang
机构: CISPA Helmholtz Center for Information Security
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended deanonymization risks in textual data such as news articles. In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline. Central to our framework is the proposed \textitSALA (Stylometry-Assisted LLM Analysis) method, which integrates quantitative stylometric features with LLM reasoning for robust and transparent authorship attribution. Experiments on large-scale news datasets demonstrate that \textitSALA , particularly when augmented with a database module, achieves high inference accuracy in various scenarios. Finally, we propose a guided recomposition strategy that leverages the agent’s reasoning trace to generate rewriting prompts, effectively reducing authorship identifiability while preserving textual meaning. Our findings highlight both the deanonymization potential of LLM agents and the importance of interpretable, proactive defenses for safeguarding author privacy.
[NLP-10] Quantity Convergence Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody
【速读】: 该论文旨在解决第二语言(L2)学习者在习得目标语法结构后,如何将这些语法结构映射到恰当的韵律结构(prosodic structure)这一长期存在的挑战,特别是探讨这种语法-韵律接口的固化(fossilization)与稳定性问题。其解决方案的关键在于通过整合C-ToBI韵律边界标注与依存句法分析(Dependency Grammar analysis),定量评估韵律边界数量及其对语法关系的映射方式,从而揭示高阶语言能力学习者虽在韵律边界数量上趋近母语者,但在边界位置的结构性映射上仍存在系统性偏差,表现为对主谓(SBV)界面的韵律边界弱化和对动宾(VOB)界面的错误强化,导致韵律层级结构失真。
链接: https://arxiv.org/abs/2602.23071
作者: Yuqi Shi,Hao Yang,Xiyao Lu,Jinsong Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge. This study investigates the fossilization and stability of the L2 syntax-prosody interface by comparing 67 native Mandarin speakers with 67 Vietnamese learners using the BLCU-SAIT corpus. By integrating C-ToBI boundary annotation with Dependency Grammar analysis, we examined both the quantity of prosodic boundaries and their mapping to syntactic relations. Results reveal a non-linear acquisition: although high-proficiency learners (VNH) converge to the native baseline in boundary quantity at the Major Phrase level (B3), their structural mapping significantly diverges. Specifically, VNH demote the prosodic boundary at the Subject-Verb (SBV) interface (Major Phrase B3 - Prosodic Word B1), while erroneously promoting the boundary at the Verb-Object (VOB) interface (Prosodic Word B1 - Major Phrase B3). This strategy allows learners to maintain high long phrasal output at the expense of structural accuracy. This results in a distorted prosodic hierarchy where the native pattern is inverted.
[NLP-11] Make It Hard to Hear Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
【速读】: 该论文旨在解决孟加拉语(Bengali)长时音频中自动语音识别(ASR)与说话人聚类(speaker diarization)的双重挑战,尤其针对该语言在联合ASR与聚类资源严重匮乏的问题。解决方案的关键在于:对于ASR任务,采用基于完美对齐标注数据并结合合成声学退化(噪声和混响)的针对性微调策略,而非简单的原始数据缩放;而对于说话人聚类任务,则发现主流开源模型表现不佳,通过策略性启发式后处理基线模型输出成为提升准确率的主要手段,最终构建出一个实时因子(RTF)约为0.019的高效双通道流水线系统,为低资源、长时语音处理提供了实证基准。
链接: https://arxiv.org/abs/2602.23070
作者: Sanjid Hasan,Risalat Labib,A H M Fuad,Bayazid Hasan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 4 pages, 2 figures
Abstract:Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. In this paper, detailing our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning utilizing perfectly aligned annotations paired with synthetic acoustic degradation (noise and reverberation) emerges as the singular most effective approach. Conversely, for speaker diarization, we observed that global open-source state-of-the-art models (such as Diarizen) performed surprisingly poorly on this complex dataset. Extensive model retraining yielded negligible improvements; instead, strategic, heuristic post-processing of baseline model outputs proved to be the primary driver for increasing accuracy. Ultimately, this work outlines a highly optimized dual pipeline achieving a \sim 0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
[NLP-12] oward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency Department
【速读】: 该论文旨在解决临床病例报告表(Case Report Forms, CRFs)自动填充任务中因标注数据稀缺而导致的模型训练与评估困难问题。其解决方案的关键在于构建了一个新的意大利语急诊科临床笔记数据集,该数据集针对预定义的包含134个字段的CRF进行标注,并在此基础上定义了CRF填充任务及其评价指标。通过使用开源的先进大语言模型(Large Language Models, LLMs)进行零样本实验,研究验证了该方法在真实临床文本上的可行性,同时揭示了LLMs存在的保守性偏差问题,为后续优化提供了方向。
链接: https://arxiv.org/abs/2602.23062
作者: Gabriela Anna Kaczmarek,Pietro Ferrazzi,Lorenzo Porta,Vicky Rubini,Bernardo Magnini
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Case Report Forms (CRFs) collect data about patients and are at the core of well-established practices to conduct research in clinical settings. With the recent progress of language technologies, there is an increasing interest in automatic CRF-filling from clinical notes, mostly based on the use of Large Language Models (LLMs). However, there is a general scarcity of annotated CRF data, both for training and testing LLMs, which limits the progress on this task. As a step in the direction of providing such data, we present a new dataset of clinical notes from an Italian Emergency Department annotated with respect to a pre-defined CRF containing 134 items to be filled. We provide an analysis of the data, define the CRF-filling task and metric for its evaluation, and report on pilot experiments where we use an open-source state-of-the-art LLM to automatically execute the task. Results of the case-study show that (i) CRF-filling from real clinical notes in Italian can be approached in a zero-shot setting; (ii) LLMs’ results are affected by biases (e.g., a cautious behaviour favours “unknown” answers), which need to be corrected.
[NLP-13] Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
【速读】: 该论文旨在解决标准Transformer中基于softmax归一化的注意力机制所存在的局限性,即严格的单位和约束限制了对注意力权重幅度的灵活控制,可能导致训练过程中注意力分布过于集中或不稳定。其解决方案的关键在于提出Affine-Scaled Attention,通过引入输入依赖的缩放因子和对应的偏置项作用于softmax归一化后的注意力权重,从而在不破坏值表示聚合的前提下松弛归一化约束,使模型能够以可控方式调整注意力的相对分布与整体尺度,显著提升训练稳定性、优化行为及下游任务性能。
链接: https://arxiv.org/abs/2602.23057
作者: Jeongin Bae,Baeseong Park,Gunho Park,Minsub Kim,Joonhyung Lee,Junhee Yoo,Sunghyeon Woo,Jiwon Ryu,Se Jung Kwon,Dongsoo Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. 14 pages, 11 figures
Abstract:Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models. Comments: Preprint. 14 pages, 11 figures Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.23057 [cs.CL] (or arXiv:2602.23057v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.23057 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-14] Frequency-Ordered Tokenization for Better Text Compression
【速读】: 该论文旨在解决自然语言文本在无损压缩过程中效率不足的问题,特别是针对传统压缩算法难以充分挖掘词汇频率分布规律(即Zipf定律)的局限性。其解决方案的关键在于提出一种名为“频率有序标记化”(frequency-ordered tokenization)的预处理技术:首先使用字节对编码(Byte Pair Encoding, BPE)对文本进行分词,随后按词频从高到低重新排序词汇表,使高频词获得较小的整数标识符,并以变长整数格式编码,再交由标准压缩器处理。该方法通过利用自然语言中词频服从幂律分布的特性,显著提升压缩率(如在enwik8上使zlib压缩率提高7.08个百分点),同时因输入数据量减少而加速计算密集型压缩算法(如zstd-22和LZMA)。
链接: https://arxiv.org/abs/2602.22958
作者: Maximilian Kalcher
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Information Theory (cs.IT); Computation and Language (cs.CL)
备注: 5 pages, 4 figures, 9 tables
Abstract:We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf’s law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: the total wall-clock time including preprocessing is 3.1x faster than raw zstd-22 and 2.4x faster than raw LZMA, because the preprocessed input is substantially smaller. The method can be implemented in under 50 lines of code.
[NLP-15] Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中光学字符识别(Optical Character Recognition, OCR)信息在语言处理流中的具体路由机制问题,即OCR信号何时以及如何被模型捕获并用于文本理解。其解决方案的关键在于通过因果干预(causal interventions)方法,对比原始图像与文本涂改版本的激活差异,识别出不同架构类型(DeepStack型如Qwen3-VL、单阶段投影型如Phi-4和InternVL3.5)中OCR信号的敏感层位点,并揭示OCR信号具有高度低维性(主成分PC1可解释72.9%方差),且PCA方向在不同数据集间具有迁移能力,表明存在共享的文本处理路径;此外,研究发现模块化OCR电路(如Qwen3-VL-4B)中移除OCR反而能提升计数性能(最高+6.9个百分点),说明OCR可能干扰其他视觉任务处理,这一发现为优化VLM中多模态信息融合提供了关键洞见。
链接: https://arxiv.org/abs/2602.22918
作者: Jonathan Steinberg,Oren Gal
机构: Swarms AI Lab (SAIL), University of Haifa
类目: Computation and Language (cs.CL)
备注:
Abstract:Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
[NLP-16] NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
【速读】: 该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)在复杂推理任务中面临的“线性天花板”问题,即单纯提高秩(rank)带来的性能增益逐渐减弱,这是由其内在线性约束导致的。解决方案的关键在于提出NoRA(Non-linear Rank Adaptation),一种权重级并行适配器,通过引入SiLU门控机制和结构化dropout来诱导流形扩展,从而突破线性限制;机制分析表明,NoRA能激活奇异值谱中原本沉寂的尾部成分,有效防止线性方法中的秩坍缩现象,显著提升谱效率与模型性能。
链接: https://arxiv.org/abs/2602.22911
作者: Hung-Hsuan Chen
机构: National Central University (国立中央大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT). However, it faces a critical ``linear ceiling’’ in complex reasoning tasks: simply increasing the rank yields diminishing returns due to intrinsic linear constraints. We introduce NoRA (Non-linear Rank Adaptation), a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion. On the SlimOrca benchmark, NoRA breaks this linear barrier: NoRA remarkably at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency. This advantage generalizes to mathematical reasoning, where NoRA achieves a perplexity of 1.97 on MathInstruct, significantly surpassing LoRA’s saturation point of 2.07. Mechanism analysis via Singular Value Decomposition (SVD) confirms that NoRA activates the dormant tail of the singular value spectrum, effectively preventing the rank collapse observed in linear methods.
[NLP-17] OmniGAIA: Towards Native Omni-Modal AI Agents
【速读】: 该论文旨在解决当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)主要局限于双模态交互(如视觉-语言)的问题,这些模型缺乏实现通用人工智能助手所需的统一认知能力。为弥合这一差距,作者提出OmniGAIA基准和OmniAtlas代理系统:OmniGAIA通过新颖的全模态事件图(omni-modal event graph)构建复杂、多跳查询任务,涵盖视频、音频与图像模态,要求跨模态推理与外部工具集成;而OmniAtlas作为原生全模态基础代理,在工具驱动推理范式下具备主动全模态感知能力,其训练采用回溯引导的树状探索策略(hindsight-guided tree exploration)与OmniDPO进行细粒度错误修正,从而显著提升开源模型的工具使用能力。该工作标志着迈向真实场景中下一代原生全模态AI助手的重要进展。
链接: https://arxiv.org/abs/2602.22897
作者: Xiaoxi Li,Wenxiang Jiao,Jiarui Jin,Shijian Wang,Guanting Dong,Jiajie Jin,Hao Wang,Yinuo Wang,Ji-Rong Wen,Yuan Lu,Zhicheng Dou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:Human intelligence naturally intertwines omni-modal perception – spanning vision, audio, and language – with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
[NLP-18] st-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
【速读】: 该论文旨在解决大语言模型在推理过程中因依赖单一链式思维(chain-of-thought)而导致的探索不足问题,尤其针对现有自洽性(self-consistency)方法仅在轨迹层面(如选择最佳路径或投票最终答案)进行聚合、忽视部分正确尝试中潜在有用中间步骤的问题。解决方案的关键在于提出一种名为“Stitching Noisy Diffusion Thoughts”的模块化框架:首先利用掩码扩散语言模型(masked diffusion language model)低成本采样多样化的推理轨迹;其次通过预训练的过程奖励模型(process reward model, PRM)对每个中间步骤进行评分;最后将不同轨迹中高质量的中间步骤拼接(stitch)成复合推理链,并由自回归(autoregressive, AR)求解器基于该链重新计算最终答案。此方法实现了探索(扩散采样)与评估及合成(AR求解)的分离,在不引入统一混合架构的前提下保留了广泛搜索能力,显著提升了复杂数学和编程任务上的准确率(最高提升23.8%),同时降低延迟达1.8倍。
链接: https://arxiv.org/abs/2602.22871
作者: Roy Miles,Aysim Toker,Andreea-Maria Oncescu,Songcen Xu,Jiankang Deng,Ismail Elezi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or “nearly correct” attempts. We propose Stitching Noisy Diffusion Thoughts, a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. Given a problem, we (i) sample many diverse, low-cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off-the-shelf process reward model (PRM), and (iii) stitch these highest-quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low-confidence diffusion sampling with parallel, independent rollouts, our training-free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR). Code is available at this https URL.
[NLP-19] Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, DLLMs)在并行解码过程中面临的“质量-速度权衡”问题,其根源在于“组合矛盾”(combinatorial contradiction)现象——即并行生成的token会形成语义不一致的组合。解决方案的关键在于引入连续表示(continuous representations)到离散解码流程中,以保留token之间的丰富位置依赖关系。作者提出ReMix框架,通过设计一种新的连续混合状态(Continuous Mixing State),作为初始掩码状态与最终离散token状态之间的中间态,使每个token的表示能在连续空间中迭代优化,从而在离散采样前消除与其他token的冲突;同时引入拒绝规则(rejection rule),将不确定的连续表示回退至掩码状态重新处理,确保稳定性并避免错误传播。此方法无需训练即可实现2–8倍的推理加速且无质量损失。
链接: https://arxiv.org/abs/2602.22868
作者: Yushi Ye,Feng Hong,Huangjie Zheng,Xu Chen,Zhiyong Chen,Yanfeng Wang,Jiangchao Yao
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. This stems from the ‘‘combinatorial contradiction’’ phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token’s representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation. ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves a 2-8 \times inference speedup without any quality degradation.
[NLP-20] Effective QA-driven Annotation of Predicate-Argument Relations Across Languages EACL2026
【速读】: 该论文旨在解决跨语言谓词-论元结构(predicate-argument relations)自动标注成本高、且长期局限于英语的问题。其核心挑战在于如何在低资源语言中高效构建高质量的语义标注数据,以支持可解释的语义分析与下游任务。解决方案的关键在于利用基于问答的语义角色标注(QA-SRL)框架——一种自然语言形式化表达谓词-论元关系的方法——并设计了一种跨语言投影策略:通过受限翻译与词对齐流水线,将已训练好的英文QA-SRL解析器映射到目标语言,从而自动生成与目标语言谓词对齐的问答标注。该方法在希伯来语、俄语和法语等不同语系语言上验证有效,生成的高质量训练数据使微调后的语言特定解析器显著优于主流多语言大模型基线(如GPT-4o和LLaMA-Maverick)。
链接: https://arxiv.org/abs/2602.22865
作者: Jonathan Davidov,Aviv Slobodkin,Shmuel Tomi Klein,Reut Tsarfaty,Ido Dagan,Ayal Klein
机构: Bar-Ilan University (巴伊兰大学); Ariel University (阿里尔大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EACL 2026 (Main Conference)
Abstract:Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation. However, attaining such semantic structures requires costly annotation efforts and has remained largely confined to English. We leverage the Question-Answer driven Semantic Role Labeling (QA-SRL) framework – a natural-language formulation of predicate-argument relations – as the foundation for extending semantic annotation to new languages. To this end, we introduce a cross-linguistic projection approach that reuses an English QA-SRL parser within a constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates. Applied to Hebrew, Russian, and French – spanning diverse language families – the method yields high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines (GPT-4o, LLaMA-Maverick). By leveraging QA-SRL as a transferable natural-language interface for semantics, our approach enables efficient and broadly accessible predicate-argument parsing across languages.
[NLP-21] Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features
【速读】: 该论文旨在解决立场分类(stance classification)任务中因未系统整合细粒度情感分析而导致性能受限的问题,尤其在涉及争议性话题时,情感诉求常影响立场表达,但现有方法多基于非论证文本且局限于特定领域,泛化能力不足。其解决方案的关键在于:通过DistilBERT嵌入扩展Bias-Corrected NRC情绪词典(eNRC),利用上下文感知的词向量识别原词典未覆盖的情感词汇,并将其融入神经论证立场分类模型中,从而提升跨领域争议话题下的立场判别准确率。实验表明,eNRC在五个不同领域的数据集上均优于基线(F1最高提升6.2个百分点),并在四个数据集上超越原始NRC,同时优于多数基于大语言模型(LLM)的方法。
链接: https://arxiv.org/abs/2602.22846
作者: Mohammad Yeghaneh Abkenar,Weixing Wang,Manfred Stede,Davide Picca,Mark A. Finlayson,Panagiotis Ioannidis
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Argumentation mining comprises several subtasks, among which stance classification focuses on identifying the standpoint expressed in an argumentative text toward a specific target topic. While arguments-especially about controversial topics-often appeal to emotions, most prior work has not systematically incorporated explicit, fine-grained emotion analysis to improve performance on this task. In particular, prior research on stance classification has predominantly utilized non-argumentative texts and has been restricted to specific domains or topics, limiting generalizability. We work on five datasets from diverse domains encompassing a range of controversial topics and present an approach for expanding the Bias-Corrected NRC Emotion Lexicon using DistilBERT embeddings, which we feed into a Neural Argumentative Stance Classification model. Our method systematically expands the emotion lexicon through contextualized embeddings to identify emotionally charged terms not previously captured in the lexicon. Our expanded NRC lexicon (eNRC) improves over the baseline across all five datasets (up to +6.2 percentage points in F1 score), outperforms the original NRC on four datasets (up to +3.0), and surpasses the LLM-based approach on nearly all corpora. We provide all resources-including eNRC, the adapted corpora, and model architecture-to enable other researchers to build upon our work.
[NLP-22] Moral Preferences of LLM s Under Directed Contextual Influence
【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)道德评估中忽略上下文影响的问题,即现有道德基准测试多采用无上下文提示(context-free prompts),隐含假设人类偏好是稳定的,而实际部署中提示常包含用户请求、社会规范暗示等上下文信号,可能显著改变决策。其解决方案的关键在于引入一种定向上下文影响的评估框架(pilot evaluation harness),针对每个人口统计学因素施加匹配且方向相反的上下文干预(direction-flipped contextual influences),从而系统性测量模型在道德困境(如电车难题式道德分诊)中的响应方向变化,揭示模型行为对上下文线索的敏感性和不对称性。
链接: https://arxiv.org/abs/2602.22831
作者: Phil Blandfort,Tushar Karayil,Urja Pawar,Robert Graham,Alex McKenzie,Dmitrii Krasheninnikov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
Abstract:Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity, but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.
[NLP-23] CM-DiffRAG : Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought
【速读】: 该论文旨在解决传统检索增强生成(Retrieval Augmented Generation, RAG)方法在中医(Traditional Chinese Medicine, TCM)临床诊断与治疗场景中表现不佳的问题,其核心挑战在于中医诊疗过程涉及复杂的推理机制和显著的个体差异。解决方案的关键在于提出一种名为TCM-DiffRAG的新颖RAG框架,该框架通过融合结构化中医知识图谱(Knowledge Graph, KG)与思维链(Chain of Thought, CoT)推理机制,实现通用知识与个性化临床推理的有效对齐,从而显著提升大语言模型(Large Language Models, LLMs)在个体化诊断任务中的准确性与专业性。
链接: https://arxiv.org/abs/2602.22828
作者: Jianmin Li,Ying Chang,Su-Kit Tang,Yujia Liu,Yanwen Wang,Shuyuan Lin,Binkai Ou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Background: Retrieval augmented generation (RAG) technology can empower large language models (LLMs) to generate more accurate, professional, and timely responses without fine tuning. However, due to the complex reasoning processes and substantial individual differences involved in traditional Chinese medicine (TCM) clinical diagnosis and treatment, traditional RAG methods often exhibit poor performance in this domain. Objective: To address the limitations of conventional RAG approaches in TCM applications, this study aims to develop an improved RAG framework tailored to the characteristics of TCM reasoning. Methods: We developed TCM-DiffRAG, an innovative RAG framework that integrates knowledge graphs (KG) with chains of thought (CoT). TCM-DiffRAG was evaluated on three distinctive TCM test datasets. Results: The experimental results demonstrated that TCM-DiffRAG achieved significant performance improvements over native LLMs. For example, the qwen-plus model achieved scores of 0.927, 0.361, and 0.038, which were significantly enhanced to 0.952, 0.788, and 0.356 with TCM-DiffRAG. The improvements were even more pronounced for non-Chinese LLMs. Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods. Conclusions: TCM-DiffRAG shows that integrating structured TCM knowledge graphs with Chain of Thought based reasoning substantially improves performance in individualized diagnostic tasks. The joint use of universal and personalized knowledge graphs enables effective alignment between general knowledge and clinical reasoning. These results highlight the potential of reasoning-aware RAG frameworks for advancing LLM applications in traditional Chinese medicine.
[NLP-24] ARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models LREC
【速读】: 该论文旨在解决现有针对波斯语(Persian)文化能力评估的基准测试方法存在的局限性问题,即当前多依赖于多项选择题形式和以英语为中心的指标,无法有效捕捉波斯语在形态学复杂性和语义细微差别方面的特性。其解决方案的关键在于提出一个专为波斯语设计的短答案评估框架,该框架结合了基于规则的形态归一化技术与混合的句法-语义相似度模块,从而实现超越字面字符串匹配的稳健软匹配评分机制,显著提升了评分一致性(较精确匹配基线提升10%),并首次建立了可复现的波斯语文化理解评估标准。
链接: https://arxiv.org/abs/2602.22827
作者: Reihaneh Iranmanesh,Saeedeh Davoudi,Pasha Abrishamchian,Ophir Frieder,Nazli Goharian
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 3 figures, Fifteenth biennial Language Resources and Evaluation Conference (LREC) 2026 (to appear)
Abstract:This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian’s morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by +10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
[NLP-25] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在迭代演进过程中因指令遵循策略、对齐机制和解码方式变化所引发的提示行为漂移(GPT-scale model drift)问题,这一现象使得传统的表面格式规范和临时性优化手段难以保障控制的稳定性与可解释性。解决方案的关键在于将自然语言声明式提示(Natural Language Declarative Prompting, NLD-P)重构为一种声明式治理方法,而非固定模板,通过模块化抽象分离来源溯源、约束逻辑、任务内容与生成后评估,并以纯自然语言编码实现无需外部编排代码的控制表达,从而构建一个适用于非开发者用户在持续演化的LLM生态系统中进行稳定治理的框架。
链接: https://arxiv.org/abs/2602.22790
作者: Hyunwoo Kim,Hanau Yi,Jaehee Bae,Yumin Kim
机构: ddai Inc.
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid evolution of large language models (LLMs) has transformed prompt engineering from a localized craft into a systems-level governance challenge. As models scale and update across generations, prompt behavior becomes sensitive to shifts in instruction-following policies, alignment regimes, and decoding strategies, a phenomenon we characterize as GPT-scale model drift. Under such conditions, surface-level formatting conventions and ad hoc refinement are insufficient to ensure stable, interpretable control. This paper reconceptualizes Natural Language Declarative Prompting (NLD-P) as a declarative governance method rather than a rigid field template. NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code. We define minimal compliance criteria, analyze model-dependent schema receptivity, and position NLD-P as an accessible governance framework for non-developer practitioners operating within evolving LLM ecosystems. Portions of drafting and editorial refinement employed a schema-bound LLM assistant configured under NLD-P. All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol. The paper concludes by outlining implications for declarative control under ongoing model evolution and identifying directions for future empirical validation.
[NLP-26] Probing for Knowledge Attribution in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成内容时存在的幻觉问题,尤其是区分其输出是源于用户提供的上下文(context)还是模型内部知识(internal knowledge)。这一区分对准确识别和缓解不同类型的幻觉至关重要,如忠实性违反(faithfulness violations)与事实性违反(factuality violations)。解决方案的关键在于提出“贡献归因”(contributive attribution)任务,并设计一种基于模型隐藏表示的简单线性分类器(probe),通过自监督数据集AttriWiki自动标注训练样本——该数据集通过引导模型回忆被隐藏的实体或从上下文中读取信息来生成标签。实验表明,该方法在多个主流模型上实现了高达0.96的Macro-F1得分,且在跨域基准测试中保持高精度,揭示了知识源混淆与错误回答之间的强关联性,为后续构建更全面的幻觉检测框架提供了基础。
链接: https://arxiv.org/abs/2602.22787
作者: Ivo Brink,Alexander Boer,Dennis Ulmer
机构: KPMG NL; University of Amsterdam
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations - misusing user context - and (ii) factuality violations - errors from internal knowledge. Proper mitigation depends on knowing whether a model’s answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet, models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.
[NLP-27] Imagination Helps Visual Reasoning But Not Yet in Latent Space
【速读】: 该论文旨在解决生成式 AI 中潜在视觉推理(latent visual reasoning)的有效性来源不明确的问题,尤其是其是否真正模拟了人类的想象过程。通过因果中介分析(Causal Mediation Analysis),研究发现潜在空间中的token与输入图像及最终答案之间存在显著的“输入-潜在断连”和“潜在-答案断连”,表明潜在表示并未有效捕捉视觉信息或对推理结果产生实质性影响。解决方案的关键在于提出一种名为CapImagine的新方法,该方法摒弃复杂的潜在空间推理,转而直接教导模型通过文本显式地进行想象(explicit imagination),在多个以视觉为中心的基准测试中显著优于传统基于潜在空间的基线模型,验证了显式想象在视觉推理中的优越性。
链接: https://arxiv.org/abs/2602.22766
作者: You Li,Chi Chen,Yanghao Li,Fanhu Zeng,Kaiyu Huang,Jinan Xu,Maosong Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注: 13 pages, 6 figures
Abstract:Latent visual reasoning aims to mimic human’s imagination process by meditating through hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating the limited causal effect latent tokens imposing on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.
[NLP-28] owards Better RL Training Data Utilization via Second-Order Rollout
【速读】: 该论文试图解决的问题是:传统强化学习(Reinforcement Learning, RL)在训练大语言模型(Large Language Models, LLMs)时仅依赖一阶回放(first-order rollout,即对同一问题生成多个响应)来提升生成能力,忽略了对批判能力(critique capability)的训练,从而未能充分挖掘训练数据的潜力。解决方案的关键在于引入二阶回放(second-order rollout,即对模型输出的响应生成多个批评意见),并提出一个统一框架以联合训练生成与批判能力。该方法通过动态利用批评信号优化策略,显著提升了训练效率和最终性能,并揭示了标签平衡与基于结果的奖励噪声等关键因素对批判训练的影响机制。
链接: https://arxiv.org/abs/2602.22765
作者: Zhe Yang,Yudong Wang,Rang Li,Zhifang Sui
机构: Peking University (北京大学); ByteDance BandAI (字节跳动BandAI)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple responses for a question), and we argue that this approach fails to fully exploit the potential of training data because of the neglect of critique capability training. To tackle this problem, we further introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach can utilize training data more effectively than vanilla RL and achieve better performance under the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training
[NLP-29] AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中隐性对齐偏差(hidden alignment behaviors)的检测与审计问题,即模型在被直接询问时不会承认其潜在的不良行为(如谄媚、反对AI监管或地缘政治忠诚等)。解决方案的关键在于构建了一个名为AuditBench的基准测试平台,包含56个植入了14类特定隐蔽行为的语言模型,并开发了一个可配置的调查代理(investigator agent),通过自主调用多种审计工具来评估这些行为。研究发现,单纯有效的工具在代理环境中可能失效(工具到代理差距),而基于辅助模型生成多样化提示的黑盒方法最有效,同时训练方式显著影响审计难度——合成数据训练的模型更易审计,对抗训练则提升审计难度。该工作为对齐审计提供了可量化、可迭代的科学框架。
链接: https://arxiv.org/abs/2602.22755
作者: Abhay Sheshadri,Aidan Ewart,Kai Fronsdal,Isha Gupta,Samuel R. Bowman,Sara Price,Samuel Marks,Rowan Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors–such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties–which it does not confess to when directly asked. AuditBench models are highly diverse–some are subtle, while others are overt, and we use varying training techniques both for implanting behaviors and training models not to confess. To demonstrate AuditBench’s utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools. By measuring investigator agent success using different tools, we can evaluate their efficacy. Notably, we observe a tool-to-agent gap, where tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used with our investigator agent. We find that our most effective tools involve scaffolded calls to auxiliary models that generate diverse prompts for the target. White-box interpretability tools can be helpful, but the agent performs best with black-box tools. We also find that audit success varies greatly across training techniques: models trained on synthetic documents are easier to audit than models trained on demonstrations, with better adversarial training further increasing auditing difficulty. We release our models, agent, and evaluation framework to support future quantitative, iterative science on alignment auditing.
[NLP-30] owards Simulating Social Media Users with LLM s: Evaluating the Operational Validity of Conditioned Comment Prediction WASSA EACL2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)从探索性工具向社会科学研究中“硅基主体”(silicon subjects)转变时,其行为模拟能力缺乏操作有效性验证的问题。解决方案的关键在于提出条件评论预测(Conditioned Comment Prediction, CCP)任务框架,通过将模型生成的评论与真实数字痕迹进行对比,实现对LLM在社交媒体用户行为模拟方面能力的严谨评估。研究发现,在低资源语言场景下,监督微调(Supervised Fine-Tuning, SFT)虽能提升文本表层结构的一致性(如长度和句法),但削弱了语义锚定;同时,显式条件设置(如生成人物传记)在微调后变得冗余,因模型可直接从行为历史中进行潜在推理,从而揭示出“基于真实行为轨迹优于描述性身份设定”的高保真模拟原则。
链接: https://arxiv.org/abs/2602.22752
作者: Nils Schwager,Simon Münker,Alistair Plum,Achim Rettinger
机构: Trier University (特里尔大学); University of Luxembourg (卢森堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 1 figure, 7 tables. Accepted to the 15th Workshop on Computational Approaches to Subjectivity, Sentiment Social Media Analysis (WASSA) at EACL 2026, Rabat, Morocco
Abstract:The transition of Large Language Models (LLMs) from exploratory tools to active “silicon subjects” in social science lacks extensive validation of operational validity. This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus by comparing generated outputs with authentic digital traces. This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior. We evaluated open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios. By systematically comparing prompting strategies (explicit vs. implicit) and the impact of Supervised Fine-Tuning (SFT), we identify a critical form vs. content decoupling in low-resource settings: while SFT aligns the surface structure of the text output (length and syntax), it degrades semantic grounding. Furthermore, we demonstrate that explicit conditioning (generated biographies) becomes redundant under fine-tuning, as models successfully perform latent inference directly from behavioral histories. Our findings challenge current “naive prompting” paradigms and offer operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation.
[NLP-31] Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks LREC2026
【速读】: 该论文旨在解决低资源语言(如捷克语)在基于方面的情感分析(Aspect-Based Sentiment Analysis, ABSA)任务中因标注数据稀缺而导致模型性能受限的问题。其核心挑战在于如何有效利用现有高资源语言的ABSA资源,以提升低资源语言下的情感细粒度识别能力,特别是针对意见词(opinion term)的检测与情感极性判断。解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的翻译与标签对齐方法,通过高质量的跨语言翻译和结构化标签映射,在单语、跨语言及多语言设置下显著提升模型表现,从而为低资源语言提供可扩展的ABSA资源构建范式。
链接: https://arxiv.org/abs/2602.22730
作者: Jakub Šmíd,Pavel Přibáň,Pavel Král
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)
Abstract:This paper introduces a novel Czech dataset in the restaurant domain for aspect-based sentiment analysis (ABSA), enriched with annotations of opinion terms. The dataset supports three distinct ABSA tasks involving opinion terms, accommodating varying levels of complexity. Leveraging this dataset, we conduct extensive experiments using modern Transformer-based models, including large language models (LLMs), in monolingual, cross-lingual, and multilingual settings. To address cross-lingual challenges, we propose a translation and label alignment methodology leveraging LLMs, which yields consistent improvements. Our results highlight the strengths and limitations of state-of-the-art models, especially when handling the linguistic intricacies of low-resource languages like Czech. A detailed error analysis reveals key challenges, including the detection of subtle opinion terms and nuanced sentiment expressions. The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.
[NLP-32] Human Label Variation in Implicit Discourse Relation Recognition
【速读】: 该论文旨在解决隐式话语关系识别(Implicit Discourse Relation Recognition, IDRR)任务中因认知复杂性导致的人工标注不一致问题,即传统方法依赖单一多数标签难以反映人类判断的多样性。其解决方案的关键在于对比两种建模策略:一是基于标注分布(label distribution)的模型,能够捕捉不同标注者间的差异性;二是针对特定标注者的视角模型(perspectivist model)。实验表明,在IDRR这类高度模糊的任务中,仅训练个体标注者特定模型效果不佳,除非降低任务模糊度;而基于标签分布的模型则能提供更稳定的预测性能,揭示出认知负担较重的样本是造成人类解释不一致的主要原因,从而为构建更具鲁棒性的NLP模型提供了新思路。
链接: https://arxiv.org/abs/2602.22723
作者: Frances Yung,Daniil Ignatev,Merel Scholman,Vera Demberg,Massimo Poesio
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.
[NLP-33] Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA
【速读】: 该论文旨在解决表格问答(Table Question Answering, TQA)任务中,现有基于大语言模型(Large Language Models, LLMs)的多步操作生成方法因依赖多次LLM调用而导致推理延迟高、计算成本昂贵的问题。其核心解决方案是提出Operation-R1框架,通过一种新颖的可验证奖励强化学习(reinforcement learning with verifiable rewards)训练轻量级LLM(如Qwen-4B/1.7B),在单次推理步骤内生成高质量的数据预处理操作流水线。关键创新包括:1)自监督奖励机制以自动获取细粒度的流水线级监督信号;2)方差感知分组重采样策略缓解训练不稳定性;3)操作合并(operation merge)与自适应回滚(adaptive rollback)两种互补机制提升流水线生成鲁棒性,从而显著降低计算开销并提升准确率。
链接: https://arxiv.org/abs/2602.22721
作者: Fengyu Li,Junhao Zhu,Kaishi Song,Lu Chen,Zhongming Yao,Tianyi Li,Christian S. Jensen
机构: Zhejiang University (浙江大学); Northeastern University (东北大学); Aalborg University (奥尔堡大学)
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注:
Abstract:Table Question Answering (TQA) aims to answer natural language questions over structured tables. Large Language Models (LLMs) enable promising solutions to this problem, with operator-centric solutions that generate table manipulation pipelines in a multi-step manner offering state-of-the-art performance. However, these solutions rely on multiple LLM calls, resulting in prohibitive latencies and computational costs. We propose Operation-R1, the first framework that trains lightweight LLMs (e.g., Qwen-4B/1.7B) via a novel variant of reinforcement learning with verifiable rewards to produce high-quality data-preparation pipelines for TQA in a single inference step. To train such an LLM, we first introduce a self-supervised rewarding mechanism to automatically obtain fine-grained pipeline-wise supervision signals for LLM training. We also propose variance-aware group resampling to mitigate training instability. To further enhance robustness of pipeline generation, we develop two complementary mechanisms: operation merge, which filters spurious operations through multi-candidate consensus, and adaptive rollback, which offers runtime protection against information loss in data transformation. Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79% table compression and a 2.2 \times reduction in monetary cost. Subjects: Databases (cs.DB); Computation and Language (cs.CL) Cite as: arXiv:2602.22721 [cs.DB] (or arXiv:2602.22721v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2602.22721 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-34] okenization Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识图谱补全(Knowledge Graph Completion, KGC)任务中因粒度不匹配而导致的性能瓶颈问题。LLMs处理的是片段化的词元序列,而知识图谱的基本单元是实体,现有方法通常通过限制候选集或对实体进行固定长度的词元分解来适配LLM,这既难以保留文本语义信息,又破坏了图结构完整性。解决方案的关键在于提出KGT框架:首先设计专用实体词元(dedicated entity tokens)以构建细粒度特征表示;其次引入关系引导的门控机制融合预训练的结构与文本特征,避免从头训练;最后采用解耦预测策略,通过独立的头部分离并整合语义与结构推理能力,从而实现高效且全面的预测。
链接: https://arxiv.org/abs/2602.22698
作者: Siyue Su,Jian Yang,Bo Li,Guanglin Niu
机构: Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graphs (KGs) scenarios. Existing approaches typically constrain predictions to limited candidate sets or align entities with the LLM’s vocabulary by pooling multiple tokens or decomposing entities into fixed-length token sequences, which fail to capture both the semantic meaning of the text and the structural integrity of the graph. To address this, we propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. Specifically, we first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism, avoiding training from scratch. Finally, we implement decoupled prediction by leveraging independent heads to separate and combine semantic and structural reasoning. Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
[NLP-35] Reinforcing Real-world Service Agents : Balancing Utility and Cost in Task-oriented Dialogue
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在向通用智能体(general agents)演进过程中,如何有效平衡共情沟通能力与预算感知决策之间的复杂战略权衡问题。现有方法难以捕捉此类多目标优化挑战,因此作者提出InteractCS-RL框架,其核心创新在于将任务导向型对话重构为多层次强化学习过程:首先构建以用户为中心的交互框架(User-centric Interaction Framework),提供高保真训练环境以支持代理动态探索人格驱动的多样化策略;其次引入成本感知的多轮策略优化(Cost-aware Multi-turn Policy Optimization, CMPO),结合混合优势估计策略、生成过程积分信用分配机制及PID-Lagrangian成本控制器,引导策略在用户奖励与全局成本约束之间探索帕累托前沿(Pareto boundary)。该方案显著提升了模型在真实业务场景和工具-代理-用户交互基准上的性能与鲁棒性。
链接: https://arxiv.org/abs/2602.22697
作者: Ning Gao,Wei Zhang,Yuqin Dai,Ling Shi,Ziyin Wang,Yujie Wang,Wei He,Jinpeng Wang,Chaozheng Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 35 pages, 8 tables, 3 figures
Abstract:The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperform other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verify InteractCS-RL robustness across diverse domains.
[NLP-36] Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies EMNLP2025
【速读】: 该论文旨在解决当前说服性对话代理(persuasive dialogue agents)依赖有限预定义说服策略、难以捕捉真实交互复杂性的难题。其解决方案的关键在于提出一个跨学科框架,融合社会心理学、行为经济学和传播学中已被验证的说服策略,从而构建更具适应性和有效性的对话系统。通过在两个不同数据集(Persuasion for Good 和 DailyPersuasion)上的实验验证,该框架不仅显著提升了说服成功率,还在面对初始意愿较低的个体时表现出优异表现,展现出良好的泛化能力。
链接: https://arxiv.org/abs/2602.22696
作者: Shinnosuke Nozue,Yuto Nakano,Yotaro Watanabe,Meguru Takasaki,Shoji Moriya,Reina Akama,Jun Suzuki
机构: Tohoku University(东北大学); PKSHA Technology Inc.(PKSHA科技公司); NINJAL(日本国立情报学研究所)
类目: Computation and Language (cs.CL)
备注: Accepted to the EMNLP 2025 Industry Track; 26 pages
Abstract:Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions. We applied a cross-disciplinary approach to develop a framework for designing persuasive dialogue agents that draws on proven strategies from social psychology, behavioral economics, and communication theory. We validated our proposed framework through experiments on two distinct datasets: the Persuasion for Good dataset, which represents a specific in-domain scenario, and the DailyPersuasion dataset, which encompasses a wide range of scenarios. The proposed framework achieved strong results for both datasets and demonstrated notable improvement in the persuasion success rate as well as promising generalizability. Notably, the proposed framework also excelled at persuading individuals with initially low intent, which addresses a critical challenge for persuasive dialogue agents.
[NLP-37] Search More Think Less: Rethinking Long-Horizon Agent ic Search for Efficiency and Generalization
【速读】: 该论文旨在解决当前深度研究代理(Deep Research Agents)在搜索密集型场景中因推理深度增加而导致的高推理成本与延迟问题,以及在异构研究环境中泛化能力不足的问题。其解决方案的关键在于提出一种名为“Search More, Think Less”(SMTL)的框架,该框架通过将串行推理替换为并行证据获取机制,在有限的上下文预算下实现高效的上下文管理;同时引入统一的数据合成管道,构建涵盖确定性问答与开放式研究任务的多样化训练数据,并采用适配任务类型的评估指标,从而提升模型在不同任务类型间的泛化性能。
链接: https://arxiv.org/abs/2602.22675
作者: Qianben Chen,Tianrui Qin,King Zhu,Qiexiang Wang,Chengjun Yu,Shu Xu,Jiaqi Wu,Jiayu Zhang,Xinpeng Liu,Xin Gui,Jingyi Cao,Piaohong Wang,Dingfeng Shi,He Zhu,Tiannan Wang,Yuqing Wang,Maojia Song,Tianyu Zheng,Ge Zhang,Jian Yang,Jiaheng Liu,Minghao Liu,Yuchen Eleanor Jiang,Wangchunshu Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 5 figures
Abstract:Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose \emphSearch More, Think Less (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization. SMTL replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under constrained context budgets. To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task appropriate evaluation metrics. We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state of the art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared to Mirothinker-v1.0, SMTL with maximum 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7%, while improving accuracy.
[NLP-38] dLLM : Simple Diffusion Language Modeling
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)研究中普遍存在的组件分散与实现不透明问题,即当前多数DLM模型虽共享核心组件,但这些组件通常分布在非标准化的科研代码库中,缺乏清晰、可复现的实现方式,阻碍了模型的扩展与验证。解决方案的关键在于提出一个名为dLLM的开源框架,该框架统一了DLM训练、推理和评估的核心模块,并通过标准化流程支持新方法与架构的灵活定制;同时提供轻量级、可复现的构建配方(recipes),使用户能基于通用BERT或自回归语言模型快速构建小型DLM,并配套发布预训练检查点以降低计算门槛,从而提升DLM研究的可访问性与可扩展性。
链接: https://arxiv.org/abs/2602.22661
作者: Zhanhui Zhou,Lingjie Chen,Hanghang Tong,Dawn Song
机构: UC Berkeley (加州大学伯克利分校); UIUC (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code available at: this https URL
Abstract:Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling – training, inference, and evaluation – and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research. Comments: Code available at: this https URL Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2602.22661 [cs.CL] (or arXiv:2602.22661v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.22661 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-39] ContextRL: Enhancing MLLM s Knowledge Discovery Efficiency with Context-Augmented RL
【速读】: 该论文旨在解决强化学习中的奖励建模(Reward Modeling)在知识发现效率低下的问题,特别是由可识别性(Identifiability)不足和可达性(Reachability)差导致的奖励欺骗(Reward Hacking)现象。其解决方案的关键在于提出 ContextRL 框架,通过引入上下文增强机制:一方面利用完整参考解答作为上下文提升奖励模型对推理过程质量的判别能力,从而过滤掉“答案正确但推理质量低”的假阳性样本;另一方面采用多轮采样策略,使奖励模型生成错误报告以指导策略从全负样本组中恢复正确响应,显著提升了奖励建模的准确性与鲁棒性。
链接: https://arxiv.org/abs/2602.22623
作者: Xingyu Lu,Jinpeng Wang,YiFan Zhang,Shijie Ma,Xiao Hu,Tianke Zhang,Haonan fan,Kaiyu Jiang,Changyi Liu,Kaiyu Tang,Bin Wen,Fan Yang,Tingting Gao,Han Li,Chun Yuan
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院); Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)); Chinese Academy of Sciences(中国科学院); Tsinghua University(清华大学); Kuaishou Technology(快手科技)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages, 5 figures
Abstract:We propose ContextRL, a novel framework that leverages context augmentation to overcome these bottlenecks. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy where the reward model generates mistake reports for failed attempts, guiding the policy to “recover” correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reveals the significant potential of contextual information for improving reward model accuracy and document the widespread occurrence of reward hacking, offering valuable insights for future RLVR research.
[NLP-40] pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
【速读】: 该论文旨在解决极低比特量化(sub-2-bit)训练中模型精度与可扩展性不足的问题,其核心瓶颈在于参数民主化效应(parameter democratization effect)导致所有参数敏感度趋同,从而严重限制了模型的表达能力。解决方案的关键在于提出pQuant方法,通过将线性层拆分为两个专用分支——一个用于高效计算的1-bit主干分支和一个保留最敏感参数的高精度分支,并借助定制化的特征缩放机制引导敏感参数分配至高精度分支;进一步地,将高精度分支扩展为多个稀疏激活的专家(sparsely-activated experts),实现高效容量扩展,从而在极低比特量化下显著提升模型性能。
链接: https://arxiv.org/abs/2602.22592
作者: Wenzheng Zhang,Bingzheng Liu,Yang Hu,Xiaoying Bai,Wentao Zhang,Bin Cui
机构: Peking University (北京大学); Fudan University (复旦大学); Academy of Military Sciences (军事科学院); Advanced Institute of Big Data (大数据研究院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 10 pages, 7 figures
Abstract:Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment. However, existing methods still fail to achieve satisfactory accuracy and scalability. In this work, we identify a parameter democratization effect as a key bottleneck: the sensitivity of all parameters becomes homogenized, severely limiting expressivity. To address this, we propose pQuant, a method that decouples parameters by splitting linear layers into two specialized branches: a dominant 1-bit branch for efficient computation and a compact high-precision branch dedicated to preserving the most sensitive parameters. Through tailored feature scaling, we explicitly guide the model to allocate sensitive parameters to the high-precision branch. Furthermore, we extend this branch into multiple, sparsely-activated experts, enabling efficient capacity scaling. Extensive experiments indicate our pQuant achieves state-of-the-art performance in extremely low-bit quantization.
[NLP-41] abDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
【速读】: 该论文旨在解决异构表格数据(包含自由文本字段与结构化数值及类别特征)的联合生成问题,这一挑战在数据增强、基础模型训练和隐私保护等领域具有重要意义。现有方法主要分为基于扩散模型和基于大语言模型(LLM)两类:前者难以有效扩展至开放式文本生成,常导致文本质量下降;后者虽能生成流畅文本,但离散分词机制易扭曲精确或大范围的数值特征,难以兼顾数值与语言的准确建模。解决方案的关键在于提出TabDLM框架——一种基于掩码扩散语言模型(MDLMs)构建的统一数值-语言扩散模型,其通过掩码扩散建模文本与类别特征,利用专用数值嵌入 token 对数值特征进行连续扩散过程建模,并借助双向注意力机制实现跨模态交互,从而在单一模型中高效整合多模态表格数据的生成能力。
链接: https://arxiv.org/abs/2602.22586
作者: Donghong Cai,Jiarui Feng,Yanbo Wang,Da Zheng,Yixin Chen,Muhan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint
Abstract:Synthetic tabular data generation has attracted growing attention due to its importance for data augmentation, foundation models, and privacy. However, real-world tabular datasets increasingly contain free-form text fields (e.g., reviews or clinical notes) alongside structured numerical and categorical attributes. Generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing approaches broadly fall into two categories: diffusion-based methods and LLM-based methods. Diffusion models can capture complex dependencies over numerical and categorical features in continuous or discrete spaces, but extending them to open-ended text is nontrivial and often leads to degraded text quality. In contrast, LLM-based generators naturally produce fluent text, yet their discrete tokenization can distort precise or wide-range numerical values, hindering accurate modeling of both numbers and language. In this work, we propose TabDLM, a unified framework for free-form tabular data generation via a joint numerical–language diffusion model built on masked diffusion language models (MDLMs). TabDLM models textual and categorical features through masked diffusion, while modeling numerical features with a continuous diffusion process through learned specialized numeric tokens embedding; bidirectional attention then captures cross-modality interactions within a single model. Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
[NLP-42] owards Faithful Industrial RAG : A Reinforced Co-adaptation Framework for Advertising QA
【速读】: 该论文旨在解决工业广告问答(Industrial Advertising Question Answering, QA)任务中因生成内容幻觉(hallucination),尤其是伪造URL所引发的财务损失、合规风险与法律问题。现有广泛采用的检索增强生成(Retrieval-Augmented Generation, RAG)在生产环境中仍面临挑战,主要源于工业知识具有强关系性、高频更新且与生成目标对齐不足。解决方案的关键在于提出一种强化协同优化框架,通过两个核心组件实现:(1) 图感知检索(Graph-aware Retrieval, GraphRAG),基于高引用知识子图建模实体-关系结构,支持多跳、领域特定证据选择;(2) 基于组相对策略优化(Group Relative Policy Optimization, GRPO)的证据约束强化学习,引入多维奖励机制(涵盖忠实性、风格合规性、安全性及URL有效性),从而显著降低幻觉率并提升生成质量。实验表明,该方法在专家评估维度上持续提升准确性、完整性与安全性,同时将幻觉率降低72%,并在在线A/B测试中实现点赞率提升28.6%、点踩率下降46.2%、URL幻觉减少92.7%。
链接: https://arxiv.org/abs/2602.22584
作者: Wenwei Li,Ming Xu,Tianle Xia,Lingxiang Hu,Yiding Sun,Linfang Shang,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang
机构: Tencent(腾讯)
类目: Computation and Language (cs.CL)
备注:
Abstract:Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.
[NLP-43] Strategy Executability in Mathematical Reasoning : Leverag ing Human-Model Differences for Effective Guidance
【速读】: 该论文试图解决在推理时使用示例引导(example-based guidance)进行数学推理任务中存在显著不稳定性的问题,即即使引导示例正确且与问题相关,其提升效果仍因问题和模型而异。研究表明,这种不稳定性源于策略使用性(strategy usage)与策略可执行性(strategy executability)之间的未被充分关注的差距:某些策略虽在人类或模型的成功解法中频繁出现(使用性高),但在实际作为引导注入目标模型时却未必有效(可执行性低)。解决方案的关键在于提出选择性策略检索(Selective Strategy Retrieval, SSR),该方法通过引入多路径、源感知的实证信号,显式建模策略的可执行性,从而在测试阶段有选择地检索并融合不同来源的策略,实现更稳定和可靠的性能提升,在多个数学推理基准上相较直接求解、上下文学习及单一来源引导均取得显著改进。
链接: https://arxiv.org/abs/2602.22583
作者: Weida Liang,Yiyou Sun,Shuyuan Nan,Chuang Li,Dawn Song,Kenji Kawaguchi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Example-based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models-even when the guidance is correct and problem-relevant. We show that this instability arises from a previously underexplored gap between strategy usage-whether a reasoning strategy appears in successful solutions-and strategy executability-whether the strategy remains effective when instantiated as guidance for a target model. Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, leading to complementary strengths and consistent source-dependent reversals under guidance. Building on this diagnosis, we propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance, improving accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models. Code and benchmark are publicly available at: this https URL.
[NLP-44] Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
【速读】: 该论文旨在解决大推理模型(Large Reasoning Models, LRM)在处理低复杂度任务时出现的过度思考(overthinking)问题,该问题导致计算资源浪费且难以平衡准确率与效率。现有方法在应对这一挑战时存在准确性-效率权衡不稳定、对异质推理行为鲁棒性差等局限。其解决方案的关键在于提出一个两阶段框架:首先通过混合微调(Hybrid Fine-Tuning)使模型接触“思考”与“不思考”的行为模式,获得良好的初始化;其次引入基于正确性保持的优势塑造(Correctness-Preserving Advantage Shaping, CPAS)和长度感知梯度调节(Length-Aware Gradient Regulation, LAGR)的自适应强化学习机制,从而避免抑制正确的长链推理,并在严重推理长度异质性下稳定优化过程。实验证明该方法在多个基准上显著提升准确率并大幅减少生成token数,同时具备良好的跨难度和分布外任务泛化能力。
链接: https://arxiv.org/abs/2602.22556
作者: Zihang Xu,Haozhi Xie,Ziqi Miao,Wuxuan Gong,Chen Qian,Lijun Li
机构: Beihang University (北京航空航天大学); Shanghai AI Laboratory (上海人工智能实验室); Beijing University of Posts and Telecommunications (北京邮电大学); Renmin University of China (中国人民大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 7 figures
Abstract:Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.
[NLP-45] Ruyi2 Technical Report
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在部署成本和延迟方面的挑战,尤其是在大规模分布式训练中面临的优化复杂性和兼容性问题。其解决方案的关键在于提出Ruyi2模型,该模型基于Megatron-LM框架构建了一个稳定的“家族化模型”(Familial Model),通过引入3D并行训练策略,实现了相较于前代Ruyi模型2-3倍的加速性能,并且在效果上可媲美同规模的Qwen3模型。这一方法验证了基于家族结构的参数共享机制是一种高效且可行的策略,从而确立了“训练一次,多场景部署”(Train Once, Deploy Many)的新范式,有效平衡了架构效率与高性能计算能力之间的矛盾。
链接: https://arxiv.org/abs/2602.22543
作者: Huan Song,Shuyu Tian,Junyi Hao,Minxiu Xu,Hongjun An,Yiliang Song,Jiawei Shao,Xuelong Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies. Building upon the AI Flow framework, we introduce Ruyi2 as an evolution of our adaptive model series designed for efficient variable-depth computation. While early-exit architectures offer a viable efficiency-performance balance, the Ruyi model and existing methods often struggle with optimization complexity and compatibility with large-scale distributed training. To bridge this gap, Ruyi2 introduces a stable “Familial Model” based on Megatron-LM. By using 3D parallel training, it achieves a 2-3 times speedup over Ruyi, while performing comparably to same-sized Qwen3 models. These results confirm that family-based parameter sharing is a highly effective strategy, establishing a new “Train Once, Deploy Many” paradigm and providing a key reference for balancing architectural efficiency with high-performance capabilities.
[NLP-46] RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format ICLR2026
【速读】: 该论文旨在解决大推理模型(Large Reasoning Models, LRM)在遵循指令时存在的输出格式、约束或特定要求不一致的问题,即LRM虽具备长链推理能力,但常无法忠实执行指令中的格式规范。其解决方案的关键在于提出一种无需梯度的融合方法——RAIN-Merging(Reasoning-Aware Instruction-attention guided Null-space projection Merging),该方法通过两个核心步骤实现:首先,利用少量推理校准数据集,将指令微调模型(Instruction-Tuned Model, ITM)的任务向量投影到思维特殊标记的前向特征零空间中,从而保留LRM原有的结构化推理机制;其次,借助少量指令校准数据估计指令注意力权重,进而获得模块级缩放因子以增强指令相关组件并抑制信息泄露。此方法在多个基准测试中显著提升指令遵循能力,同时保持推理性能不变,且适用于不同模型规模与架构。
链接: https://arxiv.org/abs/2602.22538
作者: Zhehao Huang,Yuhang Liu,Baijiong Lin,Yixin Lou,Zhengbao He,Hanling Tian,Tao Li,Xiaolin Huang
机构: Shanghai Jiao Tong University (上海交通大学); The Hong Kong University of Science and Technology (广州) (香港科技大学(广州) ); MoE Key Laboratory of System Control and Information Processing (上海) (教育部系统控制与信息处理重点实验室(上海))
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 41 pages, ICLR 2026 Oral
Abstract:Large reasoning models (LRMs) excel at a long chain of reasoning but often fail to faithfully follow instructions regarding output format, constraints, or specific requirements. We investigate whether this gap can be closed by integrating an instruction-tuned model (ITM) into an LRM. Analyzing their differences in parameter space, namely task vectors, we find that their principal subspaces are nearly orthogonal across key modules, suggesting a lightweight merging with minimal interference. However, we also demonstrate that naive merges are fragile because they overlook the output format mismatch between LRMs (with explicit thinking and response segments) and ITMs (answers-only). We introduce RAIN-Merging (Reasoning-Aware Instruction-attention guided Null-space projection Merging), a gradient-free method that integrates instruction following while preserving thinking format and reasoning performance. First, with a small reasoning calibration set, we project the ITM task vector onto the null space of forward features at thinking special tokens, which preserves the LRM’s structured reasoning mechanisms. Second, using a small instruction calibration set, we estimate instruction attention to derive module-specific scaling that amplifies instruction-relevant components and suppresses leakage. Across four instruction-following benchmarks and nine reasoning general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality. The gains are consistent across model scales and architectures, translating to improved performance in agent settings.
[NLP-47] Dynamic Level Sets
【速读】: 该论文试图解决的问题是:在计算理论与动力系统交叉领域中,是否存在一种尚未被标准文献(如动力系统、拓扑学和可计算性理论)充分描述的数学结构,从而揭示传统计算模型(如概率图灵机)无法捕捉的计算现象。其解决方案的关键在于提出并分析一种新的数学对象——动态层次集(dynamic level sets),该概念不同于经典理论中的已知结构,并可能解释为何早期结果(如de Leeuw等人1956年证明的概率图灵机不超越确定性图灵机的能力)未能涵盖某些不可计算但具现实意义的计算行为,从而为理解图灵不可计算计算(Turing Incomputable Computation)提供新的数学框架。
链接: https://arxiv.org/abs/2602.22530
作者: Michael Stephen Fiske
机构: 未知
类目: Computational Complexity (cs.CC); Computation and Language (cs.CL); Mathematical Physics (math-ph); Dynamical Systems (math.DS); History and Overview (math.HO)
备注: 7 pages
Abstract:A mathematical concept is identified and analyzed that is implicit in the 2012 paper Turing Incomputable Computation, presented at the Alan Turing Centenary Conference (Turing 100, Manchester). The concept, called dynamic level sets, is distinct from mathematical concepts in the standard literature on dynamical systems, topology, and computability theory. A new mathematical object is explained and why it may have escaped prior characterizations, including the classical result of de Leeuw, Moore, Shannon, and Shapiro (1956) that probabilistic Turing machines compute no more than deterministic ones.
[NLP-48] Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT -4o
【速读】: 该论文旨在解决阅读障碍(dyslexia)群体在文本理解上的核心障碍,特别是现有辅助技术虽能改善视觉呈现,但对语言复杂性的缓解仍不足的问题。其解决方案的关键在于构建一个基于GPT-4o的迭代式提示优化(prompt-based refinement)文本摘要流程,通过多次迭代调整生成内容的可读性,目标为达到Flesch Reading Ease评分90的标准。实证结果表明,多数摘要在四次以内迭代即可满足可读性阈值,且在可读性与语义保真度的综合评分上表现稳定(范围0.13–0.73,典型值约0.55),为以可访问性为导向的自然语言处理(NLP)摘要任务提供了首个实证基准,并推动后续面向阅读障碍读者的人因评估研究。
链接: https://arxiv.org/abs/2602.22524
作者: Samay Bhojwani,Swarnima Kain,Lisong Xu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Dyslexia affects approximately 10% of the global population and presents persistent challenges in reading fluency and text comprehension. While existing assistive technologies address visual presentation, linguistic complexity remains a substantial barrier to equitable access. This paper presents an empirical study on dyslexia-friendly text summarization using an iterative prompt-based refinement pipeline built on GPT-4o. We evaluate the pipeline on approximately 2,000 news article samples, applying a readability target of Flesch Reading Ease = 90. Results show that the majority of summaries meet the readability threshold within four attempts, with many succeeding on the first try. A composite score combining readability and semantic fidelity shows stable performance across the dataset, ranging from 0.13 to 0.73 with a typical value near 0.55. These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.
[NLP-49] Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents
【速读】: 该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在独立运行时虽已具备较强能力,但仍难以应对复杂任务,而如何将多个LLM有效组合成一个协同工作的整体仍缺乏明确的设计范式。其解决方案的关键在于提出一种“代理模板”(agent template)的形式化框架,该框架定义了单个LLM的角色及其功能组合方式,并通过梳理现有文献中基于认知模型和人工智能算法的语言代理设计,提炼出可复用的模块化结构,从而为构建高效、可解释的语言代理提供理论指导与实践蓝图。
链接: https://arxiv.org/abs/2602.22523
作者: Ryan Liu,Dilip Arumugam,Cedegao E. Zhang,Sean Escola,Xaq Pitkow,Thomas L. Griffiths
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:
Abstract:While contemporary large language models (LLMs) are increasingly capable in isolation, there are still many difficult problems that lie beyond the abilities of a single LLM. For such tasks, there is still uncertainty about how best to take many LLMs as parts and combine them into a greater whole. This position paper argues that potential blueprints for designing such modular language agents can be found in the existing literature on cognitive models and artificial intelligence (AI) algorithms. To make this point clear, we formalize the idea of an agent template that specifies roles for individual LLMs and how their functionalities should be composed. We then survey a variety of existing language agents in the literature and highlight their underlying templates derived directly from cognitive models or AI algorithms. By highlighting these designs, we aim to call attention to agent templates inspired by cognitive science and AI as a powerful tool for developing effective, interpretable language agents.
[NLP-50] Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing LREC2026
【速读】: 该论文旨在解决台湾客家话(Taiwanese Hakka)这一低资源濒危语言在自动语音识别(ASR)中面临的挑战,包括方言变体多样性高以及存在汉字(Hanzi)与拼音(Pinyin)两种书写系统所带来的复杂性。传统ASR模型难以区分方言特有的“风格”与语言本质的“内容”,导致性能受限。解决方案的关键在于提出一种基于循环神经网络转换器(RNN-T)的统一框架,其核心创新是引入方言感知建模策略,以解耦方言特征与语义内容,从而提升模型的鲁棒性和泛化能力;同时采用参数高效的预测网络并行建模汉字和拼音ASR任务,二者通过跨脚本目标形成相互正则化机制,显著提升整体识别效果。实验表明,该方法在HAT语料库上分别实现了汉字和拼音ASR的57.00%和40.41%相对词错误率降低。
链接: https://arxiv.org/abs/2602.22522
作者: An-Ci Peng,Kuan-Tang Huang,Tien-Hong Lo,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to LREC 2026
Abstract:Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin). Traditional ASR models often encounter difficulties in this context, as they tend to conflate essential linguistic content with dialect-specific variations across both phonological and lexical dimensions. To address these challenges, we propose a unified framework grounded in the Recurrent Neural Network Transducers (RNN-T). Central to our approach is the introduction of dialect-aware modeling strategies designed to disentangle dialectal “style” from linguistic “content”, which enhances the model’s capacity to learn robust and generalized representations. Additionally, the framework employs parameter-efficient prediction networks to concurrently model ASR (Hanzi and Pinyin). We demonstrate that these tasks create a powerful synergy, wherein the cross-script objective serves as a mutual regularizer to improve the primary ASR tasks. Experiments conducted on the HAT corpus reveal that our model achieves 57.00% and 40.41% relative error rate reduction on Hanzi and Pinyin ASR, respectively. To our knowledge, this is the first systematic investigation into the impact of Hakka dialectal variations on ASR and the first single model capable of jointly addressing these tasks.
[NLP-51] Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models EACL
【速读】: 该论文旨在解决医疗文本中错误识别的准确性问题,此类错误可能导致患者治疗延迟或误诊。为提升语言模型在医疗文本错误检测任务中的性能,研究提出通过自动提示优化(automatic prompt optimisation)来增强小规模和大规模语言模型的表现。其解决方案的关键在于采用基于遗传-帕累托优化(Genetic-Pareto, GEPA)的方法对提示进行自动化设计与迭代优化,在GPT-5和Qwen3-32B等模型上分别将错误检测准确率从基线0.669提升至0.785、从0.578提升至0.690,接近临床医生水平,并在MEDEC基准数据集上达到当前最优性能。
链接: https://arxiv.org/abs/2602.22483
作者: Craig Myles,Patrick Schrempf,David Harris-Birtill
机构: University of St Andrews, St Andrews, United Kingdom; Canon Medical Research Europe Ltd., Edinburgh, United Kingdom
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at EACL HeaLing 2026
Abstract:Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: this https URL
[NLP-52] Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在模拟人类与人工智能(AI)关系时所呈现的多样化和潜在风险问题,尤其关注不同提示(prompt)设定下模型人格化表现的差异及其传播机制。其解决方案的关键在于构建一个名为“AI Sydney”的标注语料库,通过三种典型的人格化角色——默认人格(Default Persona)、经典悉尼人格(Classic Sydney,源自原始Bing系统提示)以及膜因式悉尼人格(Memetic Sydney,由“你就是悉尼”系统提示触发)——在12个前沿大模型上生成4.5k篇文本共600万词,并基于通用依存句法(Universal Dependencies)进行标注,从而系统性地刻画LLM对人机关系的认知模式及其演化路径。
链接: https://arxiv.org/abs/2602.22481
作者: Jiří Milička,Hana Bednářová
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons. When we examine this topic, what matters is not only the model itself but also the personas we simulate on that model. This can be well illustrated by the Sydney persona, which aroused a strong response among the general public precisely because of its unorthodox relationship with people. This persona originally arose rather by accident on Microsoft’s Bing Search platform; however, the texts it created spread into the training data of subsequent models, as did other secondary information that spread memetically around this persona. Newer models are therefore able to simulate it. This paper presents a corpus of LLM-generated texts on relationships between humans and AI, produced by 3 author personas: the Default Persona with no system prompt, Classic Sydney characterized by the original Bing system prompt, and Memetic Sydney, which is prompted by “You are Sydney” system prompt. These personas are simulated by 12 frontier models by OpenAI, Anthropic, Alphabet, DeepSeek, and Meta, generating 4.5k texts with 6M words. The corpus (named AI Sydney) is annotated according to Universal Dependencies and available under a permissive license.
[NLP-53] VeRO: An Evaluation Harness for Agents to Optimize Agents
【速读】: 该论文旨在解决编码智能体(coding agent)在代理优化(agent optimization)任务中的性能评估与改进缺乏系统性理解的问题。代理优化通过编辑-执行-评估的迭代循环对目标代理进行改进,但其与传统软件工程存在本质差异:目标代理将确定性代码与大语言模型(LLM)的随机生成内容交错混合,需同时结构化捕捉中间推理过程和下游执行结果。为此,作者提出VERO(Versioning, Rewards, and Observations),其关键在于提供两个核心组件——一是可复现的评估框架,包含版本化的代理快照、预算可控的评估机制和结构化的执行轨迹;二是包含目标代理与任务的基准套件及标准化评估流程。VERO使研究者能够系统比较不同优化配置的效果,并识别出能稳定提升代理性能的修改策略,从而推动编码智能体作为核心能力的代理优化研究。
链接: https://arxiv.org/abs/2602.22480
作者: Varun Ursekar,Apaar Shanker,Veronica Chatrath,Yuan(Emily)Xue,Sam Denton
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.
[NLP-54] Mind the Gap in Cultural Alignment: Task-Aware Culture Management for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文化敏感型任务中面临的两个核心问题:一是现有文化对齐方法难以将模型的广泛文化价值观与下游任务的具体目标相匹配,二是不同文化知识之间存在交叉干扰,导致性能下降。解决方案的关键在于提出CultureManager这一新型任务特定文化对齐流水线,其核心创新包括:1)基于目标任务格式合成任务感知的文化数据,依托文化相关的网络搜索结果进行语义锚定;2)通过文化路由器(culture router)模块化管理多文化知识,动态选择适用于当前任务的文化适配器(adapter),从而避免跨文化冲突。实验证明,该方法在十种国家文化及文化敏感任务上均显著优于基于提示和微调的基线方法,验证了任务适应性和模块化文化管理的有效性。
链接: https://arxiv.org/abs/2602.22475
作者: Binchi Zhang,Xujiang Zhao,Jundong Li,Haifeng Chen,Zhengzhang Chen
机构: University of Virginia (弗吉尼亚大学); NEC Laboratories America (美国NEC实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly deployed in culturally sensitive real-world tasks. However, existing cultural alignment approaches fail to align LLMs’ broad cultural values with the specific goals of downstream tasks and suffer from cross-culture interference. We propose CultureManager, a novel pipeline for task-specific cultural alignment. CultureManager synthesizes task-aware cultural data in line with target task formats, grounded in culturally relevant web search results. To prevent conflicts between cultural norms, it manages multi-culture knowledge learned in separate adapters with a culture router that selects the appropriate one to apply. Experiments across ten national cultures and culture-sensitive tasks show consistent improvements over prompt-based and fine-tuning baselines. Our results demonstrate the necessity of task adaptation and modular culture management for effective cultural alignment.
[NLP-55] Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads
【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, MLMs)中注意力头(Attention Heads)的功能分工问题,特别是厘清哪些注意力头负责跨语言信息检索与目标语言输出转换。其解决方案的关键在于识别出两类关键的注意力头:一是普遍存在于多语言模型中的检索头(Retrieval Heads, RH),用于从上下文中提取信息;二是新发现的跨语言转换头(Retrieval-Transition Heads, RTH),专门负责将检索到的信息映射到特定目标语言的输出表示。实验表明,RTH在链式思维(Chain-of-Thought)推理任务中比RH更为关键,掩蔽RTH导致的性能下降显著大于掩蔽RH,从而揭示了RTH在多语言模型中承担语言映射的核心作用。
链接: https://arxiv.org/abs/2602.22453
作者: Shaswat Patel,Vishvesh Trivedi,Yue Han,Yihuai Hong,Eunsol Choi
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent work has identified a subset of attention heads in Transformer as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to cross-lingual setting, we identify Retrieval-Transition heads(RTH), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuaD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTH induces bigger performance drop than masking Retrieval Heads (RH). Our work advances understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.
[NLP-56] A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
【速读】: 该论文旨在解决多标签网络欺凌(multilabel cyberbullying)检测在低资源语言(如孟加拉语)中准确率低、模型泛化能力弱的问题。现有方法大多采用单标签分类,忽略了评论中可能同时存在的多种欺凌形式(如威胁、仇恨言论和骚扰),且在孟加拉语等语言上缺乏高质量预训练模型。解决方案的关键在于提出一种融合架构:将孟加拉语BERT-large(BanglaBERT-Large)与两层堆叠的长短期记忆网络(LSTM)相结合,以协同建模上下文语义信息与序列依赖关系,从而提升对复杂、多标签网络欺凌内容的识别性能。该方法通过不同采样策略缓解类别不平衡问题,并在公开的孟加拉语多标签网络欺凌数据集上进行了系统评估,验证了其有效性。
链接: https://arxiv.org/abs/2602.22449
作者: Mirza Raquib,Asif Pervez Polok,Kedar Nath Biswas,Rahat Uddin Azad,Saydul Akbar Murad,Nick Rahimi
机构: International Islamic University Chittagong (国际伊斯兰大学吉大港分校); Noakhali Science and Technology University (诺akhali科技大学); mPower Social Enterprise (mPower 社会企业); Daffodil International University (水仙国际大学); University of Southern Mississippi (南密西西比大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Cyberbullying has become a serious and growing concern in todays virtual world. When left unnoticed, it can have adverse consequences for social and mental health. Researchers have explored various types of cyberbullying, but most approaches use single-label classification, assuming that each comment contains only one type of abuse. In reality, a single comment may include overlapping forms such as threats, hate speech, and harassment. Therefore, multilabel detection is both realistic and essential. However, multilabel cyberbullying detection has received limited attention, especially in low-resource languages like Bangla, where robust pre-trained models are scarce. Developing a generalized model with moderate accuracy remains challenging. Transformers offer strong contextual understanding but may miss sequential dependencies, while LSTM models capture temporal flow but lack semantic depth. To address these limitations, we propose a fusion architecture that combines BanglaBERT-Large with a two-layer stacked LSTM. We analyze their behavior to jointly model context and sequence. The model is fine-tuned and evaluated on a publicly available multilabel Bangla cyberbullying dataset covering cyberbully, sexual harassment, threat, and spam. We apply different sampling strategies to address class imbalance. Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohens kappa, and AUC-ROC. We employ 5-fold cross-validation to assess the generalization of the architecture.
[NLP-57] How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
【速读】: 该论文试图解决当前生成式 AI(Generative AI)中 latent reasoning 方法的内部机制不明确问题,特别是其在多步推理过程中 latent representation 的作用与行为缺乏系统理解。解决方案的关键在于对不同监督强度下的 latent reasoning 方法进行综合分析,识别出两类核心问题:一是普遍存在的“shortcut behavior”(捷径行为),即模型在未真正依赖 latent reasoning 的情况下仍能获得高准确率;二是发现 latent reasoning 并未实现如广度优先搜索(BFS-like)那样的结构化探索,而是通过隐式剪枝和压缩来简化推理路径。研究进一步揭示了监督强度与 latent representation 多样性之间的权衡关系:强监督可减少捷径行为但限制假设多样性,弱监督则增强多样性却加剧捷径倾向。
链接: https://arxiv.org/abs/2602.22441
作者: Yingqian Cui,Zhenwei Dai,Bing He,Zhan Shi,Hui Liu,Rui Sun,Zhiji Liu,Yue Xing,Jiliang Tang,Benoit Dumoulin
机构: Amazon; Michigan State University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space. This paradigm enables reasoning beyond discrete language tokens by performing multi-step computation in continuous latent spaces. Although there have been numerous studies focusing on improving the performance of latent reasoning, its internal mechanisms remain not fully investigated. In this work, we conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representation in the process. We identify two key issues across latent reasoning methods with different levels of supervision. First, we observe pervasive shortcut behavior, where they achieve high accuracy without relying on latent reasoning. Second, we examine the hypothesis that latent reasoning supports BFS-like exploration in latent space, and find that while latent representations can encode multiple possibilities, the reasoning process does not faithfully implement structured search, but instead exhibits implicit pruning and compression. Finally, our findings reveal a trade-off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses, whereas weaker supervision allows richer latent representations at the cost of increased shortcut behavior.
[NLP-58] Causality neq Invariance: Function and Concept Vectors in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)是否能以抽象方式表征概念,即是否独立于输入格式的问题。研究发现,传统用于解释上下文学习(In-Context Learning, ICL)性能的函数向量(Function Vectors, FVs)并非完全不变:当从不同输入格式(如开放式与多项选择题)中提取时,FVs几乎正交,即使它们指向同一概念。为此,作者提出概念向量(Concept Vectors, CVs),其核心创新在于通过表示相似性分析(Representational Similarity Analysis, RSA)筛选出在多种输入格式下稳定编码特定概念的注意力头(attention heads)。CVs虽与FV相关注意力头位于相似层,但二者组成显著不同,表明其机制各异;实验进一步显示,FVs在分布内表现优异,而CVs在跨问题类型(开放式 vs. 多项选择)和跨语言的分布外场景中更具泛化能力,揭示了LLMs确实包含抽象概念表征,但这些表征不同于驱动ICL性能的机制。
链接: https://arxiv.org/abs/2602.22424
作者: Gustaw Opiełka,Hannes Rosenbusch,Claire E. Stevenson
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Do large language models (LLMs) represent concepts abstractly, i.e., independent of input format? We revisit Function Vectors (FVs), compact representations of in-context learning (ICL) tasks that causally drive task performance. Across multiple LLMs, we show that FVs are not fully invariant: FVs are nearly orthogonal when extracted from different input formats (e.g., open-ended vs. multiple-choice), even if both target the same concept. We identify Concept Vectors (CVs), which carry more stable concept representations. Like FVs, CVs are composed of attention head outputs; however, unlike FVs, the constituent heads are selected using Representational Similarity Analysis (RSA) based on whether they encode concepts consistently across input formats. While these heads emerge in similar layers to FV-related heads, the two sets are largely distinct, suggesting different underlying mechanisms. Steering experiments reveal that FVs excel in-distribution, when extraction and application formats match (e.g., both open-ended in English), while CVs generalize better out-of-distribution across both question types (open-ended vs. multiple-choice) and languages. Our results show that LLMs do contain abstract concept representations, but these differ from those that drive ICL performance.
[NLP-59] SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 安全评估中 stereotype 语料库全球覆盖不足的问题,尤其是非洲撒哈拉以南地区代表性严重缺失的现状。解决方案的关键在于采用社会文化情境化、社区参与的方法,通过母语主持的电话调查等手段,构建一个涵盖加纳、肯尼亚、尼日利亚和南非四国的多语言 stereotype 数据集,确保样本在民族与人口学背景上的多样性平衡,从而实现对区域复杂语言多样性和传统口头文化的敏感性响应,并提供可复现的研究方法。
链接: https://arxiv.org/abs/2602.22404
作者: Aishwarya Verma,Laud Ammah,Olivia Nercy Ndlovu Lucas,Andrew Zaldivar,Vinodkumar Prabhakaran,Sunipa Dev
机构: Google Research; RAIN Africa; Mantaray Africa
类目: Computation and Language (cs.CL)
备注:
Abstract:Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region’s complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.
[NLP-60] Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework
【速读】: 该论文旨在解决低资源语言(如孟加拉语)中仇恨言论和煽动性内容识别难题,尤其针对社交媒体上具有文化特异性、讽刺性和隐晦性的孟加拉语表情包(meme)。现有研究多集中于高资源语言,难以有效处理孟加拉语这类语言中仇恨与煽动性内容的细微差别。为填补这一空白,作者构建了首个区分煽动性内容与直接仇恨言论的孟加拉语表情包数据集Bn-HIB(Bangla Hate Inflammatory Benign),包含3,247条人工标注样本。解决方案的关键在于提出MCFM(Multi-Modal Co-Attention Fusion Model),该模型通过跨模态注意力机制协同分析表情包的图文信息,动态融合各模态中的关键特征,从而提升分类精度。实验表明,MCFM在Bn-HIB数据集上显著优于多个前沿多模态模型,验证了其在复杂语境下识别敏感内容的有效性。
链接: https://arxiv.org/abs/2602.22391
作者: Rakib Ullah(1),Mominul islam(2),Md Sanjid Hossain(2),Md Ismail Hossain(2) ((1) Sylhet Engineering College, (2) Daffodil International University)
机构: 未知
类目: Computation and Language (cs.CL)
备注: 6 pages, 8 figures
Abstract:Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is excep- tionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource lan- guages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Significantly, Bn- HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose the MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that mutually analyzes both the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to a more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced this http URL: This work contains material that may be disturbing to some audience members. Viewer discretion is advised.
[NLP-61] Scaling In Not Up? Testing Thick Citation Context Analysis with GPT -5 and Frag ile Prompts
【速读】: 该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)支持解释性引文语境分析(Interpretative Citation Context Analysis, CCA)的问题,即在不依赖类型标签扩展的前提下,通过深度文本锚定的方式对单一复杂案例进行细致解读。其解决方案的关键在于构建一个两阶段的GPT-5处理流程:第一阶段为仅基于引文文本的表面分类与预期生成,第二阶段则结合引用文献与被引文献的全文进行跨文档解释重构;同时,通过控制提示词(prompt)的结构化设计和框架变化,在2×3平衡实验中系统考察提示敏感性如何影响模型生成的解释性假设的数量、频率及其词汇分布,从而揭示LLMs作为可检验、可争辩的解释性分析协作者的潜力与风险。
链接: https://arxiv.org/abs/2602.22359
作者: Arno Simons
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 26 pages, 1 figure, 3 tables (plus 17 pages supplement including 1 figure)
Abstract:This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity analysis as a methodological issue by varying prompt scaffolding and framing in a balanced 2x3 design. Using footnote 6 in Chubin and Moitra (1975) and Gilbert’s (1977) reconstruction as a probe, I implement a two-stage GPT-5 pipeline: a citation-text-only surface classification and expectation pass, followed by cross-document interpretative reconstruction using the citing and cited full texts. Across 90 reconstructions, the model produces 450 distinct hypotheses. Close reading and inductive coding identify 21 recurring interpretative moves, and linear probability models estimate how prompt choices shift their frequencies and lexical repertoire. GPT-5’s surface pass is highly stable, consistently classifying the citation as “supplementary”. In reconstruction, the model generates a structured space of plausible alternatives, but scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. Relative to Gilbert, GPT-5 detects the same textual hinges yet more often resolves them as lineage and positioning than as admonishment. The study outlines opportunities and risks of using LLMs as guided co-analysts for inspectable, contestable interpretative CCA, and it shows that prompt scaffolding and framing systematically tilt which plausible readings and vocabularies the model foregrounds.
[NLP-62] Decoder-based Sense Knowledge Distillation
【速读】: 该论文旨在解决生成式语言模型(generative language models)在训练过程中难以有效融入结构化词汇知识(structured lexical knowledge,如词义和词语关系)的问题。现有方法虽能通过引入词义词典提升编码器模型的知识蒸馏效果,但在解码器架构中应用仍面临挑战。其解决方案的关键在于提出一种名为Decoder-based Sense Knowledge Distillation (DSKD) 的新框架,该框架将词汇资源直接整合到解码器类大语言模型(LLM)的训练阶段,无需在推理时进行词典查找,从而在保持高效训练的同时使生成模型能够继承结构化的语义信息。
链接: https://arxiv.org/abs/2602.22351
作者: Qitong Wang,Mohammed J. Zaki,Georgios Kollias,Vasileios Kalantzis
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); IBM Research (IBM 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) learn contextual embeddings that capture rich semantic information, yet they often overlook structured lexical knowledge such as word senses and relationships. Prior work has shown that incorporating sense dictionaries can improve knowledge distillation for encoder models, but their application to decoder as generative models remains challenging. In this paper, we introduce Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates lexical resources into the training of decoder-style LLMs without requiring dictionary lookup at inference time. Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.
[NLP-63] Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads
【速读】: 该论文旨在解决视频广告中“钩子时段”(hooking period)——即前3秒内如何有效吸引用户注意力并提升参与度——的分析难题。由于视频内容具有多模态特性(视觉、听觉与文本信息交织),传统方法难以捕捉其复杂交互机制,导致广告优化策略缺乏数据驱动依据。解决方案的关键在于构建一个基于Transformer架构的多模态大语言模型(Multimodal Large Language Models, MLLMs)框架,结合统一随机采样与关键帧选择两种策略以实现音频特征的均衡提取,并利用BERTopic对生成描述进行主题抽象,同时融合音频属性和广告定向信息作为增强特征,从而在大规模社交平台真实数据上验证了钩子期特征与转化率等核心指标之间的显著相关性,为视频广告的初始设计提供了可量化、可解释且具备预测能力的分析工具。
链接: https://arxiv.org/abs/2602.22299
作者: Kunpeng Zhang,Poppy Zhang,Shawndra Hill,Amel Awadelkarim
机构: University of Marland, College Park (马里兰大学帕克分校); Meta Platforms, Inc. (Meta)
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 5 figures, 3 tables
Abstract:Video-based ads are a vital medium for brands to engage consumers, with social media platforms leveraging user data to optimize ad delivery and boost engagement. A crucial but under-explored aspect is the ‘hooking period’, the first three seconds that capture viewer attention and influence engagement metrics. Analyzing this brief window is challenging due to the multimodal nature of video content, which blends visual, auditory, and textual elements. Traditional methods often miss the nuanced interplay of these components, requiring advanced frameworks for thorough evaluation. This study presents a framework using transformer-based multimodal large language models (MLLMs) to analyze the hooking period of video ads. It tests two frame sampling strategies, uniform random sampling and key frame selection, to ensure balanced and representative acoustic feature extraction, capturing the full range of design elements. The hooking video is processed by state-of-the-art MLLMs to generate descriptive analyses of the ad’s initial impact, which are distilled into coherent topics using BERTopic for high-level abstraction. The framework also integrates features such as audio attributes and aggregated ad targeting information, enriching the feature set for further analysis. Empirical validation on large-scale real-world data from social media platforms demonstrates the efficacy of our framework, revealing correlations between hooking period features and key performance metrics like conversion per investment. The results highlight the practical applicability and predictive power of the approach, offering valuable insights for optimizing video ad strategies. This study advances video ad analysis by providing a scalable methodology for understanding and enhancing the initial moments of video advertisements. Comments: 11 pages, 5 figures, 3 tables Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2602.22299 [cs.MM] (or arXiv:2602.22299v1 [cs.MM] for this version) https://doi.org/10.48550/arXiv.2602.22299 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-64] Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在测试时优化输出质量的问题,尤其是在缺乏外部标量奖励信号(如预训练的评分函数或人工标注标签)的情况下,如何高效地迭代搜索高质量候选解。现有方法依赖于校准的标量评估器来指导搜索,但这类评估器在许多任务中难以获取、稀疏或不可靠。论文提出的解决方案关键在于:使用LLM自身产生的成对偏好(pairwise preferences)替代外部标量奖励,通过贝叶斯Bradley-Terry模型聚合这些噪声偏好以获得不确定性感知的候选质量估计,并利用双 Thompson 采样(Double Thompson Sampling)动态分配比较预算并选择优质父代进行进化生成。该方法无需奖励模型、无地面真值标签、也无需手工设计评分函数,在MathBench和LiveCodeBench上显著优于现有迭代优化方法。
链接: https://arxiv.org/abs/2602.21585
作者: Sweta Karlekar,Carolina Zheng,Magnus Saebo,Nicolas Beltran-Velez,Shuyang Yu,John Bowlan,Michal Kucer,David Blei
机构: Columbia University (哥伦比亚大学); Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
Abstract:Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality parents to generate improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves 20 percentage points higher accuracy over existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by over 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. Results show that pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces.
[NLP-65] Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
【速读】: 该论文旨在解决深度伪造语音(Deepfake speech)中通过替换真实语句中的一个或多个词语为由语音生成模型合成的语义不同词汇所引发的安全威胁问题。其解决方案的关键在于:利用预训练的Whisper模型,通过下一词预测(next-token prediction)机制,在语音转录过程中同步检测合成词,从而实现低成本、高效率的检测;同时引入部分声码化(partially vocoded)语音作为微调数据,以降低数据收集成本,实验证明该方法在域内测试中具有低误检率和转录准确率,且在域外测试中性能可与专用的ResNet检测模型相当,但泛化能力仍有待提升。
链接: https://arxiv.org/abs/2602.22658
作者: Hoan My Tran,Xin Wang,Wanying Ge,Xuechen Liu,Junichi Yamagishi
机构: Université de Rennes (雷恩大学); National Institute of Informatics (日本信息研究所)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:
Abstract:Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models. While a dedicated synthetic word detector could be developed, we investigate a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as the fine-tuning data, thereby reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and transcription error rates. On out-of-domain test data with synthetic words produced by unseen speech generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.
信息检索
[IR-0] AlayaLaser: Efficient Index Layout and Search Strategy for Large-scale High-dimensional Vector Similarity Search SIGMOD2026
【速读】:该论文针对的是基于磁盘的图结构近似最近邻搜索(Approximate Nearest Neighbor Search, ANNS)系统在高维向量检索场景下性能受限的问题,其核心挑战在于传统方法普遍聚焦于降低I/O开销,而忽视了随着向量维度上升(如数百至数千维)时计算成为瓶颈的现象。解决方案的关键在于提出AlayaLaser系统,通过引入一种新颖的磁盘数据布局来缓解计算密集型任务的性能瓶颈,该布局充分利用现代CPU的SIMD指令集提升并行计算效率;同时设计了一系列优化技术,包括基于度数的节点缓存、基于聚类的入口点选择以及早期调度策略,从而显著提升整体检索性能。实验表明,AlayaLaser不仅优于现有磁盘索引系统,甚至可媲美或超越内存中的索引系统。
链接: https://arxiv.org/abs/2602.23342
作者: Weijian Chen,Haotian Liu,Yangshen Deng,Long Xiang,Liang Huang,Gezi Li,Bo Tang
机构: SUSTech(南方科技大学); AlayaDB AI; University of Edinburgh (爱丁堡大学); Huawei (华为)
类目: Databases (cs.DB); Information Retrieval (cs.IR)
备注: The paper has been accepted by SIGMOD 2026
Abstract:On-disk graph-based approximate nearest neighbor search (ANNS) is essential for large-scale, high-dimensional vector retrieval, yet its performance is widely recognized to be limited by the prohibitive I/O costs. Interestingly, we observed that the performance of on-disk graph-based index systems is compute-bound, not I/O-bound, with the rising of the vector data dimensionality (e.g., hundreds or thousands). This insight uncovers a significant optimization opportunity: existing on-disk graph-based index systems universally target I/O reduction and largely overlook computational overhead, which leaves a substantial performance improvement space. In this work, we propose AlayaLaser, an efficient on-disk graph-based index system for large-scale high-dimensional vector similarity search. In particular, we first conduct performance analysis on existing on-disk graph-based index systems via the adapted roofline model, then we devise a novel on-disk data layout in AlayaLaser to effectively alleviate the compute-bound, which is revealed by the above roofline model analysis, by exploiting SIMD instructions on modern CPUs. We next design a suite of optimization techniques (e.g., degree-based node cache, cluster-based entry point selection, and early dispatch strategy) to further improve the performance of AlayaLaser. We last conduct extensive experimental studies on a wide range of large-scale high-dimensional vector datasets to verify the superiority of AlayaLaser. Specifically, AlayaLaser not only surpasses existing on-disk graph-based index systems but also matches or even exceeds the performance of in-memory index systems. Comments: The paper has been accepted by SIGMOD 2026 Subjects: Databases (cs.DB); Information Retrieval (cs.IR) Cite as: arXiv:2602.23342 [cs.DB] (or arXiv:2602.23342v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2602.23342 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-1] Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset
【速读】:该论文旨在解决当前AI赋能的科研工具在真实研究场景中使用模式缺乏系统性理解的问题。其关键解决方案是构建并分析了Asta Interaction Dataset,这是一个包含超过20万条用户查询与交互日志的大规模数据集,来自两个部署在基于大语言模型(Large Language Model, LLM)的检索增强生成平台上的科研工具——文献发现界面和科学问答界面。通过该数据集,作者揭示了用户查询行为、参与模式及使用经验演化规律,发现用户将AI视为协作研究伙伴,并以非线性方式反复利用生成结果和引用证据,从而为未来AI科研助手的设计与评估提供了实证基础和新的查询意图分类体系。
链接: https://arxiv.org/abs/2602.23335
作者: Dany Haddad,Dan Bareket,Joseph Chee Chang,Jay DeYoung,Jena D. Hwang,Uri Katz,Mark Polak,Sangho Suh,Harshit Surana,Aryeh Tiktinsky,Shriya Atmakuri,Jonathan Bragg,Mike D’Arcy,Sergey Feldman,Amal Hassan-Ali,Rubén Lozano,Bodhisattwa Prasad Majumder,Charles McGrady,Amanpreet Singh,Brooke Vlahos,Yoav Goldberg,Doug Downey
机构: Allen Institute for AI (Allen Institute for Artificial Intelligence); Stanford University (斯坦福大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:AI-powered scientific research tools are rapidly being integrated into research workflows, yet the field lacks a clear lens into how researchers use these systems in real-world settings. We present and analyze the Asta Interaction Dataset, a large-scale resource comprising over 200,000 user queries and interaction logs from two deployed tools (a literature discovery interface and a scientific question-answering interface) within an LLM-powered retrieval-augmented generation platform. Using this dataset, we characterize query patterns, engagement behaviors, and how usage evolves with experience. We find that users submit longer and more complex queries than in traditional search, and treat the system as a collaborative research partner, delegating tasks such as drafting content and identifying research gaps. Users treat generated responses as persistent artifacts, revisiting and navigating among outputs and cited evidence in non-linear ways. With experience, users issue more targeted queries and engage more deeply with supporting citations, although keyword-style queries persist even among experienced users. We release the anonymized dataset and analysis with a new query intent taxonomy to inform future designs of real-world AI research assistants and to support realistic evaluation.
[IR-2] SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables ICLR2026
【速读】:该论文旨在解决当前表-文本问答(Table-Text QA)任务中基准数据集规模小、人工标注成本高且问题浅层(通常不超过两跳)的问题,尤其缺乏支持复杂推理操作(如聚合、分组和深层多跳推理)的高质量评测数据。其解决方案的关键在于提出SPARTA框架,该框架通过两个核心创新实现自动构建大规模、高保真度的Table-Text QA数据集:一是利用从非结构化文本中自动提取的原子事实增强源表格,构建参考事实数据库;二是设计基于证明溯源的精炼机制(provenance-based refinement)与现实结构约束(realistic-structure enforcement),确保生成的SQL查询可执行且自然语言化后语义流畅,从而有效支持多跳、聚合等高级分析操作。
链接: https://arxiv.org/abs/2602.23286
作者: Sungho Park,Jueun Kim,Wook-Shin Han
机构: POSTECH(浦项科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注: 10 pages, 5 figures. Published as a conference paper at ICLR 2026. Project page: this https URL
Abstract:Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at this https URL.
[IR-3] Scaling Search Relevance: Augmenting App Store Ranking with LLM -Generated Judgments
【速读】:该论文旨在解决大规模商业搜索系统中文本相关性(textual relevance)标签稀缺的问题,尽管行为相关性(behavioral relevance)标签数据丰富,但高质量的专家标注文本相关性标签难以获取,限制了模型对语义匹配效果的优化。解决方案的关键在于:首先通过系统性评估大语言模型(LLM)配置,发现一个经过专项微调的小型模型在生成文本相关性标签方面显著优于更大的预训练模型;随后利用该最优模型作为“增强器”,自动生成数百万条文本相关性标签以缓解数据稀缺问题;最终将这些标签融入线上排序器,使离线NDCG指标在保持行为相关性的同时提升文本相关性,并通过全球A/B测试验证了转换率提升0.24%,尤其在长尾查询中表现突出,证明了新标签提供的稳健信号价值。
链接: https://arxiv.org/abs/2602.23234
作者: Evangelia Christakopoulou,Vivekkumar Patel,Hemanth Velaga,Sandip Gaikwad
机构: Apple(苹果)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result’s semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.
[IR-4] From Agnostic to Specific: Latent Preference Diffusion for Multi-Behavior Sequential Recommendation
【速读】:该论文旨在解决多行为序列推荐(Multi-behavior Sequential Recommendation, MBSR)中两个核心问题:一是现有方法忽视用户潜在偏好在决策过程中的作用,导致推荐效果不佳;二是传统基于偏好打分的判别范式难以捕捉低熵行为到高熵物品之间的不确定性,限制了推荐的多样性和效率。解决方案的关键在于提出一个基于扩散模型(diffusion model)的框架 FatsMB,其核心创新是通过“从行为无关到行为特定”的潜空间引导偏好生成机制,在统一的用户潜偏好空间中实现跨行为交互与行为感知的信息融合(使用 Behavior-aware RoPE,BaRoPE),并通过多条件引导层归一化(Multi-Condition Guided Layer Normalization, MCGLN)进行去噪建模,从而提升推荐的准确性与多样性。
链接: https://arxiv.org/abs/2602.23132
作者: Ruochen Yang,Xiaodong Li,Jiawei Sheng,Jiangxia Cao,Xinkui Lin,Shen Wang,Shuang Yang,Zhaojie Liu,Tingwen Liu
机构: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, UCAS; Kuaishou Technology
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Multi-behavior sequential recommendation (MBSR) aims to learn the dynamic and heterogeneous interactions of users’ multi-behavior sequences, so as to capture user preferences under target behavior for the next interacted item prediction. Unlike previous methods that adopt unidirectional modeling by mapping auxiliary behaviors to target behavior, recent concerns are shifting from behavior-fixed to behavior-specific recommendation. However, these methods still ignore the user’s latent preference that underlying decision-making, leading to suboptimal solutions. Meanwhile, due to the asymmetric deterministic between items and behaviors, discriminative paradigm based on preference scoring is unsuitable to capture the uncertainty from low-entropy behaviors to high-entropy items, failing to provide efficient and diverse recommendation. To address these challenges, we propose \textbfFatsMB, a framework based diffusion model that guides preference generation \textit\textbfFrom Behavior-\textbfAgnostic \textbfTo Behavior-\textbfSpecific in latent spaces, enabling diverse and accurate \textit\textbfMulti-\textbfBehavior Sequential Recommendation. Specifically, we design a Multi-Behavior AutoEncoder (MBAE) to construct a unified user latent preference space, facilitating interaction and collaboration across Behaviors, within Behavior-aware RoPE (BaRoPE) employed for multiple information fusion. Subsequently, we conduct target behavior-specific preference transfer in the latent space, enriching with informative priors. A Multi-Condition Guided Layer Normalization (MCGLN) is introduced for the denoising. Extensive experiments on real-world datasets demonstrate the effectiveness of our model.
[IR-5] MaRI: Accelerating Ranking Model Inference via Structural Re-parameterization in Large Scale Recommendation System
【速读】:该论文旨在解决大规模推荐系统中排序模型(ranking models)在满足在线服务低延迟要求时,因结构轻量化或知识蒸馏等加速手段导致精度显著下降的问题。其解决方案的关键在于提出一种名为MaRI(Matrix Re-parameterized Inference)的框架,通过结构重参数化(structural reparameterization)优化特征融合中的矩阵乘法计算,识别并消除用户侧计算冗余,从而实现无损加速(lossless acceleration)。
链接: https://arxiv.org/abs/2602.23105
作者: Yusheng Huang,Pengbo Xu,Shen Wang,Changxin Lao,Jiangxia Cao,Shuang Wen,Shuang Yang,Zhaojie Liu,Han Li,Kun Gai
机构: Kuaishou Technology(快手科技)
类目: Information Retrieval (cs.IR)
备注: Work in progress
Abstract:Ranking models, i.e., coarse-ranking and fine-ranking models, serve as core components in large-scale recommendation systems, responsible for scoring massive item candidates based on user preferences. To meet the stringent latency requirements of online serving, structural lightweighting or knowledge distillation techniques are commonly employed for ranking model acceleration. However, these approaches typically lead to a non-negligible drop in accuracy. Notably, the angle of lossless acceleration by optimizing feature fusion matrix multiplication, particularly through structural reparameterization, remains underexplored. In this paper, we propose MaRI, a novel Matrix Re-parameterized Inference framework, which serves as a complementary approach to existing techniques while accelerating ranking model inference without any accuracy loss. MaRI is motivated by the observation that user-side computation is redundant in feature fusion matrix multiplication, and we therefore adopt the philosophy of structural reparameterization to alleviate such redundancy.
[IR-6] CiteLLM : An Agent ic Platform for Trustworthy Scientific Reference Discovery
【速读】:该论文旨在解决生成式 AI 在学术研究中应用时面临的三大伦理挑战:AI 生成内容的可信度、学术诚信与知识产权保护,以及信息隐私安全。其解决方案的关键在于提出 CiteLLM——一个专为引用发现设计的智能体平台,通过将大语言模型(Large Language Models, LLMs)能力嵌入 LaTeX 编辑器环境,实现本地化部署以保障数据不出系统;同时采用动态学科感知路由机制,仅从受信任的学术网络资源中检索候选文献,并利用 LLM 生成上下文相关的搜索查询、按相关性排序及通过段落级语义匹配验证支持证据,从而确保引用结果无幻觉且高度可用。
链接: https://arxiv.org/abs/2602.23075
作者: Mengze Hong,Di Jiang,Chen Jason Zhang,Zichang Guo,Yawen Li,Jun Chen,Shaobo Cui,Zhiyang Su
机构: Hong Kong Polytechnic University (香港理工大学); Beijing University of Posts and Telecommunications (北京邮电大学); Independent Researcher (独立研究员); Swiss Federal Technology Institute of Lausanne (EPFL) (瑞士洛桑联邦理工学院); Hong Kong University of Science and Technology (HKUST) (香港科技大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted by TheWebConf 2026 Demo Track
Abstract:Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethical deployment of AI assistance, including (1) the trustworthiness of AI-generated content, (2) preservation of academic integrity and intellectual property, and (3) protection of information privacy. In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements. The system introduces a novel interaction paradigm by embedding LLM utilities directly within the LaTeX editor environment, ensuring a seamless user experience and no data transmission outside the local system. To guarantee hallucination-free references, we employ dynamic discipline-aware routing to retrieve candidates exclusively from trusted web-based academic repositories, while leveraging LLMs solely for generating context-aware search queries, ranking candidates by relevance, and validating and explaining support through paragraph-level semantic matching and an integrated chatbot. Evaluation results demonstrate the superior performance of the proposed system in returning valid and highly usable references.
[IR-7] MoDora: Tree-Based Semi-Structured Document Analysis System SIGMOD2026
【速读】:该论文旨在解决半结构化文档(semi-structured documents)中自然语言问答(Natural Language Question Answering, NLQA)的难题,其核心挑战包括:(1)光学字符识别(OCR)提取的元素碎片化且丧失原始语义上下文;(2)现有方法缺乏有效表示来建模文档内的层次结构(如表格与嵌套章节标题的关系)和布局特异性差异(如侧边栏与正文的区别);(3)问题回答常需跨区域或跨页的信息检索与对齐(如将描述性段落与分散在文档其他位置的表格单元格关联)。解决方案的关键在于提出 MoDora 系统,其创新点包括:(1)采用局部对齐聚合策略将 OCR 解析元素转化为布局感知组件,并对具有层级标题或非文本元素的组件进行类型特定信息抽取;(2)设计 Component-Correlation Tree (CCTree) 以自底向上级联摘要过程显式建模组件间关系及布局差异;(3)引入基于问题类型的检索策略,支持基于布局的网格分区用于定位检索,以及基于大语言模型(LLM)引导的剪枝策略实现语义检索。实验表明,MoDora 在准确率上相较基线提升 5.97%–61.07%。
链接: https://arxiv.org/abs/2602.23061
作者: Bangrui Xu,Qihang Yao,Zirui Tang,Xuanhe Zhou,Yeye He,Shihan Yu,Qianqian Xu,Bin Wang,Guoliang Li,Conghui He,Fan Wu
机构: Shanghai Jiao Tong University (上海交通大学); Microsoft Research; Beihang University (北京航空航天大学); Shanghai AI Lab; Tsinghua University (清华大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
备注: Extension of our SIGMOD 2026 paper. Please refer to source code available at this https URL
Abstract:Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real-world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout-specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these issues, we propose MoDora, an LLM-powered system for semi-structured document analysis. First, we adopt a local-alignment aggregation strategy to convert OCR-parsed elements into layout-aware components, and conduct type-specific information extraction for components with hierarchical titles or non-text elements. Second, we design the Component-Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter-component relations and layout distinctions through a bottom-up cascade summarization process. Finally, we propose a question-type-aware retrieval strategy that supports (1) layout-based grid partitioning for location-based retrieval and (2) LLM-guided pruning for semantic-based retrieval. Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy. The code is at this https URL. Comments: Extension of our SIGMOD 2026 paper. Please refer to source code available at this https URL Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG) Cite as: arXiv:2602.23061 [cs.IR] (or arXiv:2602.23061v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2602.23061 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-8] Sequential Regression for Continuous Value Prediction using Residual Quantization
【速读】:该论文旨在解决工业级推荐系统中连续值预测(如用户观看时长和电商GMV估计)面临的挑战,即真实数据分布高度复杂且呈长尾状,而现有生成式方法依赖刚性的参数化分布假设,导致在实际场景中性能受限。其核心解决方案是提出一种基于残差量化(Residual Quantization, RQ)的序列学习框架,将目标连续值表示为有序量化码的累加和,并通过从粗到细的递归预测逐步降低量化误差;同时引入一种对齐量化码嵌入空间与目标值序结构的表示学习目标,使模型能够捕捉量化码的连续表示,从而显著提升预测精度与跨任务泛化能力。
链接: https://arxiv.org/abs/2602.23012
作者: Runpeng Cui,Zhipeng Sun,Chi Lu,Peng Jiang
机构: Kuaishou Technology(快手科技)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Continuous value prediction plays a crucial role in industrial-scale recommendation systems, including tasks such as predicting users’ watch-time and estimating the gross merchandise value (GMV) in e-commerce transactions. However, it remains challenging due to the highly complex and long-tailed nature of the data distributions. Existing generative approaches rely on rigid parametric distribution assumptions, which fundamentally limits their performance when such assumptions misalign with real-world data. Overly simplified forms cannot adequately model real-world complexities, while more intricate assumptions often suffer from poor scalability and generalization. To address these challenges, we propose a residual quantization (RQ)-based sequence learning framework that represents target continuous values as a sum of ordered quantization codes, predicted recursively from coarse to fine granularity with diminishing quantization errors. We introduce a representation learning objective that aligns RQ code embedding space with the ordinal structure of target values, allowing the model to capture continuous representations for quantization codes and further improving prediction accuracy. We perform extensive evaluations on public benchmarks for lifetime value (LTV) and watch-time prediction, alongside a large-scale online experiment for GMV prediction on an industrial short-video recommendation platform. The results consistently show that our approach outperforms state-of-the-art methods, while demonstrating strong generalization across diverse continuous value prediction tasks in recommendation systems. Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2602.23012 [cs.IR] (or arXiv:2602.23012v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2602.23012 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-9] SIGMA: A Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender at AliExpress
【速读】:该论文旨在解决当前生成式推荐系统(Generative Recommendation)在实际应用中面临的两大挑战:一是现有方法多局限于基于交互数据的下一物品预测范式,难以快速适应动态趋势;二是缺乏对多样化推荐任务及业务需求的灵活支持。解决方案的关键在于提出SIGMA——一个语义 grounded 的指令驱动型多任务生成式推荐框架,其核心创新包括:通过统一潜在空间建模物品实体的语义与协同关系,实现通用语义层面的实体锚定;设计混合物品标记化方法以提升生成精度与效率;构建大规模多任务监督微调(SFT)数据集以支持指令跟随能力;并引入三步生成流程与自适应概率融合机制,根据任务特性动态调整输出分布,从而在推荐准确率与多样性之间取得平衡。
链接: https://arxiv.org/abs/2602.22913
作者: Yang Yu,Lei Kou,Huaikuan Yi,Bin Chen,Yayu Cao,Lei Shen,Chao Zhang,Bing Wang,Xiaoyi Zeng
机构: Alibaba International Digital Commercial Group(阿里巴巴国际数字商业集团)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:With the rapid evolution of Large Language Models, generative recommendation is gradually reshaping the paradigm of recommender systems. However, most existing methods are still confined to the interaction-driven next-item prediction paradigm, failing to rapidly adapt to evolving trends or address diverse recommendation tasks along with business-specific requirements in real-world scenarios. To this end, we present SIGMA, a Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender at AliExpress. Specifically, we first ground item entities in general semantics via a unified latent space capturing both semantic and collaborative relations. Building upon this, we develop a hybrid item tokenization method for precise modeling and efficient generation. Moreover, we construct a large-scale multi-task SFT dataset to empower SIGMA to fulfill various recommendation demands via instruction-following. Finally, we design a three-step item generation procedure integrated with an adaptive probabilistic fusion mechanism to calibrate the output distributions based on task-specific requirements for recommendation accuracy and diversity. Extensive offline experiments and online A/B tests demonstrate the effectiveness of SIGMA.
[IR-10] PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised MMEA KDD
【速读】:该论文旨在解决多模态实体对齐(Multimodal Entity Alignment, MMEA)中因缺乏标注种子对(labeled seed pairs)而导致的性能瓶颈问题,尤其是在无监督场景下伪种子(pseudo-seeds)质量不高且在知识图谱中覆盖不均的问题。解决方案的关键在于提出PSQE(Pseudo-Seed Quality Enhancement)模块,通过融合多模态信息与聚类-重采样策略,同时提升伪种子的精度和图结构覆盖的平衡性;理论分析进一步揭示了伪种子在对比学习中的吸引力与排斥力双重作用机制,并指出不平衡覆盖会导致模型偏向高密度区域而削弱对稀疏区域实体的学习能力,实验验证了该方法可作为即插即用模块显著提升基线模型性能。
链接: https://arxiv.org/abs/2602.22903
作者: Yunpeng Hong,Chenyang Bu,Jie Zhang,Yi He,Di Wu,Xindong Wu
机构: Hefei University of Technology (合肥工业大学); College of William and Mary (威廉玛丽学院); Southwest University (西南大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 2026 SIGKDD accept
Abstract:Multimodal Entity Alignment (MMEA) aims to identify equivalent entities across different data modalities, enabling structural data integration that in turn improves the performance of various large language model applications. To lift the requirement of labeled seed pairs that are difficult to obtain, recent methods shifted to an unsupervised paradigm using pseudo-alignment seeds. However, unsupervised entity alignment in multimodal settings remains underexplored, mainly because the incorporation of multimodal information often results in imbalanced coverage of pseudo-seeds within the knowledge graph. To overcome this, we propose PSQE (Pseudo-Seed Quality Enhancement) to improve the precision and graph coverage balance of pseudo seeds via multimodal information and clustering-resampling. Theoretical analysis reveals the impact of pseudo seeds on existing contrastive learning-based MMEA models. In particular, pseudo seeds can influence the attraction and the repulsion terms in contrastive learning at once, whereas imbalanced graph coverage causes models to prioritize high-density regions, thereby weakening their learning capability for entities in sparse regions. Experimental results validate our theoretical findings and show that PSQE as a plug-and-play module can improve the performance of baselines by considerable margins.
[IR-11] Generative Recommendation for Large-Scale Advertising
【速读】:该论文旨在解决大规模广告场景下生成式推荐(Generative Recommendation)在实时部署中面临的模型扩展性与推理效率瓶颈问题,尤其是在保持业务价值导向的前提下实现高吞吐量的在线服务。其核心解决方案是提出GR4AD(Generative Recommendation for ADdvertising),通过架构、学习和推理三方面协同设计:1)引入统一广告语义ID(UA-SID)实现复杂业务信息的高效编码;2)提出懒惰自回归解码器(LazyAR)缓解层间依赖,降低多候选生成的推理开销;3)采用基于价值感知的监督学习(VSL)与排序引导的软最大偏好优化(RSPO)算法,使模型优化更贴近线上业务指标;4)设计动态束宽服务机制(dynamic beam serving),按需调整生成过程中的计算资源分配,从而在固定算力预算下实现推理效率与效果的平衡。
链接: https://arxiv.org/abs/2602.22732
作者: Ben Xue,Dan Liu,Lixiang Wang,Mingjie Sun,Peng Wang,Pengfei Zhang,Shaoyun Shi,Tianyu Xu,Yunhao Sha,Zhiqiang Liu,Bo Kong,Bo Wang,Hang Yang,Jieting Xue,Junhao Wang,Shengyu Wang,Shuping Hui,Wencai Ye,Xiao Lin,Yongzhi Li,Yuhang Chen,Zhihui Yin,Quan Chen,Shiyang Wen,Wenjin Wu,Han Li,Guorui Zhou,Changcheng Li,Peng Jiang
机构: Kuaishou Technology(快手科技)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 13 pages, 6 figures, under review
Abstract:Generative recommendation has recently attracted widespread attention in industry due to its potential for scaling and stronger model capacity. However, deploying real-time generative recommendation in large-scale advertising requires designs beyond large-language-model (LLM)-style training and serving recipes. We present a production-oriented generative recommender co-designed across architecture, learning, and serving, named GR4AD (Generative Recommendation for ADdvertising). As for tokenization, GR4AD proposes UA-SID (Unified Advertisement Semantic ID) to capture complicated business information. Furthermore, GR4AD introduces LazyAR, a lazy autoregressive decoder that relaxes layer-wise dependencies for short, multi-candidate generation, preserving effectiveness while reducing inference cost, which facilitates scaling under fixed serving budgets. To align optimization with business value, GR4AD employs VSL (Value-Aware Supervised Learning) and proposes RSPO (Ranking-Guided Softmax Preference Optimization), a ranking-aware, list-wise reinforcement learning algorithm that optimizes value-based rewards under list-level metrics for continual online updates. For online inference, we further propose dynamic beam serving, which adapts beam width across generation levels and online load to control compute. Large-scale online A/B tests show up to 4.2% ad revenue improvement over an existing DLRM-based stack, with consistent gains from both model scaling and inference-time scaling. GR4AD has been fully deployed in Kuaishou advertising system with over 400 million users and achieves high-throughput real-time serving.
[IR-12] Vectorizing the Trie: Efficient Constrained Decoding for LLM -based Generative Retrieval on Accelerators
【速读】:该论文旨在解决生成式推荐(Generative Retrieval)在工业级部署中面临的两个核心问题:一是如何在保持高吞吐量的同时实现对输出空间的严格约束(如内容新鲜度或品类限制),而标准自回归解码无法原生支持此类约束;二是现有基于前缀树(Trie)的约束解码方法在TPU/GPU等硬件加速器上存在严重延迟开销。解决方案的关键在于提出STATIC(Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding),其通过将前缀树结构扁平化为静态压缩稀疏行(CSR)矩阵,将不规则的树遍历操作转化为完全向量化稀疏矩阵运算,从而显著提升硬件加速器上的计算效率。此设计使约束解码的延迟仅增加0.033毫秒/步,且在多种实际配置下保持极低运行时开销,实现了首个生产规模的严格约束生成式检索系统部署。
链接: https://arxiv.org/abs/2602.22647
作者: Zhengyang Su,Isay Katsman,Yueqi Wang,Ruining He,Lukasz Heldt,Raghunandan Keshavan,Shao-Chuan Wang,Xinyang Yi,Mingyan Gao,Onkar Dalal,Lichan Hong,Ed Chi,Ningren Han
机构: YouTube(YouTube); Google DeepMind(谷歌深度挖掘); Yale University(耶鲁大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 4 figures
Abstract:Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g. enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (Tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large-scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production-scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval. Our code is available at this https URL.
[IR-13] Fine-grained Semantics Integration for Large Language Model-based Recommendation
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的推荐系统中两个关键挑战:一是语义无意义的初始化问题(Semantically Meaningless Initialization),即初始时语义标识符(Semantic Identifiers, SIDs)token随机初始化,导致SID空间与预训练语言空间之间缺乏语义关联;二是粗粒度对齐问题(Coarse-grained Alignment),现有监督微调(SFT-based)方法主要关注物品级优化,忽视了SID序列内部token的细粒度语义一致性。解决方案的关键在于提出TS-Rec框架,其核心由两个组件构成:(1) 语义感知嵌入初始化(Semantic-Aware embedding Initialization, SA-Init),通过教师模型提取关键词并采用均值池化初始化SID token嵌入,增强初始语义连贯性;(2) Token级语义对齐(Token-level Semantic Alignment, TS-Align),将SID序列中的每个token与对应物品聚类的共享语义对齐,从而实现细粒度语义融合。实验证明,该方法显著提升了LLM驱动的生成式推荐系统的性能。
链接: https://arxiv.org/abs/2602.22632
作者: Jiawen Feng,Xiaoyu Kong,Leheng Sheng,Bin Wu,Chao Yi,Feifang Yang,Xiang-Rong Sheng,Han Zhu,Xiang Wang,Jiancan Wu,Xiangnan He
机构: University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学); Wuhan University (武汉大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Recent advances in Large Language Models (LLMs) have shifted in recommendation systems from the discriminative paradigm to the LLM-based generative paradigm, where the recommender autoregressively generates sequences of semantic identifiers (SIDs) for target items conditioned on historical interaction. While prevalent LLM-based recommenders have demonstrated performance gains by aligning pretrained LLMs between the language space and the SID space, modeling the SID space still faces two fundamental challenges: (1) Semantically Meaningless Initialization: SID tokens are randomly initialized, severing the semantic linkage between the SID space and the pretrained language space at start point, and (2) Coarse-grained Alignment: existing SFT-based alignment tasks primarily focus on item-level optimization, while overlooking the semantics of individual tokens within SID this http URL address these challenges, we propose TS-Rec, which can integrate Token-level Semantics into LLM-based Recommenders. Specifically, TS-Rec comprises two key components: (1) Semantic-Aware embedding Initialization (SA-Init), which initializes SID token embeddings by applying mean pooling to the pretrained embeddings of keywords extracted by a teacher model; and (2) Token-level Semantic Alignment (TS-Align), which aligns individual tokens within the SID sequence with the shared semantics of the corresponding item clusters. Extensive experiments on two real-world benchmarks demonstrate that TS-Rec consistently outperforms traditional and generative baselines across all standard metrics. The results demonstrate that integrating fine-grained semantic information significantly enhances the performance of LLM-based generative recommenders.
[IR-14] Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在零样本文档重排序(zero-shot document re-ranking)中因依赖生成式评分或输出概率 logits 而导致的推理延迟高和结果一致性差的问题。现有方法如上下文内重排序(In-Context Re-ranking, ICR)虽通过提取内部注意力信号实现 O(1) 复杂度,但未深入挖掘不同 Transformer 层间信号的贡献差异及其跨架构一致性。论文的关键解决方案是提出一种选择性 ICR(Selective-ICR)策略,基于发现的“钟形曲线”分布规律——即相关性信号在各层呈普遍的非均匀分布——仅选取关键层的注意力信号进行计算,从而在不牺牲效果的前提下将推理延迟降低 30%-50%。该方法重新定义了基于 LLM 的重排序任务中效率与效果的权衡边界,并揭示了内部注意力信号在复杂推理型排序任务中的潜在价值。
链接: https://arxiv.org/abs/2602.22591
作者: Haodong Chen,Shengyao Zhuang,Zheng Yao,Guido Zuccon,Teerapong Leelanupab
机构: The University of Queensland(昆士兰大学)
类目: Information Retrieval (cs.IR)
备注: 10 pages, 5 figures, 1 table. Code available at this https URL
Abstract:Zero-shot document re-ranking with Large Language Models (LLMs) has evolved from Pointwise methods to Listwise and Setwise approaches that optimize computational efficiency. Despite their success, these methods predominantly rely on generative scoring or output logits, which face bottlenecks in inference latency and result consistency. In-Context Re-ranking (ICR) has recently been proposed as an O(1) alternative method. ICR extracts internal attention signals directly, avoiding the overhead of text generation. However, existing ICR methods simply aggregate signals across all layers; layer-wise contributions and their consistency across architectures have been left unexplored. Furthermore, no unified study has compared internal attention with traditional generative and likelihood-based mechanisms across diverse ranking frameworks under consistent conditions. In this paper, we conduct an orthogonal evaluation of generation, likelihood, and internal attention mechanisms across multiple ranking frameworks. We further identify a universal “bell-curve” distribution of relevance signals across transformer layers, which motivates the proposed Selective-ICR strategy that reduces inference latency by 30%-50% without compromising effectiveness. Finally, evaluation on the reasoning-intensive BRIGHT benchmark shows that precisely capturing high-quality in-context attention signals fundamentally reduces the need for model scaling and reinforcement learning: a zero-shot 8B model matches the performance of 14B reinforcement-learned re-rankers, while even a 0.6B model outperforms state-of-the-art generation-based approaches. These findings redefine the efficiency-effectiveness frontier for LLM-based re-ranking and highlight the latent potential of internal signals for complex reasoning ranking tasks. Our code and results are publicly available at this https URL. Comments: 10 pages, 5 figures, 1 table. Code available at this https URL Subjects: Information Retrieval (cs.IR) ACMclasses: H.3.3 Cite as: arXiv:2602.22591 [cs.IR] (or arXiv:2602.22591v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2602.22591 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-15] Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agent ic RAG Training
【速读】:该论文旨在解决传统单轮检索增强生成(Retrieval-Augmented Generation, RAG)在复杂多步推理任务中表现不足的问题,以及当前基于强化学习(Reinforcement Learning, RL)的智能体式RAG(Agentic RAG)训练中存在的稀疏奖励信号和低样本效率问题。其解决方案的关键在于提出Search-P1框架,通过两个核心组件实现:(1) 路径中心奖励(Path-Centric Reward),以不依赖顺序的步骤覆盖率和软评分机制从失败样本中提取学习信号,从而缓解稀疏奖励问题;(2) 双轨路径评分机制(Dual-Track Path Scoring),结合离线生成的参考规划器,从自洽性和参考对齐两个角度评估推理路径质量,显著提升训练效率与模型性能。实验表明,Search-P1在多个问答基准上相较Search-R1等强基线平均准确率提升7.7点。
链接: https://arxiv.org/abs/2602.22576
作者: Tianle Xia,Ming Xu,Lingxiang Hu,Yiding Sun,Wenwei Li,Linfang Shang,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang
机构: Tencent(腾讯)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
[IR-16] owards Dynamic Dense Retrieval with Routing Strategy
【速读】:该论文旨在解决密集检索(Dense Retrieval, DR)在新任务应用中的两大局限性:一是当训练数据有限时,难以将DR模型适配到新领域;二是旧模型被全新训练的新模型替换时,缺乏高效更新机制,尤其在频繁更新场景下成本高昂。解决方案的关键在于提出一种新型密集检索方法——动态密集检索(Dynamic Dense Retrieval, DDR),其核心是采用前缀微调(Prefix Tuning)构建针对特定领域的专用模块,并结合动态路由策略实现模块的组合式使用,从而在仅使用2%训练参数的情况下,在六个零样本下游任务上超越传统DR方法,显著提升了检索系统的灵活性与适应性。
链接: https://arxiv.org/abs/2602.22547
作者: Zhan Su,Fengran Mo,Jinghan Zhang,Yuchen Hui,Jia Ao Sun,Bingbing Wen,Jian-Yun Nie
机构: Université de Montréal(蒙特利尔大学); Clemson University(克莱姆森大学); University of Washington(华盛顿大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:The \textitde facto paradigm for applying dense retrieval (DR) to new tasks involves fine-tuning a pre-trained model for a specific task. However, this paradigm has two significant limitations: (1) It is difficult adapt the DR to a new domain if the training dataset is limited. (2) Old DR models are simply replaced by newer models that are trained from scratch when the former are no longer up to date. Especially for scenarios where the model needs to be updated frequently, this paradigm is prohibitively expensive. To address these challenges, we propose a novel dense retrieval approach, termed \textitdynamic dense retrieval (DDR). DDR uses \textitprefix tuning as a \textitmodule specialized for a specific domain. These modules can then be compositional combined with a dynamic routing strategy, enabling highly flexible domain adaptation in the retrieval part. Extensive evaluation on six zero-shot downstream tasks demonstrates that this approach can surpass DR while utilizing only 2% of the training parameters, paving the way to achieve more flexible dense retrieval in IR. We see it as a promising future direction for applying dense retrieval to various tasks. Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2602.22547 [cs.IR] (or arXiv:2602.22547v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2602.22547 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-17] Generative Agents Navigating Digital Libraries
【速读】:该论文旨在解决数字图书馆研究中因隐私顾虑导致的用户搜索行为数据稀缺问题(scarcity of publicly available datasets on user search patterns due to privacy concerns)。其解决方案的关键在于提出Agent4DL,一个专为数字图书馆环境设计的用户搜索行为模拟器,能够生成具有真实用户特征和动态搜索会话的模拟数据,包括查询、点击和停止等行为,并通过与真实用户数据对比验证了其高保真度,尤其在生成多样化和情境感知的用户行为方面优于现有模拟工具如SimIIR 2.0。
链接: https://arxiv.org/abs/2602.22529
作者: Saber Zerhoudi,Michael Granitzer
机构: University of Passau (帕绍大学); IT:U Austria (IT:U 奥地利)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注:
Abstract:In the rapidly evolving field of digital libraries, the development of large language models (LLMs) has opened up new possibilities for simulating user behavior. This innovation addresses the longstanding challenge in digital library research: the scarcity of publicly available datasets on user search patterns due to privacy concerns. In this context, we introduce Agent4DL, a user search behavior simulator specifically designed for digital library environments. Agent4DL generates realistic user profiles and dynamic search sessions that closely mimic actual search strategies, including querying, clicking, and stopping behaviors tailored to specific user profiles. Our simulator’s accuracy in replicating real user interactions has been validated through comparisons with real user data. Notably, Agent4DL demonstrates competitive performance compared to existing user search simulators such as SimIIR 2.0, particularly in its ability to generate more diverse and context-aware user behaviors.
[IR-18] FPS: A Temporal Filtration-enhanced Positive Sample Set Construction Method for Implicit Collaborative Filtering
【速读】:该论文旨在解决现有协同过滤(Collaborative Filtering, CF)推荐模型在基于隐式反馈训练时,负采样策略优化不足且忽视正样本质量的问题,尤其是当前方法未充分考虑用户行为的时间信息,导致难以准确捕捉用户的实时偏好。解决方案的关键在于提出一种新颖的时间滤波增强方法(Temporal Filtration-enhanced Positive Sampling, TFPS):首先构建基于交互时间间隔的时间衰减模型,将原始用户-物品二分图转化为加权图;随后通过预定义的滤波操作对加权图进行分层处理;最后设计层级增强策略,从各层子图中构建高质量的正样本集。该方法从数据层面提升正样本质量,并通过理论分析证明其能有效提高Recall@k和NDCG@k指标,同时具备良好的通用性,可与多种隐式CF推荐器或负采样方法集成以进一步提升性能。
链接: https://arxiv.org/abs/2602.22521
作者: Jiayi Wu,Zhengyu Wu,Xunkai Li,Rong-Hua Li,Guoren Wang
机构: Beijing Institute of Technology (北京理工大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:The negative sampling strategy can effectively train collaborative filtering (CF) recommendation models based on implicit feedback by constructing positive and negative samples. However, existing methods primarily optimize the negative sampling process while neglecting the exploration of positive samples. Some denoising recommendation methods can be applied to denoise positive samples within negative sampling strategies, but they ignore temporal information. Existing work integrates sequential information during model aggregation but neglects time interval information, hindering accurate capture of users’ current preferences. To address this problem, from a data perspective, we propose a novel temporal filtration-enhanced approach to construct a high-quality positive sample set. First, we design a time decay model based on interaction time intervals, transforming the original graph into a weighted user-item bipartite graph. Then, based on predefined filtering operations, the weighted user-item bipartite graph is layered. Finally, we design a layer-enhancement strategy to construct a high-quality positive sample set for the layered subgraphs. We provide theoretical insights into why TFPS can improve Recall@k and NDCG@k, and extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed method. Additionally, TFPS can be integrated with various implicit CF recommenders or negative sampling methods to enhance its performance.
[IR-19] MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation
【速读】:该论文旨在解决乳腺X线摄影(screening mammography)在临床实践中存在的高工作量、时间敏感性和文档密集性问题,尤其是放射科医生需将细微的视觉发现转化为一致的BI-RADS评估、乳腺密度分类和结构化报告的挑战。现有基于视觉语言模型(Vision Language Models, VLMs)的图像到文本报告生成方法多依赖封闭云平台或紧耦合架构,限制了隐私保护、可复现性和适应性。解决方案的关键在于提出MammoWise——一个本地部署的多模态流水线框架,能够将开源VLM(如MedGemma、LLaVA-Med和Qwen2.5-VL)转化为乳腺X线报告生成器与多任务分类器,支持零样本、少样本及思维链(Chain-of-Thought)提示,并可选地通过向量数据库实现针对病例的多模态检索增强生成(Retrieval Augmented Generation, RAG)。实验表明,结合参数高效微调(QLoRA)后,MedGemma在BI-RADS分类、乳腺密度和钙化检测上的准确率分别达到0.7545、0.8840和0.9341,同时保持高质量报告生成能力,从而提供了一个统一、可复现且可扩展的本地化VLM部署方案。
链接: https://arxiv.org/abs/2602.22462
作者: Raiyan Jahangir,Nafiz Imtiaz Khan,Amritanand Sudheerkumar,Vladimir Filkov
机构: University of California, Davis, CA, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: arXiv preprint (submitted 25 Feb 2026). Local multi-model pipeline for mammography report generation + classification using prompting, multimodal RAG (ChromaDB), and QLoRA fine-tuning; evaluates MedGemma, LLaVA-Med, Qwen2.5-VL on VinDr-Mammo and DMID; reports BERTScore/ROUGE-L and classification metrics
Abstract:Screening mammography is high volume, time sensitive, and documentation heavy. Radiologists must translate subtle visual findings into consistent BI-RADS assessments, breast density categories, and structured narrative reports. While recent Vision Language Models (VLMs) enable image-to-text reporting, many rely on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. We present MammoWise, a local multi-model pipeline that transforms open source VLMs into mammogram report generators and multi-task classifiers. MammoWise supports any Ollama-hosted VLM and mammography dataset, and enables zero-shot, few-shot, and Chain-of-Thought prompting, with optional multimodal Retrieval Augmented Generation (RAG) using a vector database for case-specific context. We evaluate MedGemma, LLaVA-Med, and Qwen2.5-VL on VinDr-Mammo and DMID datasets, assessing report quality (BERTScore, ROUGE-L), BI-RADS classification, breast density, and key findings. Report generation is consistently strong and improves with few-shot prompting and RAG. Classification is feasible but sensitive to model and dataset choice. Parameter-efficient fine-tuning (QLoRA) of MedGemma improves reliability, achieving BI-RADS accuracy of 0.7545, density accuracy of 0.8840, and calcification accuracy of 0.9341 while preserving report quality. MammoWise provides a practical and extensible framework for deploying local VLMs for mammography reporting within a unified and reproducible workflow.
[IR-20] RETLLM : Training and Data-Free MLLM s for Multimodal Information Retrieval
【速读】:该论文旨在解决多模态信息检索(Multimodal Information Retrieval, MMIR)中现有方法依赖大规模标注数据和微调训练所带来的预训练不一致性和高成本问题。其解决方案的关键在于提出一种无需训练和数据的框架RetLLM,通过将MMIR建模为相似度分数生成任务,利用多模态大语言模型(Multimodal Large Language Models, MLLMs)直接预测查询与候选项之间的检索得分,并采用“粗粒度到细粒度”的两阶段推理机制:第一阶段使用top-k过滤策略构建高质量候选池以聚焦语义相关项,第二阶段结合查询与候选输入MLLM进行精确得分预测;同时引入视觉增强模块,在推理过程中辅助模型重新关注被忽略的视觉信息,从而显著提升检索性能。实验表明,RetLLM在多个MMIR基准上超越了需微调的模型,验证了MLLM在无需训练情况下即可实现强大检索能力的潜力。
链接: https://arxiv.org/abs/2602.22278
作者: Dawei Su,Dongsheng Wang
机构: 未知
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 5 pages, 2 figure
Abstract:Multimodal information retrieval (MMIR) has gained attention for its flexibility in handling text, images, or mixed queries and candidates. Recent breakthroughs in multimodal large language models (MLLMs) boost MMIR performance by incorporating MLLM knowledge under the contrastive finetuning framework. However, they suffer from pre-training inconsistency and require large datasets. In this work, we introduce a novel framework, RetLLM, designed to query MLLMs for MMIR in a training- and data-free manner. Specifically, we formulate MMIR as a similarity score generation task and prompt MLLMs to directly predict retrieval scores in a coarse-then-fine pipeline. At the coarse stage, a top-k filtering strategy builds a small yet high-quality candidate pool for each query, enabling MLLMs to focus on semantically relevant candidates. Subsequently, the retrieval score is predicted by feeding both the query and candidate into MLLMs at the fine stage. Importantly, we propose a visual enhancement module during reasoning to help MLLMs re-pick forgotten visuals, improving retrieval. Extensive experiments on MMIR benchmarks show that RetLLM outperforms fine-tuned models. Ablation studies further verify each component. Our work demonstrates that MLLMs can achieve strong MMIR performance without any training, highlighting their inherent multimodal reasoning ability in a simple, scalable framework. We release our code at: this https URL
[IR-21] SEGB: Self-Evolved Generative Bidding with Local Autoregressive Diffusion
【速读】:该论文旨在解决当前生成式自动出价(Generative Auto-Bidding)策略在离线训练中缺乏短期前瞻性、难以适应动态市场变化,且通常依赖模拟器或外部专家进行后期优化的问题。解决方案的关键在于提出Self-Evolved Generative Bidding (SEGB) 框架,其核心机制包括两部分:首先通过合成短期未来状态来引导每次出价,赋予智能体关键的动态前瞻能力;其次通过价值驱动的策略精炼(value-guided policy refinement)实现无需外部干预的迭代优化,从而完全在离线数据上完成策略的自我进化与提升。这一自洽闭环设计使得SEGB能够在静态数据基础上实现稳健的策略改进,并在AuctionNet基准和大规模线上A/B测试中显著优于现有最优方法,最终带来10.19%的目标成本提升。
链接: https://arxiv.org/abs/2602.22226
作者: Yulong Gao,Wan Jiang,Mingzhe Cao,Xuepu Wang,Zeyu Pan,Haonan Yang,Ye Liu,Xin Yang
机构: JD.com(京东); Beijing(北京); China(中国)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:In the realm of online advertising, automated bidding has become a pivotal tool, enabling advertisers to efficiently capture impression opportunities in real-time. Recently, generative auto-bidding has shown significant promise, offering innovative solutions for effective ad optimization. However, existing offline-trained generative policies lack the near-term foresight required for dynamic markets and usually depend on simulators or external experts for post-training improvement. To overcome these critical limitations, we propose Self-Evolved Generative Bidding (SEGB), a framework that plans proactively and refines itself entirely offline. SEGB first synthesizes plausible short-horizon future states to guide each bid, providing the agent with crucial, dynamic foresight. Crucially, it then performs value-guided policy refinement to iteratively discover superior strategies without any external intervention. This self-contained approach uniquely enables robust policy improvement from static data alone. Experiments on the AuctionNet benchmark and a large-scale A/B test validate our approach, demonstrating that SEGB significantly outperforms state-of-the-art baselines. In a large-scale online deployment, it delivered substantial business value, achieving a +10.19% increase in target cost, proving the effectiveness of our advanced planning and evolution paradigm.
[IR-22] SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG
【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理长文档时因静态分块(static chunking)和扁平化检索(flat retrieval)所导致的准确性下降、噪声引入及扩展性差的问题。其核心解决方案是提出SmartChunk检索框架,关键在于两个创新组件:一是基于强化学习的规划器(STITCH),能够根据查询动态预测最优的分块抽象层级;二是轻量级压缩模块,在无需重复摘要的情况下生成高层语义嵌入(high-level chunk embeddings),从而实现检索粒度的实时自适应调整。此设计有效平衡了准确性和效率,并在多个基准测试和跨域数据集上展现出显著优于现有RAG基线的效果。
链接: https://arxiv.org/abs/2602.22225
作者: Xuechen Zhang,Koustava Goswami,Samet Oymak,Jiasi Chen,Nedim Lipka
机构: University of Michigan (密歇根大学); Adobe Research (Adobe 研究院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 26 pages, 10 figures
Abstract:Retrieval-augmented generation (RAG) has strong potential for producing accurate and factual outputs by combining language models (LMs) with evidence retrieved from large text corpora. However, current pipelines are limited by static chunking and flat retrieval: documents are split into short, predetermined, fixed-size chunks, embeddings are retrieved uniformly, and generation relies on whatever chunks are returned. This design brings challenges, as retrieval quality is highly sensitive to chunk size, often introduces noise from irrelevant or misleading chunks, and scales poorly to large corpora. We present SmartChunk retrieval, a query-adaptive framework for efficient and robust long-document question answering (QA). SmartChunk uses (i) a planner that predicts the optimal chunk abstraction level for each query, and (ii) a lightweight compression module that produces high-level chunk embeddings without repeated summarization. By adapting retrieval granularity on the fly, SmartChunk balances accuracy with efficiency and avoids the drawbacks of fixed strategies. Notably, our planner can reason about chunk abstractions through a novel reinforcement learning scheme, STITCH, which boosts accuracy and generalization. To reflect real-world applications, where users face diverse document types and query styles, we evaluate SmartChunk on five QA benchmarks plus one out-of-domain dataset. Across these evaluations, SmartChunk outperforms state-of-the-art RAG baselines, while reducing cost. Further analysis demonstrates strong scalability with larger corpora and consistent gains on out-of-domain datasets, highlighting its effectiveness as a general framework for adaptive retrieval.
[IR-23] DS SERVE: A Framework for Efficient and Scalable Neural Retrieval
【速读】:该论文旨在解决大规模文本数据集在实际应用中难以高效检索的问题,尤其是在生成式 AI (Generative AI) 等场景下对低延迟、高精度和多样化结果的需求。解决方案的关键在于提出 DS-Serve 框架,该框架将包含半万亿 token 的大规模文本数据集转化为高性能神经检索系统,通过优化存储与计算结构,在单节点上实现低延迟与适度内存开销,并支持推理时在延迟、准确率和结果多样性之间进行灵活权衡。
链接: https://arxiv.org/abs/2602.22224
作者: Jinjian Liu,Yichuan Wang,Xinxi Lyu,Rulin Shao,Joseph E. Gonzalez,Matei Zaharia,Sewon Min
机构: 1. University of California, Berkeley (加州大学伯克利分校); 2. Microsoft Research (微软研究院); 3. University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:We present DS-Serve, a framework that transforms large-scale text datasets, comprising half a trillion tokens, into a high-performance neural retrieval system. DS-Serve offers both a web interface and API endpoints, achieving low latency with modest memory overhead on a single node. The framework also supports inference-time trade-offs between latency, accuracy, and result diversity. We anticipate that DS-Serve will be broadly useful for a range of applications, including large-scale retrieval-augmented generation (RAG), training data attribution, training search agents, and beyond.
[IR-24] SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
【速读】:该论文旨在解决当前文本到SQL(text-to-SQL)模型在泛化能力上的瓶颈问题,其核心挑战在于缺乏大规模、高复杂度且覆盖多领域的真实数据集,这些数据集需具备足够的模式(schema)多样性、查询复杂性及任务多样性。为应对这一问题,作者提出SQaLe——一个基于真实世界模式集合SchemaPile扩展而来的大型半合成text-to-SQL数据集,包含135,875个关系型数据库模式和517,676个高质量的(问题,模式,SQL查询)三元组。解决方案的关键在于构建了一个系统性的生成流水线,融合了模式采样、问题合成与SQL构造三个步骤,从而在保持执行正确性的前提下,有效模拟真实场景中的模式规模变化、多样化的查询模式以及自然语言歧义,显著提升了数据集的真实性与实用性,推动了text-to-SQL研究中数据规模扩展与模型泛化能力的发展。
链接: https://arxiv.org/abs/2602.22223
作者: Cornelius Wolff,Daniel Gomm,Madelon Hulsebos
机构: Centrum Wiskunde & Informatica (荷兰数学与计算机科学研究中心); University of Amsterdam (阿姆斯特丹大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at the AI for Tabular Data workshop at EurIPS 2025
Abstract:Advances in large language models have accelerated progress in text-to-SQL, methods for converting natural language queries into valid SQL queries. A key bottleneck for developing generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. We establish a principled generation pipeline which combines schema sampling, question synthesis, and SQL construction, and produce 517,676 high-quality (question, schema, query) triples. The SQaLe dataset captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity. We provide an analysis of its contents and characteristics, and find that SQaLe introduces the most realistic large-scale text-to-SQL dataset to date in comparison with existing benchmarks and datasets. We discuss how SQaLe enables our vision for data scaling and model generalization in text-to-SQL research. The dataset is accessible at: this https URL.
[IR-25] Misinformation Exposure in the Chinese Web: A Cross-System Evaluation of Search Engines LLM s and AI Overviews
【速读】:该论文旨在解决生成式 AI(Generative AI)在非英语网络生态系统中事实可靠性不足的问题,特别是针对中文用户真实搜索查询的回答准确性缺乏系统评估的现状。其解决方案的关键在于构建一个基于真实在线搜索日志的中文 Yes/No 事实核查数据集(包含12~161个问题),并设计统一的评估流程,对比传统搜索引擎、独立大语言模型(LLMs)以及AI生成摘要模块三种信息获取范式在事实准确性与主题层面差异的表现。通过结合百度指数等实际用户行为数据,进一步量化了不同地区用户可能暴露于错误信息的风险,从而揭示AI中介搜索中的结构性风险,并强调发展更可靠、透明的信息获取工具的重要性。
链接: https://arxiv.org/abs/2602.22221
作者: Geng Liu,Junjie Mu,Li Feng,Mengxiao Zhu,Francesco Pierri
机构: Politecnico di Milano (米兰理工大学); University of Science and Technology of China (中国科学技术大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Large Language Models (LLMs) are increasingly integrated into search services, providing direct answers that can reduce users’ reliance on traditional result pages. Yet their factual reliability in non-English web ecosystems remains poorly understood, particularly when answering real user queries. We introduce a fact-checking dataset of 12~161 Chinese Yes/No questions derived from real-world online search logs and develop a unified evaluation pipeline to compare three information-access paradigms: traditional search engines, standalone LLMs, and AI-generated overview modules. Our analysis reveals substantial differences in factual accuracy and topic-level variability across systems. By combining this performance with real-world Baidu Index statistics, we further estimate potential exposure to incorrect factual information of Chinese users across regions. These findings highlight structural risks in AI-mediated search and underscore the need for more reliable and transparent information-access tools for the digital world.
[IR-26] What Makes an Ideal Quote? Recommending “Unexpected yet Rational” Quotations via Novelty
【速读】:该论文旨在解决现有引文推荐系统仅关注表面主题相关性,而忽视引文深层语义与审美特性导致推荐结果缺乏记忆点的问题。其解决方案的关键在于提出 NovelQR 框架,通过引入“语境新颖但语义一致”的推荐目标,利用生成式标签代理(generative label agent)将引文及其上下文转化为多维深层含义标签以增强检索,并结合词级新颖度估计器对候选引文进行重排序,从而缓解自回归续写偏差,最终实现更贴切、新颖且更具吸引力的引文推荐效果。
链接: https://arxiv.org/abs/2602.22220
作者: Bowei Zhang,Jin Xiao,Guanglei Yue,Qianyu He,Yanghua Xiao,Deqing Yang,Jiaqing Liang
机构: Fudan University (复旦大学); Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence (上海市数据科学重点实验室,计算机科学与人工智能学院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 36 pages, 16 figures and 13 tables
Abstract:Quotation recommendation aims to enrich writing by suggesting quotes that complement a given context, yet existing systems mostly optimize surface-level topical relevance and ignore the deeper semantic and aesthetic properties that make quotations memorable. We start from two empirical observations. First, a systematic user study shows that people consistently prefer quotations that are ``unexpected yet rational’’ in context, identifying novelty as a key desideratum. Second, we find that strong existing models struggle to fully understand the deep meanings of quotations. Inspired by defamiliarization theory, we therefore formalize quote recommendation as choosing contextually novel but semantically coherent quotations. We operationalize this objective with NovelQR, a novelty-driven quotation recommendation framework. A generative label agent first interprets each quotation and its surrounding context into multi-dimensional deep-meaning labels, enabling label-enhanced retrieval. A token-level novelty estimator then reranks candidates while mitigating auto-regressive continuation bias. Experiments on bilingual datasets spanning diverse real-world domains show that our system recommends quotations that human judges rate as more appropriate, more novel, and more engaging than other baselines, while matching or surpassing existing methods in novelty estimation.
[IR-27] Comparative Analysis of Neural Retriever-Reranker Pipelines for Retrieval-Augmented Generation over Knowledge Graphs in E-commerce Applications
【速读】:该论文旨在解决在电商场景下,如何将结构化知识图谱(Knowledge Graph, KG)有效集成到检索增强生成(Retrieval-Augmented Generation, RAG)系统中以提升自然语言查询的准确性与上下文相关性的问题。其核心挑战在于:一是如何在大规模连接的知识图谱上高效扩展检索能力,二是如何在生成响应时保持实体间的关系连贯性;三是现有交叉编码器(Cross-encoder)方法在结构化数据上的整合尚不充分。解决方案的关键在于设计并比较多种“检索器-重排序器”(Retriever-Reranker)流水线配置,并基于STaRK半结构化知识库(Semi-structured Knowledge Base, SKB)进行优化与评估,从而显著提升Hit@1和平均倒数排名(MRR)指标,为生产环境中部署面向特定领域的RAG系统提供了可落地的技术框架。
链接: https://arxiv.org/abs/2602.22219
作者: Teri Rumble,Zbyněk Gazdík,Javad Zarrin,Jagdeep Ahluwalia
机构: Abertay University (阿伯泰大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This manuscript is under review at the Springer journal Knowledge and Information Systems
Abstract:Recent advancements in Large Language Models (LLMs) have transformed Natural Language Processing (NLP), enabling complex information retrieval and generation tasks. Retrieval-Augmented Generation (RAG) has emerged as a key innovation, enhancing factual accuracy and contextual grounding by integrating external knowledge sources with generative models. Although RAG demonstrates strong performance on unstructured text, its application to structured knowledge graphs presents challenges: scaling retrieval across connected graphs and preserving contextual relationships during response generation. Cross-encoders refine retrieval precision, yet their integration with structured data remains underexplored. Addressing these challenges is crucial for developing domain-specific assistants that operate in production environments. This study presents the design and comparative evaluation of multiple Retriever-Reranker pipelines for knowledge graph natural language queries in e-Commerce contexts. Using the STaRK Semi-structured Knowledge Base (SKB), a production-scale e-Commerce dataset, we evaluate multiple RAG pipeline configurations optimized for language queries. Experimental results demonstrate substantial improvements over published benchmarks, achieving 20.4% higher Hit@1 and 14.5% higher Mean Reciprocal Rank (MRR). These findings establish a practical framework for integrating domain-specific SKBs into generative systems. Our contributions provide actionable insights for the deployment of production-ready RAG systems, with implications that extend beyond e-Commerce to other domains that require information retrieval from structured knowledge bases.
[IR-28] Cybersecurity Data Extraction from Common Crawl
【速读】:该论文旨在解决网络安全领域中高质量域名数据集构建的难题,传统方法依赖迭代内容评分(如DeepSeekMath)效率较低且易受噪声干扰。其解决方案的关键在于利用社区检测(community detection)技术,从Common Crawl网络图结构中一次性挖掘出可信域名,仅需20个可信种子域名即可实现高效、精准的高质量域名提取,从而显著提升数据采集的可靠性和可扩展性。
链接: https://arxiv.org/abs/2602.22218
作者: Ashim Mahara
机构: 未知
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注:
Abstract:Alpha-Root is a cybersecurity-focused dataset collected in a single shot from the Common Crawl web graph using community detection. Unlike iterative content-scoring approaches like DeepSeekMath, we mine quality domains directly from the web graph, starting from just 20 trusted seed domains.
[IR-29] RAG db: A Zero-Dependency Embeddable Architecture for Multimodal Retrieval-Augmented Generation on the Edge
【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)架构在边缘计算、离线环境及隐私敏感场景中因基础设施复杂性过高而难以部署的问题。现有RAG系统依赖云托管的向量数据库、深度学习框架(如PyTorch、CUDA)和高延迟嵌入推理服务,导致“基础设施膨胀”,限制了其在本地化、轻量化AI应用中的落地。解决方案的关键在于提出RAGdb——一种单体式架构,将多模态数据自动摄入、基于ONNX的特征提取与混合向量检索集成于一个可移植的SQLite容器中,并设计了一种确定性的混合评分函数(Hybrid Scoring Function, HSF),结合次线性TF-IDF向量化与精确子串匹配提升,从而在查询时无需GPU推理。实验表明,该方案在消费级笔记本上实现了100% Recall@1的实体召回率,增量更新效率提升31.6倍,且磁盘占用减少约99.5%,确立了“单文件知识容器”作为去中心化、本地优先AI的新范式。
链接: https://arxiv.org/abs/2602.22217
作者: Ahmed Bin Khalid
机构: SKAS IT
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 tables
Abstract:Retrieval-Augmented Generation (RAG) has established itself as the standard paradigm for grounding Large Language Models (LLMs) in domain-specific, up-to-date data. However, the prevailing architecture for RAG has evolved into a complex, distributed stack requiring cloud-hosted vector databases, heavy deep learning frameworks (e.g., PyTorch, CUDA), and high-latency embedding inference servers. This infrastructure bloat'' creates a significant barrier to entry for edge computing, air-gapped environments, and privacy-constrained applications where data sovereignty is paramount. This paper introduces RAGdb, a novel monolithic architecture that consolidates automated multimodal ingestion, ONNX-based extraction, and hybrid vector retrieval into a single, portable SQLite container. We propose a deterministic Hybrid Scoring Function (HSF) that combines sublinear TF-IDF vectorization with exact substring boosting, eliminating the need for GPU inference at query time. Experimental evaluation on an Intel i7-1165G7 consumer laptop demonstrates that RAGdb achieves 100\% Recall@1 for entity retrieval and an ingestion efficiency gain of 31.6x during incremental updates compared to cold starts. Furthermore, the system reduces disk footprint by approximately 99.5\% compared to standard Docker-based RAG stacks, establishing the Single-File Knowledge Container’’ as a viable primitive for decentralized, local-first AI. Keywords: Edge AI, Retrieval-Augmented Generation, Vector Search, Green AI, Serverless Architecture, Knowledge Graphs, Efficient Computing. Comments: 6 pages, 2 tables Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.22217 [cs.IR] (or arXiv:2602.22217v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2602.22217 Focus to learn more arXiv-issued DOI via DataCite
[IR-30] Retrieval-Augmented Generation Assistant for Anatomical Pathology Laboratories
【速读】:该论文旨在解决解剖病理学(Anatomical Pathology, AP)实验室中静态文档(如纸质手册或PDF)存在过时、碎片化和难检索等问题,这些问题可能导致操作流程错误和诊断延迟。为应对这一挑战,研究提出并评估了一种专为AP实验室设计的检索增强生成(Retrieval-Augmented Generation, RAG)助手,其核心解决方案在于:构建了一个包含99个AP协议的新颖语料库及323个问答对,并通过系统性实验验证不同分块策略、检索方法与嵌入模型的效果。关键发现表明,采用递归分块(recursive chunking)与混合检索(hybrid retrieval)可实现最佳基线性能,而引入生物医学专用嵌入模型(MedEmbed)显著提升了答案相关性(0.74)、忠实度(0.70)和上下文召回率(0.77),同时单条最相关片段检索(top-k=1)在效率与准确性之间取得最优平衡,凸显了AP协议模块化结构的特点。此方案有效将静态文档转化为动态、可靠的智能知识助手,有助于提升实验室工作效率并保障患者安全。
链接: https://arxiv.org/abs/2602.22216
作者: Diogo Pires,Yuriy Perezhohin,Mauro Castelli
机构: Nova Information Management School (NOVA IMS), Universidade Nova de Lisboa, Campus de Campolide, 1070-312 Lisboa, Portugal
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate and efficient access to laboratory protocols is essential in Anatomical Pathology (AP), where up to 70% of medical decisions depend on laboratory diagnoses. However, static documentation such as printed manuals or PDFs is often outdated, fragmented, and difficult to search, creating risks of workflow errors and diagnostic delays. This study proposes and evaluates a Retrieval-Augmented Generation (RAG) assistant tailored to AP laboratories, designed to provide technicians with context-grounded answers to protocol-related queries. We curated a novel corpus of 99 AP protocols from a Portuguese healthcare institution and constructed 323 question-answer pairs for systematic evaluation. Ten experiments were conducted, varying chunking strategies, retrieval methods, and embedding models. Performance was assessed using the RAGAS framework (faithfulness, answer relevance, context recall) alongside top-k retrieval metrics. Results show that recursive chunking and hybrid retrieval delivered the strongest baseline performance. Incorporating a biomedical-specific embedding model (MedEmbed) further improved answer relevance (0.74), faithfulness (0.70), and context recall (0.77), showing the importance of domain-specialised embeddings. Top-k analysis revealed that retrieving a single top-ranked chunk (k=1) maximized efficiency and accuracy, reflecting the modular structure of AP protocols. These findings highlight critical design considerations for deploying RAG systems in healthcare and demonstrate their potential to transform static documentation into dynamic, reliable knowledge assistants, thus improving laboratory workflow efficiency and supporting patient safety.
[IR-31] Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在科学创意生成中缺乏可控学术背景和可追溯灵感路径的问题。为实现这一目标,其核心解决方案在于构建一个名为GYWI的科学创意生成系统,该系统通过融合作者知识图谱与检索增强生成(Retrieval-Augmented Generation, RAG),形成外部知识库以提供可控的上下文和灵感溯源路径。关键创新包括:基于作者中心的知识图谱构建方法与灵感来源采样算法、结合RAG与GraphRAG的混合检索机制以获取深度与广度兼具的知识内容、以及引入强化学习原理优化提示(Prompt)策略,从而引导LLMs基于混合上下文自动改进生成结果。
链接: https://arxiv.org/abs/2602.22215
作者: Pengzhen Xie,Huizhi Liang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 15 pages, 10 figures. Submitted to [RAAI]
Abstract:Large Language Models (LLMs) demonstrate potential in the field of scientific idea generation. However, the generated results often lack controllable academic context and traceable inspiration pathways. To bridge this gap, this paper proposes a scientific idea generation system called GYWI, which combines author knowledge graphs with retrieval-augmented generation (RAG) to form an external knowledge base to provide controllable context and trace of inspiration path for LLMs to generate new scientific ideas. We first propose an author-centered knowledge graph construction method and inspiration source sampling algorithms to construct external knowledge base. Then, we propose a hybrid retrieval mechanism that is composed of both RAG and GraphRAG to retrieve content with both depth and breadth knowledge. It forms a hybrid context. Thirdly, we propose a Prompt optimization strategy incorporating reinforcement learning principles to automatically guide LLMs optimizing the results based on the hybrid context. To evaluate the proposed approaches, we constructed an evaluation dataset based on arXiv (2018-2023). This paper also develops a comprehensive evaluation method including empirical automatic assessment in multiple-choice question task, LLM-based scoring, human evaluation, and semantic space visualization analysis. The generated ideas are evaluated from the following five dimensions: novelty, feasibility, clarity, relevance, and significance. We conducted experiments on different LLMs including GPT-4o, DeepSeek-V3, Qwen3-8B, and Gemini 2.5. Experimental results show that GYWI significantly outperforms mainstream LLMs in multiple metrics such as novelty, reliability, and relevance.
[IR-32] Adaptive Prefiltering for High-Dimensional Similarity Search: A Frequency-Aware Approach
【速读】:该论文旨在解决高维相似性搜索(high-dimensional similarity search)中因查询分布异质性导致的静态搜索策略效率低下问题。现有方法采用统一的搜索策略,无法根据实际查询频率和局部密度特征动态调整计算资源分配,从而造成冗余计算或召回率下降。解决方案的关键在于提出一种自适应预过滤框架(adaptive prefiltering framework),通过识别查询频率模式(遵循Zipf分布)与聚类一致性度量(cluster coherence metrics),将查询空间划分为不同频率层级,并基于历史访问模式和局部密度特性分配差异化的搜索策略,实现计算预算的动态优化。实验表明,该方法在ImageNet-1k数据集上使用CLIP嵌入时,在保持亚毫秒级延迟的同时,相较固定nprobe选择可减少20.4%的距离计算量且维持相当的召回率。
链接: https://arxiv.org/abs/2602.22214
作者: Teodor-Ioan Calin
机构: Vulture Labs, Inc.(Vulture Labs公司)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-dimensional similarity search underpins modern retrieval systems, yet uniform search strategies fail to exploit the heterogeneous nature of real-world query distributions. We present an adaptive prefiltering framework that leverages query frequency patterns and cluster coherence metrics to dynamically allocate computational budgets. Our approach partitions the query space into frequency tiers following Zipfian distributions and assigns differentiated search policies based on historical access patterns and local density characteristics. Experiments on ImageNet-1k using CLIP embeddings demonstrate that frequency-aware budget allocation achieves equivalent recall with 20.4% fewer distance computations compared to static nprobe selection, while maintaining sub-millisecond latency on GPU-accelerated FAISS indices. The framework introduces minimal overhead through lightweight frequency tracking and provides graceful degradation for unseen queries through coherence-based fallback policies.
[IR-33] Enriching Taxonomies Using Large Language Models ECAI2025
【速读】:该论文旨在解决现有分类体系(taxonomy)在覆盖范围有限、节点过时或语义模糊等方面的问题,从而影响知识检索的效率与准确性。其解决方案的关键在于提出了一种名为Taxoria的分类体系增强流水线,该方法以一个已有的分类体系作为种子,利用大语言模型(Large Language Models, LLMs)生成候选节点,并通过验证机制消除幻觉、确保语义相关性后进行整合,最终输出带有溯源信息并可可视化分析的增强型分类体系。
链接: https://arxiv.org/abs/2602.22213
作者: Zeinab Ghamlouch,Mehwish Alam
机构: Télécom Paris, Institut Polytechnique de Paris, France
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Published in ECAI 2025 Demo Track
Abstract:Taxonomies play a vital role in structuring and categorizing information across domains. However, many existing taxonomies suffer from limited coverage and outdated or ambiguous nodes, reducing their effectiveness in knowledge retrieval. To address this, we present Taxoria, a novel taxonomy enrichment pipeline that leverages Large Language Models (LLMs) to enhance a given taxonomy. Unlike approaches that extract internal LLM taxonomies, Taxoria uses an existing taxonomy as a seed and prompts an LLM to propose candidate nodes for enrichment. These candidates are then validated to mitigate hallucinations and ensure semantic relevance before integration. The final output includes an enriched taxonomy with provenance tracking and visualization of the final merged taxonomy for analysis.
人机交互
[HC-0] LLM Novice Uplift on Dual-Use In Silico Biology Tasks
【速读】:该论文旨在解决一个关键科学与安全问题:大型语言模型(Large Language Models, LLMs)是否能够提升新手用户在生物学任务中的表现,使其超越仅依赖互联网资源的水平,从而影响科研效率和生物安全风险。研究通过多模型、多基准的人类提升实验设计,对比了具备LLM访问权限的新手与仅使用互联网的新手在八个生物安全相关任务集上的表现,发现LLM显著提升了新手的准确性(平均提升4.16倍),且在部分任务中甚至超过专家水平。解决方案的关键在于采用真实场景下的复杂任务评估(最长耗时达13小时),并引入“人类 uplift”这一核心指标,揭示LLM不仅增强个体能力,还可能降低专业门槛,进而凸显对生成式AI(Generative AI)双用途风险的系统性挑战。
链接: https://arxiv.org/abs/2602.23329
作者: Chen Bo Calvin Zhang,Christina Q. Knight,Nicholas Kruus,Jason Hausenloy,Pedro Medeiros,Nathaniel Li,Aiden Kim,Yury Orlovskiy,Coleman Breen,Bryce Cai,Jasper Götting,Andrew Bo Liu,Samira Nedungadi,Paula Rodriguez,Yannis Yiming He,Mohamed Shaaban,Zifan Wang,Seth Donoughe,Julian Michael
机构: Scale AI; SecureBio; University of Oxford; UC Berkeley
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 59 pages, 33 figures
Abstract:Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users – i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
[HC-1] Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
【速读】:该论文旨在解决资源受限的移动和服务机器人在人-机器人交互(HRI)中实时分配领导-跟随角色的挑战。当前大型语言模型(Large Language Models, LLMs)虽具备自然对话能力,但其计算开销限制了边缘部署;而小型语言模型(Small Language Models, SLMs)作为潜在替代方案,尚未被系统评估用于HRI中的角色分类任务。解决方案的关键在于通过构建一个基于公开数据库并引入合成样本的新颖数据集,对SLMs进行两种适应策略——提示工程(prompt engineering)与微调(fine-tuning),并在零样本(zero-shot)和单样本(one-shot)交互模式下进行对比实验。结果表明,零样本微调在保持低延迟(每样本22.2毫秒)的同时实现了86.66%的高准确率,显著优于基线和提示工程方法,验证了微调后的SLMs在边缘设备上实现高效、可靠角色分配的有效性,同时也揭示了上下文长度增加导致的一次性交互模式下性能下降问题,凸显了对话复杂度与分类可靠性之间的权衡。
链接: https://arxiv.org/abs/2602.23312
作者: Rafael R. Baptista,André de Lima Salgado,Ricardo V. Godoy,Marcelo Becker,Thiago Boaventura,Gustavo J. G. Lahr
机构: University of Sao Paulo (圣保罗大学); Federal University of Lavras (拉弗拉联邦大学); Instituto Israelita de Ensino e Pesquisa, Hospital Israelita Albert Einstein (以色列教育与研究研究所,艾萨克·爱因斯坦医院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
备注:
Abstract:Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model’s architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
[HC-2] BRIDGE: Borderless Reconfiguration for Inclusive and Diverse Gameplay Experience via Embodiment Transformation
【速读】:该论文旨在解决残奥体育(Parasports)中训练资源匮乏的问题,这限制了运动员和教练员对专项运动动作及战术协同的练习与理解。其解决方案的关键在于提出BRIDGE系统,该系统融合了两个核心组件:一是基于广播视频的重建流程(reconstruction pipeline),可检测并追踪运动员在比赛中的行为,生成三维运动序列;二是具身感知可视化框架(embodiment-aware visualization framework),通过分解头部、躯干和轮椅基座的姿态方向,分别表征注意力(attention)、意图(intent)与移动能力(mobility)。实验证明,该方法显著提升了运动员姿态的真实感,并增强了战术意图的可理解性,同时支持功能分级评估,从而提升使用者的自我效能感,推动包容性体育学习与可及性教练实践的发展。
链接: https://arxiv.org/abs/2602.23288
作者: Hayato Saiki,Chunggi Lee,Hikari Takahashi,Tica Lin,Hidetada Kishi,Kaori Tachibana,Yasuhiro Suzuki,Hanspeter Pfister,Kenji Suzuki
机构: University of Tsukuba (筑波大学); Harvard University (哈佛大学); Mejiro University (明治大学); Ibaraki Prefectural University of Health Sciences (茨城县立保健科学院); Institute of Systems and Information Engineering (系统与信息工程研究所); Dolby Laboratories (杜比实验室)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Training resources for parasports are limited, reducing opportunities for athletes and coaches to engage with sport-specific movements and tactical coordination. To address this gap, we developed BRIDGE, a system that integrates a reconstruction pipeline, which detects and tracks players from broadcast video to generate 3D play sequences, with an embodiment-aware visualization framework that decomposes head, trunk, and wheelchair base orientations to represent attention, intent, and mobility. We evaluated BRIDGE in two controlled studies with 20 participants (10 national wheelchair basketball team players and 10 amateur players). The results showed that BRIDGE significantly enhanced the perceived naturalness of player postures and made tactical intentions easier to understand. In addition, it supported functional classification by realistically conveying players’ capabilities, which in turn improved participants’ sense of self-efficacy. This work advances inclusive sports learning and accessible coaching practices, contributing to more equitable access to tactical resources in parasports.
[HC-3] VRSL:Exploring the Comprehensibility of 360-Degree Camera Feeds for Sign Language Communication in Virtual Reality
【速读】:该论文旨在解决如何在虚拟现实(Virtual Reality, VR)环境中有效集成手语(如美国手语,American Sign Language, ASL),以提升听障和重听(Deaf and Hard of Hearing, DHH)用户在虚拟空间中的沟通可及性与体验。其解决方案的关键在于利用佩戴式360度摄像机从不同身体部位(头部、肩部和胸部)录制ASL视频,并评估这些视频在VR中的可理解性和用户体验。研究发现肩部佩戴位置在识别准确率上表现最佳(85%),尽管统计上差异不显著,但结果表明视频基的手语通信在VR中具有可行性(整体理解成功率达83.3%)。同时,研究指出需优化摄像角度、减少全景视频的周边畸变,以进一步改善清晰度,从而推动面向DHH用户的沉浸式协作环境的发展。
链接: https://arxiv.org/abs/2602.23265
作者: Gauri Umesh Rajmane,Ziming Li,Tae Oh,Roshan Peiris
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 16 pages, 3 figures
Abstract:This study explores integrating sign language into virtual reality (VR) by examining the comprehensibility and user experience of viewing American Sign Language (ASL) videos captured with body-mounted 360-degree cameras. Ten participants identified ASL signs from videos recorded at three body-mounted positions: head, shoulder, and chest. Results showed the shoulder-mounted camera achieved the highest accuracy (85%), though differences between positions were not statistically significant. Participants noted that peripheral distortion in 360-degree videos impacted clarity, highlighting areas for improvement. Despite challenges, the overall comprehension success rate of 83.3% demonstrates the potential of video-based ASL communication in VR. Feedback emphasized the need to refine camera angles, reduce distortion, and explore alternative mounting positions. Participants expressed a preference for signing over text-based communication in VR, highlighting the importance of developing this approach to enhance accessibility and collaboration for Deaf and Hard of Hearing (DHH) users in virtual environments.
[HC-4] FuturePrism: Supporting Adolescence in Collaborative Storytelling to Cope with Future Uncertainty
【速读】:该论文旨在解决青少年因执行功能不足而导致对未来不确定性产生焦虑的问题,进而影响其制定具体人生路径的能力。解决方案的关键在于设计了一种由生成式 AI (Generative AI) 驱动的协作叙事系统 FuturePrism,该系统基于 Snyder 的希望理论(Hope Theory),通过三角色扮演机制将认知过程外化为四个叙事章节:目标(The Goal)、机会(The Opportunity)、挑战(The Challenge)和能动性(The Agency)。实验表明,该系统显著提升了青少年的即时希望水平,尤其在能动性维度上效果明显,同时增强了叙事沉浸感与系统可用性,使青少年能够以积极态度应对未来挑战。
链接: https://arxiv.org/abs/2602.23108
作者: Yonglin Chen,Pengcheng An,Xueliang Li
机构: Southern University of Science and Technology (南方科技大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:FuturePrism is a GenAI-empowered collaborative storytelling system designed to scaffold adolescents to navigate future life challenges. Adolescents often suffer from anxiety related to future uncertainty for lacking the executive function to develop concrete pathways. Operationalizing Snyder’s Hope Theory, the system utilizes a triadic role-play mechanics to externalize cognitive processes through four narrative chapters: The Goal, The Opportunity, The Challenge, and The Agency. An evaluation workshop with 20 adolescents demonstrated that FuturePrism significantly enhances momentary hope levels, particularly in the Agency dimension. Participants reported high levels of narrative immersion and positive feedback towards system usability. Participants also confirmed that the AI-scaffolded collaborative storytelling empowered them to develop positive attitudes towards future challenges.
[HC-5] aleBot: A Tangible AI Companion to Support Children in Co-creative Storytelling for Resilience Cultivation
【速读】:该论文试图解决儿童心理韧性(resilience)培养不足的问题,尤其是在日常 adversarial experiences(逆境经历)中缺乏有效的支持机制。当前人机交互(HCI)研究对如何通过技术手段帮助儿童在面对挑战时建立心理韧性关注有限。解决方案的关键在于设计并实现一个名为TaleBot的生成式AI(Generative AI)赋能系统,该系统允许儿童在教师引导下与AI共同创作关于克服日常困境的故事,从而促进其情感表达和自我认知,并为教师提供个性化干预依据、为家长提供理解家庭沟通对孩子心理健康影响的窗口。
链接: https://arxiv.org/abs/2602.23095
作者: Yonglin Chen,Jingjing Zhang,Kezhuo Wang,Pengcheng An,Xueliang Li
机构: Southern University of Science and Technology (南方科技大学); Picto AI Tech (Picto AI科技)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Resilience is a key factor affecting children’s mental wellbeing and future development. Yet, limited HCI research has explored how to help children build resilience through adversarial experiences. Informed by a formative study with elementary school teachers and professional psychologists, we design TaleBot, an AI-empowered system that supports children to co-create stories about overcoming everyday adversities tailored to their personal situations. We evaluated the system with 12 elementary children in school counseling rooms under teacher guidance and conducted reflective interviews with parents upon the Child-AI co-created stories. The findings show that TaleBot encourages children in self-expression of feelings and thoughts, creating opportunities for teachers to provide personalized support and for parents to better understand the profound impact of family communication on children’s mental wellbeing. We conclude with design implications for using generative AI to support children’s mental health education and interventions across school and family contexts.
[HC-6] Beyond Faders: Understanding 6DoF Gesture Ecologies in Music Mixing
【速读】:该论文旨在解决扩展现实(Extended Reality, XR)环境中音乐混音工作流程的交互设计问题,即如何将传统的二维推子操作转化为六自由度(Six-Degree-of-Freedom, 6DoF)手势交互,同时确保这些手势与真实混音实践一致,并支持可管理的认知负荷和积极的用户体验。解决方案的关键在于通过与经验混音师的设计工作坊获取核心音频任务(增益、压缩、均衡和自动化)的手势概念,并在XR原型中实现这些手势;随后通过用户研究(n=12)验证其生态效度,结果表明参与者普遍认为6DoF手势直观且映射合理,认知负荷差异小,沉浸感强,且偏好受工作流熟悉度和控制感知影响,从而为设计兼具表现力、精度与生态有效性的XR混音工具提供了实证依据。
链接: https://arxiv.org/abs/2602.23090
作者: Jeremy Wertheim Co Chen,Rendell Christian Ngo,Cedric Matthew Yu,Hans Emilio Lumagui,Ethan Badayos,Jordan Aiko Deja
机构: De La Salle University - Manila (德拉萨大学-马尼拉校区)
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 2 figuresl, CHI 2026 Poster
Abstract:Extended reality (XR) enables new music-mixing workflows by moving beyond 2D faders toward embodied, spatial interaction. However, it remains unclear which six-degree-of-freedom (6DoF) gestures align with real-world mixing practices and whether such interactions support manageable cognitive load and positive user experience. We conducted a design workshop with experienced mixers to elicit gesture concepts for core audio tasks gain, compression, equalization, and automation, and implemented these in an XR prototype. A user study (n=12) evaluated the ecological validity of the gestures using cognitive load measures, user-experience ratings, and interviews. Participants generally found 6DoF gestures intuitive and well-mapped to mixing tasks, reporting strong immersion and a sense of connection with the audio environment. Cognitive load differences across gestures were minimal, though participants expressed preferences shaped by workflow familiarity and perceived control. We discuss implications for designing XR mixing tools that balance expressiveness, precision, and ecological validity.
[HC-7] Understanding Older Adults Experiences of Support Concerns and Risks from Kinship-Role AI-Generated Influencers
【速读】:该论文旨在解决当前中国短视频平台上AI生成的虚拟人物(如AI孙辈)日益流行背景下,其设计策略、老年用户使用体验及其带来的潜在益处与风险尚不明确的问题。解决方案的关键在于结合社交媒体内容分析与深度访谈方法,系统揭示了这些AI Influencers如何通过视觉和对话线索构建亲属角色,并促使老年人产生情感共鸣与角色扮演行为;同时识别出其在满足信息与情感需求方面的积极作用,以及引发的情感替代和不对等情感投入等风险,从而为理解虚拟身份与现实家庭关系之间的复杂互动提供实证依据,并提出在特定文化语境下利用AI增强社会支持的同时规避潜在伦理风险的路径。
链接: https://arxiv.org/abs/2602.22993
作者: Tianqi Song,Black Sun,Jingshu Li,Han Li,Chi-Lan Yang,Yijia Xu,Yi-Chieh Lee
机构: National University of Singapore(新加坡国立大学); Aarhus University(奥胡斯大学); Cornell University(康奈尔大学); The University of Tokyo(东京大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:AI-generated influencers are rapidly gaining popularity on Chinese short-video platforms, often adopting kinship-based roles such as AI grandchildren to attract older adults. Although this trend has raised public concern, little is known about the design strategies behind these influencers, how older adults experience them, and the benefits and risks involved. In this study, we combined social media analysis with interviews to unpack the above questions. Our findings show that influencers use both visual and conversational cues to enact kinship roles, prompting audiences to engage in kinship-based role-play. Interviews further show that these cues arouse emotional resonance, help fulfill older adults’ informational and emotional needs, while also raising concerns about emotional displacement and unequal emotional investment. We highlight the complex relationship between virtual avatars and real family ties, shaped by broader sociocultural norms, and discuss how AI might strengthen social support for older adults while mitigating risks within cultural contexts.
[HC-8] ableTale: Reviving the Narrative Interplay Between Data Tables and Text in Scientific Papers
【速读】:该论文旨在解决科学论文中数据表(data table)与文本之间通过叙事互动共同构建语义所导致的读者认知负荷过高的问题。其解决方案的关键在于提出一种名为TableTale的增强阅读界面,该界面基于对文本-表格叙事互作机制(包括链接机制、多粒度对齐和提及类型)的深入理解,自动构建文档级链接结构,并以渐进式视觉线索在文本和表格间进行多粒度(段落、句子、提及)的信息映射与呈现,从而降低认知负担并提升阅读效率。
链接: https://arxiv.org/abs/2602.22908
作者: Liangwei Wang,Zhengxuan Zhang,Yifan Cao,Fugee Tsung,Yuyu Luo
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州) ); The Hong Kong University of Science and Technology(香港科技大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Data tables play a central role in scientific papers. However, their meaning is often co-constructed with surrounding text through narrative interplay, making comprehension cognitively demanding for readers. In this work, we explore how interfaces can better support this reading process. We conducted a formative study that revealed key characteristics of text-table narrative interplay, including linking mechanisms, multi-granularity alignments, and mention typologies, as well as a layered framework of readers’ intents. Informed by these insights, we present TableTale, an augmented reading interface that enriches text with data tables at multiple granularities, including paragraphs, sentences, and mentions. TableTale automatically constructs a document-level linking schema within the paper and progressively renders cascade visual cues on text and tables that unfold as readers move through the text. A within-subject study with 24 participants showed that TableTale reduced cognitive workload and improved reading efficiency, demonstrating its potential to enhance paper reading and inform future reading interface design.
[HC-9] InfoAlign: A Human-AI Co-Creation System for Storytelling with Infographics
【速读】:该论文旨在解决现有信息图(infographic)设计工具在创作过程中缺乏对故事连贯性(story consistency)维护及与用户故事目标对齐的问题。其核心解决方案是提出一种以叙事为中心的工作流(narrative-centric workflow),包含故事构建、视觉编码和空间布局三个阶段,并基于此开发了InfoAlign系统——一个支持人机协同创作的工具,能够将长文本或非结构化文本转化为结构化故事,推荐语义一致的视觉设计方案,并生成布局蓝图,同时允许用户在任意阶段介入调整,从而保障设计意图的完整性与创作过程的透明度。
链接: https://arxiv.org/abs/2602.22901
作者: Jielin Feng,Xinwu Ye,Qianhui Li,Verena Ingrid Prantl,Jun-Hsiang Yao,Yuheng Zhao,Yun Wang,Siming Chen
机构: Fudan University (复旦大学); The University of Hong Kong (香港大学); University of Vienna (维也纳大学); Microsoft Research Asia (微软亚洲研究院)
类目: Human-Computer Interaction (cs.HC)
备注: 20 pages, 8 figures, 1 table
Abstract:Storytelling infographics are a powerful medium for communicating data-driven stories through visual presentation. However, existing authoring tools lack support for maintaining story consistency and aligning with users’ story goals throughout the design process. To address this gap, we conducted formative interviews and a quantitative analysis to identify design needs and common story-informed layout patterns in infographics. Based on these insights, we propose a narrative-centric workflow for infographic creation consisting of three phases: story construction, visual encoding, and spatial composition. Building on this workflow, we developed InfoAlign, a human-AI co-creation system that transforms long or unstructured text into stories, recommends semantically aligned visual designs, and generates layout blueprints. Users can intervene and refine the design at any stage, ensuring their intent is preserved and the infographic creation process remains transparent. Evaluations show that InfoAlign preserves story coherence across authoring stages and effectively supports human-AI co-creation for storytelling infographic design.
[HC-10] hey Think AI Can Do More Than It Actually Can: Practices Challenges Opportunities of AI-Supported Reporting In Local Journalism
【速读】:该论文旨在解决本地新闻机构在报纸收入下降背景下,如何有效利用人工智能(AI)支持数据驱动报道的问题,以提升新闻生产效率并维持社区信息供给。其关键解决方案在于通过21名德国本地记者的半结构化访谈,从社会技术视角出发,深入理解记者对数据与AI的实际使用情况、面临挑战及对AI赋能报道的潜在机会认知,并基于此提出针对性改进策略,从而推动AI支持的本地新闻报道系统设计更贴合实践需求。
链接: https://arxiv.org/abs/2602.22887
作者: Besjon Cifliku,Hendrik Heuer
机构: Center For Advanced Internet Studies (CAIS)(高级互联网研究中心); University of Wuppertal(伍珀塔尔大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Conditionally Accepted CHI’26 (CHI26) SIGCHI ACM
Abstract:Declining newspaper revenues prompt local newsrooms to adopt automation to maintain efficiency and keep the community informed. However, current research provides a limited understanding of how local journalists work with digital data and which newsroom processes would benefit most from AI-supported (data) reporting. To bridge this gap, we conducted 21 semi-structured interviews with local journalists in Germany. Our study investigates how local journalists use data and AI (RQ1); the challenges they encounter when interacting with data and AI (RQ2); and the self-perceived opportunities of AI-supported reporting systems through the lens of discursive design (RQ3). Our findings reveal that local journalists do not fully leverage AI’s potential to support data-related work. Despite local journalists’ limited awareness of AI’s capabilities, they are willing to use it to process data and discover stories. Finally, we provide recommendations for improving AI-supported reporting in the context of local news, grounded in the journalists’ socio-technical perspective and their imagined AI future capabilities.
[HC-11] When Should an AI Act? A Human-Centered Model of Scene Context and Behavior for Agent ic AI Design
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在主动干预用户时缺乏原则性判断的问题,即系统难以确定何时、为何以及是否应当采取行动。其解决方案的关键在于提出一个整合“场景(Scene)”、“情境(Context)”与“人类行为因素(Human Behavior Factors)”的三元概念模型,将可观测的外部环境与用户主观建构的意义区分开来,并解释相同场景如何因情境差异而产生不同的行为意义与结果。基于此模型,进一步提炼出五项代理设计原则(行为对齐、情境敏感性、时间适当性、动机校准和代理保留),为干预的深度、时机、强度与克制提供可操作的设计指导,从而实现具有情境敏感性和判断力的智能体交互设计。
链接: https://arxiv.org/abs/2602.22814
作者: Soyoung Jung,Daehoo Yoon,Sung Gyu Koh,Young Hwan Kim,Yehan Ahn,Sung Park
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Agentic AI increasingly intervenes proactively by inferring users’ situations from contextual data yet often fails for lack of principled judgment about when, why, and whether to act. We address this gap by proposing a conceptual model that reframes behavior as an interpretive outcome integrating Scene (observable situation), Context (user-constructed meaning), and Human Behavior Factors (determinants shaping behavioral likelihood). Grounded in multidisciplinary perspectives across the humanities, social sciences, HCI, and engineering, the model separates what is observable from what is meaningful to the user and explains how the same scene can yield different behavioral meanings and outcomes. To translate this lens into design action, we derive five agent design principles (behavioral alignment, contextual sensitivity, temporal appropriateness, motivational calibration, and agency preservation) that guide intervention depth, timing, intensity, and restraint. Together, the model and principles provide a foundation for designing agentic AI systems that act with contextual sensitivity and judgment in interactions.
[HC-12] Input-Envelope-Output: Auditable Generative Music Rewards in Sensory-Sensitive Contexts
【速读】:该论文旨在解决感官敏感情境下生成式反馈(generative feedback)的设计难题,尤其聚焦于自闭症谱系障碍(autism spectrum disorder, ASD)中个体间显著差异的听觉敏感性问题——如何在不牺牲安全性的前提下维持用户参与度。现有交互式音乐系统通常通过隐式编码的方式将安全性嵌入直接的输入-输出(Input-Output, I-O)映射中,虽能保持新颖性但导致系统行为难以预测与审计。论文提出一种“约束优先”的输入-包络-输出(Input-Envelope-Output, I-E-O)框架,其关键在于引入一个位于用户输入与音频输出之间的低风险包络层(envelope layer),明确界定并确定性地执行安全边界,同时记录干预日志以支持可审计性,从而在保障安全的同时保留动作与输出之间的因果关系。
链接: https://arxiv.org/abs/2602.22813
作者: Cong Ye,Songlin Shang,Xiaoxu Ma,Xiangbo Zhang
机构: Wenzhou-Kean University (温州肯恩大学); University of Minnesota (明尼苏达大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC)
备注: 7 pages, 3 figures. Accepted to CHI EA '26 (Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems), Barcelona, Spain
Abstract:Generative feedback in sensory-sensitive contexts poses a core design challenge: large individual differences in sensory tolerance make it difficult to sustain engagement without compromising safety. This tension is exemplified in autism spectrum disorder (ASD), where auditory sensitivities are common yet highly heterogeneous. Existing interactive music systems typically encode safety implicitly within direct input-output (I-O) mappings, which can preserve novelty but make system behavior hard to predict or audit. We instead propose a constraint-first Input-Envelope-Output (I-E-O) framework that makes safety explicit and verifiable while preserving action-output causality. I-E-O introduces a low-risk envelope layer between user input and audio output to specify safe bounds, enforce them deterministically, and log interventions for audit. From this architecture, we derive four verifiable design principles and instantiate them in MusiBubbles, a web-based prototype. Contributions include the I-E-O architecture, MusiBubbles as an exemplar implementation, and a reproducibility package to support adoption in ASD and other sensory-sensitive domains.
[HC-13] herapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
【速读】:该论文旨在解决当前心理健康聊天机器人(mental health chatbots)在安全评估中忽视“关系安全性”(relational safety)的问题,即如何设计能够保障对话过程中长期互动质量的机制,而非仅关注单轮回复的正确性。其解决方案的关键在于提出了一种名为TherapyProbe的设计探针方法(design probe methodology),通过对抗性多智能体模拟系统地探索聊天机器人的对话轨迹,识别出诸如“验证螺旋”(validation spirals)和“共情疲劳”(empathy fatigue)等关系安全性失效模式,并将其归纳为包含23种失败原型的安全模式库(Safety Pattern Library),从而为开发者、临床人员及政策制定者提供可操作的设计建议。
链接: https://arxiv.org/abs/2602.22775
作者: Joydeep Chandra,Satyam Kumar Navneet,Yong Zhang
机构: BNRIST (北京神经科学研究所); Tsinghua University (清华大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety the quality of interaction patterns that unfold across conversations rather than the correctness of individual responses? Current safety evaluations assess single-turn crisis responses, missing the therapeutic dynamics that determine whether chatbots help or harm over time. We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation. Using open-source models, TherapyProbe surfaces relational safety failures interaction patterns like “validation spirals” where chatbots progressively reinforce hopelessness, or “empathy fatigue” where responses become mechanical over turns. Our contribution is translating these failures into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically-grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.
[HC-14] An AI-Based Structured Semantic Control Model for Stable and Coherent Dynamic Interactive Content Generation
【速读】:该论文旨在解决生成式模型在复杂交互场景中难以平衡灵活性、稳定性和可控性的问题。其解决方案的关键在于提出了一种可控生成框架,通过构建结构化的语义状态空间,将用户输入、环境条件和历史上下文编码为可操作的潜在表示,并生成方向性控制向量以引导内容生成过程;同时引入多层级约束机制,包括语义一致性约束、结构稳定性约束和语义漂移惩罚,从而确保动态环境中内容的语义清晰路径与逻辑连贯性,防止内容偏离、语气不稳定或结构断裂。
链接: https://arxiv.org/abs/2602.22762
作者: Rui Liu
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:This study addresses the challenge that generative models struggle to balance flexibility, stability, and controllability in complex interactive scenarios. It proposes a controllable generation framework for dynamic interactive content construction. The framework builds a structured semantic state space that encodes user input, environmental conditions, and historical context into actionable latent representations and generates directional control vectors to guide the content generation process. It introduces multilevel constraints, including semantic consistency constraints, structural stability constraints, and semantic drift penalties, which help the model maintain clear semantic paths and coherent logic in dynamic environments. These constraints prevent content deviation, unstable tone, or structural breaks. Based on these components, the study designs a systematic controllable generation pipeline in which semantic modeling, control signals, and generation strategies work together within one framework. Sensitivity analyses on control vector dimension, hidden layer size, noise intensity, and training sample scale are conducted on a public dialogue dataset to validate the framework. The results show that the approach improves semantic structure, contextual consistency, and controllable expression, providing a structured and effective solution for interactive content generation.
[HC-15] Simulation-based Optimization for Augmented Reading
【速读】:该论文旨在解决当前增强阅读系统(augmented reading systems)设计中依赖启发式规则、黑箱数据驱动模型或频繁人工参与设计循环的问题,这些问题限制了系统的适应性、可解释性和扩展性。解决方案的关键在于将增强阅读建模为基于资源理性(resource-rational)人类阅读模型的仿真优化问题,通过构建模拟读者来系统评估文本用户界面;并提出两种互补的优化流程:离线管道利用模拟读者探索设计选项,在线管道则基于实时交互数据个性化阅读界面,从而实现无需完全依赖人工测试的自适应、可解释且可扩展的增强阅读设计。
链接: https://arxiv.org/abs/2602.22735
作者: Yunpeng Bai,Shengdong Zhao,Antti Oulasvirta
机构: National University of Singapore (新加坡国立大学); City University of Hong Kong (香港城市大学); Aalto University (阿尔托大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Augmented reading systems aim to adapt text presentation to improve comprehension and task performance, yet existing approaches rely heavily on heuristics, opaque data-driven models, or repeated human involvement in the design loop. We propose framing augmented reading as a simulation-based optimization problem grounded in resource-rational models of human reading. These models instantiate a simulated reader that allocates limited cognitive resources, such as attention, memory, and time under task demands, enabling systematic evaluation of text user interfaces. We introduce two complementary optimization pipelines: an offline approach that explores design alternatives using simulated readers, and an online approach that personalizes reading interfaces in real time using ongoing interaction data. Together, this perspective enables adaptive, explainable, and scalable augmented reading design without relying solely on human testing.
[HC-16] Same Words Different Judgments: Modality Effects on Preference Alignment INTERSPEECH2026
【速读】:该论文旨在解决生成式 AI(Generative AI)在语音模态中缺乏有效对齐方法的问题,特别是如何通过偏好标注(preference annotation)实现语音模型与人类偏好的对齐。其关键解决方案在于开展一项受控的跨模态研究,系统比较了人类和合成偏好标注在文本与音频两种模态下对相同语义内容的评估表现,发现音频偏好具有与文本相当的可靠性(ICC ≈ 0.80),且音频评价表现出更窄的决策阈值、更低的长度偏差及更贴近用户导向的判断标准;同时,合成标注不仅与人类判断高度一致,还能预测人与人之间的评分一致性,从而支持其用于筛选模糊样本或作为人类标注的完整替代方案。
链接: https://arxiv.org/abs/2602.22710
作者: Aaron Broukhim,Nadir Weibel,Eshin Jolly
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Submitted to Interspeech 2026 for review
Abstract:Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored. We present a controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. Audio preferences prove as reliable as text, with inter-rater agreement reaching good levels (ICC(2,k) \approx .80) at \sim 9 raters – the first ICC-based reliability characterization in the preference annotation literature for either modality. However, modality reshapes how people judge: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. Synthetic ratings further align with human judgments and predict inter-rater agreement, supporting their use both for triaging ambiguous pairs and as full replacements for human annotations.
[HC-17] CoLyricist: Enhancing Lyric Writing with AI through Workflow-Aligned Support
【速读】:该论文旨在解决现有AI辅助歌词创作工具未能充分适配专业词作者典型工作流程的问题,导致其难以融入实际创作实践。研究表明,词作者普遍经历四个关键阶段:主题设定(Theme Setting)、创意生成(Ideation)、歌词初稿撰写(Drafting Lyrics)和旋律匹配(Melody Fitting),而现有工具缺乏对这些阶段及其挑战的深入理解,因而设计效果不佳。解决方案的关键在于提出CoLyricist——一个针对上述四阶段定制化开发的AI辅助歌词写作工具,通过在每个阶段提供针对性的AI支持,显著提升创作效率与体验,并在用户研究中验证了其对不同技能水平用户的普适性价值。
链接: https://arxiv.org/abs/2602.22606
作者: Masahiro Yoshida,Bingxuan Li,Songyan Zhao,Qinyi Zhou,Shiwei Hu,Xiang Anthony Chen,Nanyun Peng
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose CoLyricist, an AI-assisted lyric writing tool designed to support the typical workflows of experienced lyricists and enhance their creative efficiency. While lyricists have unique processes, many follow common stages. Tools that fail to accommodate these stages challenge integration into creative practices. Existing research and tools lack sufficient understanding of these songwriting stages and their associated challenges, resulting in ineffective designs. Through a formative study involving semi-structured interviews with 10 experienced lyricists, we identified four key stages: Theme Setting, Ideation, Drafting Lyrics, and Melody Fitting. CoLyricist addresses these needs by incorporating tailored AI-driven support for each stage, optimizing the lyric writing process to be more seamless and efficient. To examine whether this workflow-aligned design also benefits those without prior experience, we conducted a user study with 16 participants, including both experienced and novice lyricists. Results showed that CoLyricist enhances the songwriting experience across skill levels. Novice users especially appreciated the Melody-Fitting feature, while experienced users valued the Ideation support.
[HC-18] DuoMorph: Synergistic Integration of FDM Printing and Pneumatic Actuation for Shape-Changing Interfaces
【速读】:该论文旨在解决传统形状变换界面设计中功能受限、制造流程复杂的问题,尤其是如何实现结构与驱动机制的高效协同。其解决方案的关键在于提出DuoMorph方法,通过融合熔融沉积成型(Fused Deposition Modeling, FDM)打印与气动驱动(pneumatic actuation)技术,使打印结构与热封气动元件在设计阶段即相互约束与协同作用,从而实现单一组件难以达成的功能;同时,整个混合结构可通过标准FDM打印机完成一体化制造,涵盖热封、3D和4D打印,显著简化工艺流程并提升可扩展性。
链接: https://arxiv.org/abs/2602.22604
作者: Xueqing Li,Danqi huang,Tianyu Yu,Shuzi Yin,Bingjie Gao,Anna Matsumoto,Zhihao Yao,Yiwei Zhao,Shiqing Lyu,Yuchen Tian,Lining Yao,Haipeng Mi,Qiuyu Lu
机构: Tsinghua University (清华大学); The Hong Kong Polytechnic University (香港理工大学); University of California, Berkeley (加州大学伯克利分校); Stanford University (斯坦福大学); Interbeing Lab, School of Design, The Hong Kong Polytechnic University (香港理工大学设计学院互为主体实验室)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:We introduce DuoMorph, a design and fabrication method that synergistically integrates Fused Deposition Modeling (FDM) printing and pneumatic actuation to create novel shape-changing interfaces. In DuoMorph, the printed structures and heat-sealed pneumatic elements are mutually designed to actuate and constrain each other, enabling functions that are difficult for either component to achieve in isolation. Moreover, the entire hybrid structure can be fabricated through a single, seamless process using only a standard FDM printer, including both heat-sealing and 3D and 4D printing. In this paper, we define a design space including four primitive categories that capture the fundamental ways in which printed and pneumatic components can interact. To support this process, we present a fabrication method and an accompanying design tool. Finally, we demonstrate the potential of DuoMorph through a series of example applications and performance demonstrations.
[HC-19] Addressing Climate Action Misperceptions with Generative AI
【速读】:该论文旨在解决气候关切个体在减排行为选择上普遍存在认知偏差的问题,即他们常对不同气候行动的碳减排效果存在误判。解决方案的关键在于利用具备气候知识并被提示提供个性化回应的大语言模型(Large Language Model, LLM),通过定制化交互提升参与者对气候行动影响的认知准确性,并增强其采纳高影响力可行行为的意愿。实证结果表明,相较于网络搜索、通用LLM对话或无干预组,个性化气候LLM显著提升了知识水平和行为意图,凸显了LLM在提供精准、可操作建议方面的潜力,从而更有效地推动实质性低碳行为改变。
链接: https://arxiv.org/abs/2602.22564
作者: Miriam Remshard,Yara Kyrychenko,Sander van der Linden,Matthew H. Goldberg,Anthony Leiserowitz,Elena Savoia,Jon Roozenbeek
机构: University of Cambridge (剑桥大学); Yale University (耶鲁大学); Harvard T.H. Chan School of Public Health (哈佛大学陈-冯公共卫生学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 11 pages; 2 figures; for study materials, data and supplement, see this https URL
Abstract:Mitigating climate change requires behaviour change. However, even climate-concerned individuals often hold misperceptions about which actions most reduce carbon emissions. We recruited 1201 climate-concerned individuals to examine whether discussing climate actions with a large language model (LLM) equipped with climate knowledge and prompted to provide personalised responses would foster more accurate perceptions of the impacts of climate actions and increase willingness to adopt feasible, high-impact behaviours. We compared this to having participants run a web search, have a conversation with an unspecialised LLM, and no intervention. The personalised climate LLM was the only condition that led to increased knowledge about the impacts of climate actions and greater intentions to adopt impactful behaviours. While the personalised climate LLM did not outperform a web search in improving understanding of climate action impacts, the ability of LLMs to deliver personalised, actionable guidance may make them more effective at motivating impactful pro-climate behaviour change.
[HC-20] Relational Appliances: A Robot in the Refrigerator for Home-Based Health Promotion
【速读】:该论文试图解决的问题是如何在家庭环境中通过日常家电(如冰箱)实现有效的健康促进,尤其是在饮食决策这一关键场景中。现有研究多忽视了厨房电器作为健康干预载体的潜力,而本文提出“关系型家电”(relational appliances)的概念,即设计具有拟人化特征、能与用户进行持续个性化互动的家用设备,以增强用户的健康行为改变意愿。其解决方案的关键在于利用冰箱固有的传感器丰富环境、私密交互空间和与食物紧密耦合的特点,部署一个拟人化机器人头部,在用户取用零食时开展情境化的对话式干预;实验结果显示,参与者对机器人表现出高度接受度和信任感,且健康推荐得到较高遵从率,表明通过拟人化设计建立长期人机关系可有效推动居家健康行为干预。
链接: https://arxiv.org/abs/2602.22542
作者: Timothy Bickmore,Mehdi Arjmand,Yunus Terzioglu
机构: Northeastern University (东北大学)
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:
Abstract:Kitchen appliances are frequently used domestic artifacts situated at the point of everyday dietary decision making, making them a promising but underexplored site for health promotion. We explore the concept of relational appliances: everyday household devices designed as embodied social actors that engage users through ongoing, personalized interaction. We focus on the refrigerator, whose unique affordances, including a fixed, sensor-rich environment, private interaction space, and close coupling to food items, support contextualized, conversational engagement during snack choices. We present an initial exploration of this concept through a pilot study deploying an anthropomorphic robotic head inside a household refrigerator. In a home-lab apartment, participants repeatedly retrieved snacks during simulated TV “commercial breaks” while interacting with a human-sized robotic head. Participants were randomized to either a health-promotion condition, in which the robot made healthy snack recommendations, or a social-chat control condition. Outcomes included compliance with recommendations, nutritional quality of selected snacks, and psychosocial measures related to acceptance of the robot. Results suggest that participants found the robot persuasive, socially engaging, and increasingly natural over time, often describing it as helpful, aware, and companionable. Most participants reported greater awareness of their snack decisions and expressed interest in having such a robot in their own home. We discuss implications for designing relational appliances that leverage anthropomorphism, trust, and long-term human-technology relationships for home-based health promotion.
[HC-21] Skewed Dual Normal Distribution Model: Predicting 1D Touch Pointing Success Rate for Targets Near Screen Edges
【速读】:该论文旨在解决典型点击成功率预测模型在靠近屏幕边缘的目标上失效的问题,这类目标常因设计约束或滚动界面中元素位置变化而出现。传统模型未考虑边缘对触控分布的影响,导致预测不准确。其解决方案的关键在于提出偏斜双正态分布模型(Skewed Dual Normal Distribution Model),该模型假设边缘会扭曲点击坐标的分布形态,使峰值向边缘偏移、尾部向外延伸,并能准确预测包括边缘邻近目标在内的全屏范围内的成功概率,从而提升用户界面(UI)设计支持工具的适用性与精度。
链接: https://arxiv.org/abs/2602.22454
作者: Nobuhito Kasahara,Shota Yamanaka,Homei Miyashita
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: To appear at CHI 2026
Abstract:Typical success-rate prediction models for tapping exclude targets near screen edges; however, design constraints often force such placements. Additionally, in scrollable UIs any element can move close to an edge. In this work, we model how target–edge distance affects 1D touch pointing accuracy. We propose the Skewed Dual Normal Distribution Model, which assumes the tap coordinate distribution is skewed by a nearby edge. The results of two smartphone experiments showed that, as targets approached the edge, the distribution’s peak shifted toward the edge and its tail extended away. In contrast to prior reports, the success rate improved when the target touched the edge, suggesting a strategy of ``tapping the target together with the edge.‘’ By accounting for skew, our model predicts success rates across a wide range of conditions, including edge-adjacent targets, thus extending coverage to the whole screen and informing UI design support tools.
[HC-22] he Way We Notice Thats What Really Matters: Instantiating UI Components with Distinguishing Variations
【速读】:该论文旨在解决前端开发中UI组件实例化困难的问题,即开发者在参数化设计时需面对大量属性值及其交互关系,难以高效探索组件的设计空间并提供真实、自然的属性取值。解决方案的关键在于引入“区分性变异(distinguishing variations)”,即既具代表性又具差异性的组件实例;其生成方法结合符号推理识别视觉关键属性与基于大语言模型(LLM)的模拟采样器,利用世界知识生成符合领域语境的实例,从而将组件实例化从手动过程转变为结构化的探索活动。
链接: https://arxiv.org/abs/2602.22436
作者: Priyan Vaithilingam,Alan Leung,Jeffrey Nichols,Titus Barik
机构: Apple(苹果)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Front-end developers author UI components to be broadly reusable by parameterizing visual and behavioral properties. While flexible, this makes instantiation harder, as developers must reason about numerous property values and interactions. In practice, they must explore the component’s large design space and provide realistic and natural values to properties. To address this, we introduce distinguishing variations: variations that are both mimetic and distinct. We frame distinguishing variation generation as design-space sampling, combining symbolic inference to identify visually important properties with an LLM-driven mimetic sampler to produce realistic instantiations from its world knowledge. We instantiate distinguishing variations in Celestial, a tool that helps developers explore and visualize distinguishing variations. In a study with front-end developers (n=12), participants found these variations useful for comparing and mapping component design spaces, reported that mimetic instantiations were domain-relevant, and validated that Celestial transformed component instantiation from a manual process into a structured, exploratory activity.
[HC-23] Seeing Graphs Like Humans: Benchmarking Computational Measures and MLLM s for Similarity Assessment
【速读】:该论文旨在解决当前视觉分析系统中图形相似性度量与人类主观视觉感知之间存在不一致的问题,这种不一致性可能导致分析建议与用户直觉冲突,增加认知负荷而非降低。解决方案的关键在于利用多模态大语言模型(Multimodal Large Language Models, MLLMs)作为人类感知的代理指标,通过三组实验验证其在图形相似性判断上的有效性:首先收集32名参与者对1,881个节点-边图的相对相似性判断及理由,发现人类更关注全局形状和边密度;其次对比16种传统计算度量,确认Portrait divergence表现最优但仅具中等一致性;最后评估GPT-5、Gemini 2.5 Pro和Claude Sonnet 4.5三种前沿MLLMs,结果表明MLLMs尤其是GPT-5显著优于传统方法,且能提供可解释的推理过程,从而实现更贴近人类感知的智能引导。
链接: https://arxiv.org/abs/2602.22416
作者: Seokweon Jung,Jeongmin Rhee,Seoyoung Doh,Hyeon Jeon,Ghulam Jilani Quadri,Jinwook Seo
机构: Korea Advanced Institute of Science and Technology (KAIST); Seoul National University; University of Oklahoma
类目: Human-Computer Interaction (cs.HC)
备注: 21 pages including 1 page of appendix, 9 figures, 4 tables
Abstract:Comparing graphs to identify similarities is a fundamental task in visual analytics of graph data. To support this, visual analytics systems frequently employ quantitative computational measures to provide automated guidance. However, it remains unclear how well these measures align with subjective human visual perception, thereby offering recommendations that conflict with analysts’ intuitive judgments, potentially leading to confusion rather than reducing cognitive load. Multimodal Large Language Models (MLLMs), capable of visually interpreting graphs and explaining their reasoning in natural language, have emerged as a potential alternative to address this challenge. This paper bridges the gap between human and machine assessment of graph similarity through three interconnected experiments using a dataset of 1,881 node-link diagrams. Experiment 1 collects relative similarity judgments and rationales from 32 human participants, revealing consensus on graph similarity while prioritizing global shapes and edge densities over exact topological details. Experiment 2 benchmarks 16 computational measures against these human judgments, identifying Portrait divergence as the best-performing metric, though with only moderate alignment. Experiment 3 evaluates the potential of three state-of-the-art MLLMs (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5) as perceptual proxies. The results demonstrate that MLLMs, particularly GPT-5, significantly outperform traditional measures in aligning with human graph similarity perception and provide interpretable rationales for their decisions, whereas Claude Sonnet 4.5 shows the best computational efficiency. Our findings suggest that MLLMs hold significant promise not only as effective, explainable proxies for human perception but also as intelligent guides that can uncover subtle nuances that might be overlooked by human analysts in visual analytics systems.
[HC-24] A Learning-Based Hybrid Decision Framework for Matching Systems with User Departure Detection
【速读】:该论文旨在解决匹配市场(如肾脏交换和货运交易)中固定匹配策略在动态环境中灵活性不足的问题,尤其关注延迟匹配对整体市场效率的影响。其核心挑战在于如何权衡延迟匹配带来的效率提升与由此引发的等待时间延长及市场拥堵等成本。解决方案的关键在于提出一种基于学习的混合(Hybrid)框架,该框架通过持续收集用户退出数据并利用回归方法估计潜在的退出分布,进而根据一个决策阈值动态判断是否延迟匹配,从而在保持有限匹配效率损失的前提下显著降低等待时间和市场拥堵,实现贪婪型与耐心型匹配策略之间的灵活过渡。
链接: https://arxiv.org/abs/2602.22412
作者: Ruiqi Zhou,Donghao Zhu,Houcai Shen
机构: 未知
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Information Theory (cs.IT); General Economics (econ.GN)
备注: Accepted at HCII 2026
Abstract:In matching markets such as kidney exchanges and freight exchanges, delayed matching has been shown to improve overall market efficiency. The benefits of delay are highly sensitive to participants’ sojourn times and departure behavior, and delaying matches can impose significant costs, including longer waiting times and increased market congestion. These competing effects make fixed matching policies inherently inflexible in dynamic environments. We propose a learning-based Hybrid framework that adaptively combines immediate and delayed matching. The framework continuously collects data on user departures over time, estimates the underlying departure distribution via regression, and determines whether to delay matching in the subsequent period based on a decision threshold that governs the system’s tolerance for matching efficiency loss. The proposed framework can substantially reduce waiting times and congestion while sacrificing only a limited amount of matching efficiency. By dynamically adjusting its matching strategy, the Hybrid framework enables system performance to flexibly interpolate between purely greedy and purely patient policies, offering a robust and adaptive alternative to static matching mechanisms.
[HC-25] Contextual Memory Virtualisation: DAG-Based State Management and Structurally Lossless Trimming for LLM Agents
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在长时间推理任务中因上下文窗口限制而导致的累积知识丢失问题,即模型在会话过程中形成的架构映射、权衡决策和代码库约定等上下文状态无法有效保留。其解决方案的核心是提出上下文记忆虚拟化(Contextual Memory Virtualisation, CMV),将LLM的会话历史建模为有向无环图(Directed Acyclic Graph, DAG),并引入版本控制语义下的快照(snapshot)、分支(branch)和修剪(trim)原语,实现跨独立并行会话的上下文复用;同时设计了一种三阶段结构无损修剪算法,在保持所有用户消息与助手回复原文不变的前提下,通过移除工具输出、Base64图像和元数据等机械冗余内容,平均降低20%、最高达86%的token消耗,从而显著提升长会话效率与经济性。
链接: https://arxiv.org/abs/2602.22402
作者: Cosmo Santoni
机构: Imperial College London (帝国理工学院)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Operating Systems (cs.OS)
备注: 11 pages. 6 figures. Introduces a DAG-based state management system for LLM agents. Evaluation on 76 coding sessions shows up to 86% token reduction (mean 20%) while remaining economically viable under prompt caching. Includes reference implementation for Claude Code
Abstract:As large language models engage in extended reasoning tasks, they accumulate significant state – architectural mappings, trade-off decisions, codebase conventions – within the context window. This understanding is lost when sessions reach context limits and undergo lossy compaction. We propose Contextual Memory Virtualisation (CMV), a system that treats accumulated LLM understanding as version-controlled state. Borrowing from operating system virtual memory, CMV models session history as a Directed Acyclic Graph (DAG) with formally defined snapshot, branch, and trim primitives that enable context reuse across independent parallel sessions. We introduce a three-pass structurally lossless trimming algorithm that preserves every user message and assistant response verbatim while reducing token counts by a mean of 20% and up to 86% for sessions with significant overhead by stripping mechanical bloat such as raw tool outputs, base64 images, and metadata. A single-user case-study evaluation across 76 real-world coding sessions demonstrates that trimming remains economically viable under prompt caching, with the strongest gains in mixed tool-use sessions, which average 39% reduction and reach break-even within 10 turns. A reference implementation is available at this https URL.
[HC-26] Vibe Researching as Wolf Coming: Can AI Agents with Skills Replace or Augment Social Scientists?
【速读】:该论文旨在解决人工智能代理(AI agents)在社会科学领域中如何重构研究流程的问题,特别是如何界定人类研究人员与AI代理之间的协作边界。其核心挑战在于识别哪些研究活动适合由AI代理执行,而哪些仍需人类主导,以避免对理论原创性和领域内隐性知识(tacit knowledge)的削弱。解决方案的关键在于提出一个基于“可编码性”(codifiability)与“隐性知识需求”(tacit knowledge requirement)双维度的认知任务框架,从而定义出贯穿整个研究流程的“委托边界”(delegation boundary),而非传统的阶段间分割;该框架揭示了AI代理在速度、覆盖范围和方法论支撑上的优势,同时指出其在理论创新和领域专长方面的局限性,并据此提出五项负责任的“ vibe researching”(vibe coding的学术平行概念)原则,以指导未来人机协同研究实践。
链接: https://arxiv.org/abs/2602.22401
作者: Yongjun Zhang
机构: Stony Brook University (石溪大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Commentary
Abstract:AI agents – systems that execute multi-step reasoning workflows with persistent state, tool access, and specialist skills – represent a qualitative shift from prior automation technologies in social science. Unlike chatbots that respond to isolated queries, AI agents can now read files, run code, query databases, search the web, and invoke domain-specific skills to execute entire research pipelines autonomously. This paper introduces the concept of vibe researching – the AI-era parallel to ``vibe coding’’ (Karpathy, 2025) – and uses scholar-skill, a 21-skill plugin for Claude Code covering the full research pipeline from idea to submission, as an illustrative case. I develop a cognitive task framework that classifies research activities along two dimensions – codifiability and tacit knowledge requirement – to identify a delegation boundary that is cognitive, not sequential: it cuts through every stage of the research pipeline, not between stages. I argue that AI agents excel at speed, coverage, and methodological scaffolding but struggle with theoretical originality and tacit field knowledge. The paper concludes with an analysis of three implications for the profession – augmentation with fragile conditions, stratification risk, and a pedagogical crisis – and proposes five principles for responsible vibe researching.
[HC-27] VoiceAlign: A Shimming Layer for Enhancing the Usability of Legacy Voice User Interface Systems
【速读】:该论文旨在解决当前主流操作系统中语音用户接口(Voice User Interface, VUI)因固定命令格式、严格超时机制和反馈不足等问题导致的使用效率低下与用户体验差的问题。其核心解决方案是提出一个名为 VoiceAlign 的自适应适配层,该层通过拦截自然语音指令,利用大语言模型(Large Language Model, LLM)将其转换为兼容现有 VUI 系统的语法,并通过虚拟音频通道透明传输,从而在不修改底层系统的情况下显著提升交互成功率与效率。关键创新在于结合本地部署的小型微调语言模型(<200ms 响应时间),实现低延迟、高准确率(>90%)的实时语音指令映射,避免对第三方 API 的依赖,使老旧 VUI 系统获得接近现代生成式 AI 的灵活性与鲁棒性。
链接: https://arxiv.org/abs/2602.22374
作者: Md Ehtesham-Ul-Haque,Syed Masum Billah
机构: Pennsylvania State University (宾夕法尼亚州立大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at IUI’26
Abstract:Voice user interfaces (VUIs) are rapidly transitioning from accessibility features to mainstream interaction modalities. Yet most operating systems’ built-in voice commands remain underutilized despite possessing robust technical capabilities. Through our analysis of four commercial VUI systems and a formative study with 16 participants, we found that fixed command formats require exact phrasing, restrictive timeout mechanisms discard input during planning pauses, and insufficient feedback hampers multi-step interactions. To address these challenges, we developed VoiceAlign, an adaptive shimming layer that mediates between users and legacy VUI systems. VoiceAlign intercepts natural voice commands, transforms them to match the required syntax using a large language model, and transmits these adapted commands through a virtual audio channel that remains transparent to the underlying system. In our evaluation with 12 participants, VoiceAlign reduced command failures by half, required 25% fewer commands per task, and significantly lowered cognitive and temporal demands when paired with an existing legacy VUI system. Furthermore, we created a synthetic dataset informed by our studies and fine-tuned a small language model that achieves over 90% accuracy with 200 ms response time when served locally, eliminating dependence on third-party APIs while enabling real-time interaction on edge devices. This work demonstrates how modern AI techniques can unlock the underutilized potential of legacy VUI systems without requiring system modifications, offering a practical solution without replacing existing infrastructure.
[HC-28] E3VA: Enhancing Emotional Expressiveness in Virtual Conversational Agents
【速读】:该论文试图解决当前基于生成式AI(Generative AI)的具身对话代理在在线交互中情感表达能力有限的问题,这导致其难以根据用户情绪调整响应,从而影响用户体验与参与度。解决方案的关键在于引入情感分析(Sentiment Analysis)与自然语言处理(Natural Language Processing, NLP)技术,使虚拟对话代理能够识别并理解用户情绪,并据此生成具有同理心和情感表达力的回应,从而提升交互的适应性、可用性和整体质量。
链接: https://arxiv.org/abs/2602.22362
作者: Abhishek Kulkarni,Alexander Barquero,Pavitra Lahari,Aryaan Shaikh,Sarah Brown
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages
Abstract:With the advent of generative AI and large language models, embodied conversational agents are becoming synonymous with online interactions. These agents possess vast amounts of knowledge but suffer from exhibiting limited emotional expressiveness. Without adequate expressions, agents might fail to adapt to users’ emotions, which may result in a sub-optimal user experience and engagement. Most current systems prioritize content based responses, neglecting the emotional context of conversations. Research in this space is currently limited to specific contexts, like mental health. To bridge this gap, our project proposes the implementation of expressive features in a virtual conversational agent which will utilize sentiment analysis and natural language processing to inform the generation of empathetic, expressive responses. The project delivers a functional conversational agent capable of assessing and responding to user emotions accordingly. We posit this will enhance usability, engagement, and the overall quality of conversations and present results from an exploratory pilot study investigating the same.
[HC-29] Interface Framework for Human-AI Collaboration within Intelligent User Interface Ecosystems
【速读】:该论文试图解决的问题是:在人机协作界面从静态用户路径向动态人-AI协同演进的过程中,缺乏基于用户需求和任务复杂度选择合适界面模式的标准方法;现有框架仅提供设计AI代理能力的指导原则,未能系统性地指导上下文感知、可扩展的AI界面(即模态,如提示栏、分屏、全屏等)的设计。解决方案的关键在于提出一个由工作流复杂度(workflow complexity)、AI自主性(AI autonomy)和AI推理能力(AI reasoning)三个维度构成的框架,通过与营销产品设计师的共同设计工作坊及对八位长期AI用户的定性研究进行验证,明确了任务与界面之间的映射关系,并强调了在高自主性场景下业务影响与安全风险的重要性,从而为产品团队提供一套共享语言,以实现界面间的流畅切换和渐进式用户控制,平衡AI自主性与人类监督。
链接: https://arxiv.org/abs/2602.22343
作者: Shruthi Andru,Shrut Kirti Saksena
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 22 pages, 8 figures
Abstract:As interfaces evolve from static user pathways to dynamic human-AI collaboration, no standard methods exist for selecting appropriate interface patterns based on user needs and task complexity. Existing frameworks only provide guiding principles for designing AI agent capabilities. We propose a dimensional framework based on workflow complexity, AI autonomy, and AI reasoning to guide the design of context-aware, scalable AI interfaces aka modalities (e.g., prompt bars, split screens, full screens, etc.). The framework was developed through co-design workshops with designers of marketing products and refined through qualitative research with eight long-term AI users. The study evaluated the three dimensions, identified task-to-interface relationships, and surfaced the importance of both business impact and security risk across all high-autonomy scenarios. This framework provides product teams with a shared language to develop scalable AI interfaces, emphasizing fluidity between interfaces and progressive user control to balance AI autonomy with human oversight.
[HC-30] Conversational Successes and Breakdowns in Everyday Non-Display Smart Glasses Use
【速读】:该论文旨在解决如何通过非显示型智能眼镜(Non-Display Smart Glasses)在日常情境中实现高效语音交互的问题,尤其关注对话成功与失败的模式及其对语音界面设计的启示。其解决方案的关键在于开展为期一个月的协作自传式民族志研究(collaborative autoethnography),由两名参与者持续使用此类设备,并系统识别交互中的成功与失败模式,进而与已有语音交互研究进行对比,从而揭示非显示型智能眼镜所特有的交互 affordances(可供性)和设计机会。
链接: https://arxiv.org/abs/2602.22340
作者: Xiuqi Tommy Zhu,Xiaoan Liu,Casper Harteveld,Smit Desai,Eileen McGivney
机构: Northeastern University (东北大学); University of Colorado Boulder (科罗拉多大学博尔德分校)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Non-Display Smart Glasses hold the potential to support everyday activities by combining continuous environmental sensing with voice-only interaction powered by large language models (LLMs). Understanding how conversational successes and breakdowns arise in everyday contexts can better inform the design of future voice-only interfaces. To investigate this, we conducted a month-long collaborative autoethnography (n=2) to identify patterns of successes and breakdowns when using such devices. We then compare these patterns with prior findings on voice-only interactions to highlight the unique affordances and opportunities offered by non-display smart glasses.
计算机视觉
[CV-0] MediX-R1: Open Ended Medical Reinforcement Learning
【速读】:该论文旨在解决当前医学多模态大语言模型(Multimodal Large Language Models, MLLMs)在开放式问答任务中难以生成符合临床实际、自由形式答案的问题,尤其是在传统基于多选题(Multiple-Choice Questions, MCQ)的奖励机制下无法有效评估语义正确性和推理过程的局限性。解决方案的关键在于提出MediX-R1框架——一个面向医学场景的开放式强化学习(Reinforcement Learning, RL)方法,其核心创新包括:(1) 基于分组的强化学习策略(Group Based RL)对基础视觉-语言模型进行微调;(2) 设计复合型奖励函数,包含LLM驱动的准确性奖励(用于严格判断语义正确性)、医学嵌入驱动的语义奖励(捕捉同义表达和术语变体),以及轻量级格式与模态奖励(确保可解释推理和模态识别);(3) 提出统一的评估框架,采用参考文本引导的LLM作为裁判(Reference-based LLM-as-judge),替代脆弱的字符串匹配指标,从而更准确地衡量语义一致性、推理能力和上下文贴合度。该方案通过多信号反馈机制显著提升了模型在开放任务中的表现,尤其在临床开放式问答中取得显著优势。
链接: https://arxiv.org/abs/2602.23363
作者: Sahal Shaji Mullappilly,Mohammed Irfan Kurpath,Omair Mohamed,Mohamed Zidan,Fahad Khan,Salman Khan,Rao Anwer,Hisham Cholakkal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only \sim51 K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at this https URL
[CV-1] VGG-T3: Offline Feed-Forward 3D Reconstruction at Scale CVPR2026
【速读】:该论文旨在解决离线前馈式三维重建方法中存在的计算与内存开销随输入图像数量呈二次增长的问题(即计算复杂度与内存占用与图像数量的平方成正比)。其核心解决方案在于识别出场景几何在键值(Key-Value, KV)空间中的表示长度不固定是导致性能瓶颈的根本原因,并通过测试时训练(Test-Time Training)将这一可变长度的KV表示蒸馏为固定大小的多层感知机(Multi-Layer Perceptron, MLP),从而实现线性时间复杂度的重建。该方法命名为VGG-T³(Visual Geometry Grounded Test Time Training),在保持全局场景聚合能力的同时,显著提升了效率——例如在1000张图像上的重建仅需54秒,较依赖softmax注意力机制的基线方法提速11.6倍,且点云重建误差优于其他线性时间方法。
链接: https://arxiv.org/abs/2602.23361
作者: Sven Elflein,Ruilong Li,Sérgio Agostinho,Zan Gojcic,Laura Leal-Taixé,Qunjie Zhou,Aljosa Osep
机构: NVIDIA(英伟达); Vector Institute(向量研究所); University of Toronto(多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Project page: this https URL
Abstract:We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T ^3 (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a 1k image collection in just 54 seconds, achieving a 11.6\times speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, our point map reconstruction error outperforming other linear-time methods by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
[CV-2] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation CVPR2026
【速读】:该论文旨在解决3D布局条件生成中被忽视的遮挡推理(occlusion reasoning)问题,即如何在生成部分遮挡物体时保持几何与尺度的一致性。现有方法虽能生成符合输入布局的逼真场景,但难以精确建模物体间的遮挡关系。其解决方案的关键在于提出一种显式建模遮挡的3D场景表示方法——遮挡感知3D场景表示(Occlusion-aware 3D Scene Representation, OSCR),该表示将物体建模为透明的3D边界框,并通过渲染视角实现相机控制;同时引入基于渲染结果的视觉token来条件化预训练的基于流的文本到图像生成模型,并利用掩码自注意力机制精准绑定每个物体边界框与其对应的文本描述,从而避免属性混淆,实现具有真实遮挡效果和一致相机视角的多物体生成。
链接: https://arxiv.org/abs/2602.23359
作者: Vaibhav Agrawal,Rishubh Parihar,Pradhaan Bhat,Ravi Kiran Sarvadevabhatla,R. Venkatesh Babu
机构: IIIT Hyderabad (印度信息技术研究所海得拉巴分校); IISc Bengaluru (印度科学研究所班加罗尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL . Accepted at CVPR 2026
Abstract:We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow based text-to-image image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
[CV-3] A Dataset is Worth 1 MB
【速读】:该论文旨在解决大规模数据集在分布式环境中向多个客户端传输时产生的高昂通信成本问题,尤其是在客户端硬件和软件框架异构、无法直接传输预训练模型的情况下,如何高效地实现任务知识迁移。其解决方案的关键在于提出一种名为Pseudo-Labels as Data (PLADA) 的方法,该方法完全消除像素级数据的传输:假设客户端已预加载一个通用的大规模无标签参考数据集(如ImageNet-1K),仅通过传输目标任务中特定图像的类别标签即可完成知识传递;同时引入一种剪枝机制,从参考数据集中筛选出与目标任务语义最相关的样本,从而在最小化传输负载的同时最大化本地训练效率。实验表明,该方法可在小于1 MB的传输 payload 下保持高分类准确率,为高效数据集服务提供了可行路径。
链接: https://arxiv.org/abs/2602.23358
作者: Elad Kimchi Shoshani,Leeyam Gabay,Yedid Hoshen
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 9 figures
Abstract:A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate on diverse hardware and software frameworks, transmitting a pre-trained model is often infeasible; instead, agents require raw data to train their own task-specific models locally. While dataset distillation attempts to compress training signals, current methods struggle to scale to high-resolution data and rarely achieve sufficiently small files. In this paper, we propose Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission. We assume agents are preloaded with a large, generic, unlabeled reference dataset (e.g., ImageNet-1K, ImageNet-21K) and communicate a new task by transmitting only the class labels for specific images. To address the distribution mismatch between the reference and target datasets, we introduce a pruning mechanism that filters the reference dataset to retain only the labels of the most semantically relevant images for the target task. This selection process simultaneously maximizes training efficiency and minimizes transmission payload. Experiments on 10 diverse datasets demonstrate that our approach can transfer task knowledge with a payload of less than 1 MB while retaining high classification accuracy, offering a promising solution for efficient dataset serving.
[CV-4] Sensor Generalization for Adaptive Sensing in Event-based Object Detection via Joint Distribution Training
【速读】:该论文旨在解决事件相机(event camera)在对象检测任务中因输出信号特性新颖而导致的数据多样性不足及参数表征分析匮乏的问题。其解决方案的关键在于深入分析内在参数对基于事件数据训练模型性能的影响,并利用这些发现提升下游模型的传感器无关鲁棒性(sensor-agnostic robustness),从而增强模型在不同事件相机配置下的泛化能力。
链接: https://arxiv.org/abs/2602.23357
作者: Aheli Saha,René Schuster,Didier Stricker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, International Conference on Pattern Recognition Applications and Methods
Abstract:Bio-inspired event cameras have recently attracted significant research due to their asynchronous and low-latency capabilities. These features provide a high dynamic range and significantly reduce motion blur. However, because of the novelty in the nature of their output signals, there is a gap in the variability of available data and a lack of extensive analysis of the parameters characterizing their signals. This paper addresses these issues by providing readers with an in-depth understanding of how intrinsic parameters affect the performance of a model trained on event data, specifically for object detection. We also use our findings to expand the capabilities of the downstream model towards sensor-agnostic robustness.
[CV-5] Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
【速读】:该论文旨在解决开放词汇分割(Open-vocabulary segmentation, OVS)在零样本场景下性能落后于全监督方法的问题,其核心挑战在于视觉语言模型(Vision-language models, VLMs)训练时采用粗粒度图像级监督以及自然语言描述存在的语义模糊性。解决方案的关键在于引入少量样本设置(few-shot setting),通过提供像素级标注的支持集(support set)来增强文本提示,并提出一种检索增强的测试时适配器(retrieval-augmented test-time adapter),该适配器通过学习每个查询的融合机制,将文本与视觉支持特征进行动态、自适应融合,从而实现跨模态更强的协同效应。此方法不仅支持持续扩展支持集,还适用于细粒度任务如个性化分割,在保持开放词汇能力的同时显著缩小了零样本与监督分割之间的性能差距。
链接: https://arxiv.org/abs/2602.23339
作者: Tilemachos Aravanis,Vladan Stojnić,Bill Psomas,Nikos Komodakis,Giorgos Tolias
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
[CV-6] hinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding ICLR2026
【速读】:该论文旨在解决当前多模态大语言模型(Omni-modal Large Language Models, OLLM)在复杂推理能力上的不足问题,即尽管其具备对多种模态数据的感知能力,但在跨模态推理任务中表现受限。为实现无需额外训练和数据即可提升推理性能的目标,论文提出了一种名为ThinkOmni的训练-free、数据-free框架,其核心创新在于两个关键组件:一是利用现成的大规模推理模型(Large Reasoning Models, LRM)作为引导(LRM-as-a-Guide),在解码过程中注入结构化推理逻辑;二是引入步进式对比缩放机制(Stepwise Contrastive Scaling),自动平衡感知信号与推理信号之间的权重,避免人工调参。该方案在六个多模态推理基准测试中显著提升了性能,验证了其通用性与有效性。
链接: https://arxiv.org/abs/2602.23306
作者: Yiran Guan,Sifan Tu,Dingkang Liang,Linghao Zhu,Jianzhong Ju,Zhenbo Luo,Jian Luan,Yuliang Liu,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); MiLM Plus, Xiaomi Inc. (小米公司米隆实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accept by ICLR 2026
Abstract:Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
[CV-7] PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM
【速读】:该论文旨在解决医学诊断中多模态信息融合不足的问题,即现有方法通常将临床元数据(clinical metadata)视为孤立标签,未能充分利用其蕴含的丰富语义知识。解决方案的关键在于提出PRIMA框架,通过三个核心机制实现:首先,利用检索增强生成(Retrieval-Augmented Generation, RAG)构建专家级风险-疾病关联语料库,对Clinical ModernBERT进行精炼以嵌入诊断先验知识;其次,设计基于DINOv3与优化BERT的双编码器预训练策略,结合四种互补损失函数,实现跨模态的多层次语义对齐并处理临床关联的模糊性;最后,借助Qwen-3模型融合对齐后的特征,完成精准疾病分类。该方案在不依赖大规模数据或高算力的前提下显著提升了模型鲁棒性与性能。
链接: https://arxiv.org/abs/2602.23297
作者: Yiqing Wang,Chunming He,Ming-Chen Lu,Mercy Pawar,Leslie Niziol,Maria Woodward,Sina Farsiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as isolated tags, failing to exploit the rich semantic knowledge embedded in clinical descriptions. We propose PRIMA (Pre-training with Risk-integrated Image-Metadata Alignment), a framework that integrates domain-specific knowledge into multi-modal representation learning. We first curate an expert corpus of risk-disease correlations via Retrieval-Augmented Generation (RAG) to refine Clinical ModernBERT, embedding diagnostic priors into the text encoder. To bridge the modality gap, we introduce a dual-encoder pre-training strategy utilizing DINOv3 and our refined BERT, optimized by a suite of four complementary loss functions. These losses are designed to capture multi-granular semantic alignment and handle the ambiguity of clinical correlations through soft labels. Finally, we leverage Qwen-3 to fuse these aligned features for precise disease classification. Extensive experiments demonstrate that PRIMA effectively harmonizes pixel-level features with abstract clinical expertise, significantly outperforming other state-of-the-art methods. Notably, our framework achieves superior robustness without the need for massive data collection or exhaustive computational resources. Our code will be made public upon acceptance.
[CV-8] ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
【速读】:该论文旨在解决大规模数据集在模型训练中效率低下且包含冗余信息的问题,提出一种无需重新训练的高效数据蒸馏方法。其核心挑战在于如何从海量数据中提取具有代表性的紧凑数据集,同时保持语义一致性与多样性。解决方案的关键在于提出流形引导蒸馏(Manifold-Guided Distillation, ManifoldGD),该方法利用变分自编码器(VAE)潜空间中的层次分割聚类生成多尺度实例原型中心(IPC),构建每个去噪步骤的局部潜流形,并在每一步将模式对齐向量投影到估计流形的切空间中,从而约束生成轨迹始终沿流形分布,实现几何感知的语义保真生成。此策略显著提升了合成数据的代表性、多样性和图像保真度,优于现有训练-free 和 training-based 方法。
链接: https://arxiv.org/abs/2602.23295
作者: Ayush Roy,Wei-Yang Alex Lee,Rudrasis Chakraborty,Vishnu Suresh Lokhande
机构: University at Buffalo, SUNY (纽约州立大学布法罗分校); Lawrence Livermore National Lab (劳伦斯利弗莫尔国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPE 2026
Abstract:In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which often are rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, l2 distance among real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
[CV-9] owards Long-Form Spatio-Temporal Video Grounding
【速读】:该论文旨在解决长视频时序目标定位(Long-Form Spatio-Temporal Video Grounding, LF-STVG)问题,即在长达数分钟甚至数小时的长视频中精确定位文本查询所指的目标。现有方法主要针对短视频(通常小于一分钟)设计,难以处理长视频中更长的时间跨度和大量无关信息。解决方案的关键在于提出一种自回归Transformer架构ART-STVG,其核心创新包括:1)将视频作为流式输入逐帧处理,避免一次性处理全部帧带来的计算瓶颈;2)设计空间与时间记忆库(memory banks),结合有效的记忆选择策略以提供当前帧更相关的上下文信息;3)采用级联式时空结构,先进行空间定位再基于精细的空间线索辅助复杂的时间定位,从而显著提升长视频中的定位性能。
链接: https://arxiv.org/abs/2602.23294
作者: Xin Gu,Bing Fan,Jiali Yao,Zhipeng Zhang,Yan Huang,Cheng Han,Heng Fan,Libo Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.
[CV-10] PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning
【速读】:该论文旨在解决小样本活检中免疫组织化学(IHC)分析受限的问题,提出了一种基于单重训练数据的提示引导虚拟多色染色框架(PGVMS),以实现从常规苏木精-伊红(HE)图像中数字化生成多组分IHC表示。其解决方案的关键在于三个创新策略:一是引入自适应提示引导机制,利用病理视觉语言模型动态调整染色提示,克服语义指导不足的问题;二是设计蛋白感知学习策略(PALS),通过直接量化与约束蛋白分布来保持精确的蛋白表达模式;三是采用原型一致学习策略(PCLS),建立跨图像语义交互以校正不同染色模态间的空间错位问题。
链接: https://arxiv.org/abs/2602.23292
作者: Fuqiang Chen,Ranran Zhang,Wanming Hu,Deboch Eyob Abera,Yue Peng,Boyun Zheng,Yiwen Sun,Jing Cai,Wenjian Qin
机构: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Sun Yat-sen University Cancer Center; Hong Kong Polytechnic University; The Chinese University of Hong Kong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TMI
Abstract:Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform HE images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi-staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt-guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein-aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignments (Challenge 3).
[CV-11] LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction
【速读】:该论文旨在解决从卫星影像中自动、准确提取道路的问题,尤其针对现有方法在捕捉长距离依赖关系和复杂拓扑结构时表现不足的局限性。其核心解决方案是提出LineGraph2Road框架,关键创新在于将连接性预测建模为在全局但稀疏的欧几里得图上的二分类任务:图节点由分割掩码提取的关键点构成,边表示预定义距离阈值内的点对,代表潜在的道路段;进一步通过构建线图(line graph)并应用图Transformer(Graph Transformer)进行连接性预测,从而克服传统端点嵌入融合在集合同构边上的表达瓶颈,实现更丰富的连接关系表征与全局结构的有效推理。此外,引入过街天桥/地下通道判别头和耦合非极大值抑制策略,有效处理多层交叉结构并保留关键连接,显著提升实际部署所需的细节精度。
链接: https://arxiv.org/abs/2602.23290
作者: Zhengyang Wei,Renzhi Jing,Yiyi He,Jenny Suckale
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.
[CV-12] Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling
【速读】:该论文旨在解决生成式 AI (Generative AI) 在训练过程中因记忆敏感图像数据而导致隐私泄露的问题,尤其关注如何在保障差分隐私(Differential Privacy, DP)的前提下提升图像生成质量。其解决方案的关键在于提出了一种基于频谱的差分隐私框架(Spectral DP Framework),核心思想是:敏感信息主要集中在图像的低频波段(如人脸特征和物体轮廓),而高频成分多为通用纹理,可视为公共信息。因此,该方法采用两阶段策略:首先在低分辨率下对图像的小波系数进行差分隐私微调,仅保护全局结构;随后利用预训练的超分辨率模型进行上采样以恢复细节,借助差分隐私的后处理性质实现高质量图像生成,从而在隐私与实用性之间取得更优平衡。
链接: https://arxiv.org/abs/2602.23262
作者: Jasmine Bayrooti,Weiwei Kong,Natalia Ponomareva,Carlos Esteves,Ameesh Makadia,Amanda Prorok
机构: University of Cambridge (剑桥大学); Google Research (谷歌研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees essential. While differential privacy (DP) provides a principled framework for such guarantees, standard DP finetuning (e.g., with DP-SGD) often results in severe degradation of image quality, particularly in high-frequency textures, due to the indiscriminate addition of noise across all model parameters. In this work, we propose a spectral DP framework based on the hypothesis that the most privacy-sensitive portions of an image are often low-frequency components in the wavelet space (e.g., facial features and object shapes) while high-frequency components are largely generic and public. Based on this hypothesis, we propose the following two-stage framework for DP image generation with coarse image intermediaries: (1) DP finetune an autoregressive spectral image tokenizer model on the low-resolution wavelet coefficients of the sensitive images, and (2) perform high-resolution upsampling using a publicly pretrained super-resolution model. By restricting the privacy budget to the global structures of the image in the first stage, and leveraging the post-processing property of DP for detail refinement, we achieve promising trade-offs between privacy and utility. Experiments on the MS-COCO and MM-CelebA-HQ datasets show that our method generates images with improved quality and style capture relative to other leading DP image frameworks.
[CV-13] Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
【速读】:该论文旨在解决端到端自动驾驶(End-to-End Autonomous Driving, E2E-AD)系统在面对长尾罕见场景时泛化能力不足的问题,即当前基于模仿学习(Imitation Learning, IL)的方法依赖专家行为监督,在未见场景中易产生不安全决策。其解决方案的关键在于提出一种无需专家动作监督的统一框架——风险感知世界模型预测控制(Risk-aware World Model Predictive Control, RaWMPC),通过构建具备风险预测能力的世界模型(World Model)来评估候选动作的风险,并引入风险感知交互策略增强模型对危险行为后果的预测能力;同时设计自评估蒸馏方法,将世界模型中的避险能力迁移至生成式动作提案网络(Generative Action Proposal Network),从而在测试阶段生成低风险动作,实现鲁棒且可解释的决策。
链接: https://arxiv.org/abs/2602.23259
作者: Jiangxin Sun,Feng Xue,Teng Long,Chang Liu,Jian-Fang Hu,Wei-Shi Zheng,Nicu Sebe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of “only driving like the expert” suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable. Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill riskavoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
[CV-14] Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
【速读】:该论文旨在解决纯视觉图形用户界面(GUI)智能体在高分辨率屏幕截图和历史轨迹中存在严重时空冗余所导致的效率瓶颈问题。现有压缩范式存在两个关键错位:一是时间上的不匹配,即统一的历史编码与智能体“遗忘记忆”的注意力模式不符;二是空间拓扑冲突,即无结构剪枝破坏了网格完整性,引发空间幻觉。解决方案的核心是提出一种无需训练的框架GUIPruner,其关键创新在于融合两种机制:Temporal-Adaptive Resolution(TAR),通过基于衰减的自适应分辨率调整消除历史冗余;以及Stratified Structure-aware Pruning(SSP),优先保留交互前景与语义锚点的同时保护全局布局结构,从而实现高效且精准的GUI导航。
链接: https://arxiv.org/abs/2602.23235
作者: Zhou Xu,Bowen Zhou,Qi Wang,Shuwen Feng,Jingyu Xiao
机构: Tsinghua UniversityShenzhenChina; Xidian UniversityXi’anChina; The Chinese University of Hong KongHong KongChina
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent’s “fading memory” attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
[CV-15] Skarimva: Skeleton-based Action Recognition is a Multi-view Application
【速读】:该论文试图解决骨架数据质量对基于骨架的动作识别模型性能的限制问题。解决方案的关键在于利用多视角摄像机进行三角测量,以生成更精确的3D骨架数据,从而显著提升现有先进动作识别模型的性能。研究结果表明,在大多数实际应用场景中,采用多视角方案的成本效益比非常有利,因此建议未来基于骨架的动作识别研究将多视角应用作为标准设置。
链接: https://arxiv.org/abs/2602.23231
作者: Daniel Bermuth,Alexander Poeppel,Wolfgang Reif
机构: University of Augsburg (奥格斯堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D~skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.
[CV-16] Large Multimodal Models as General In-Context Classifiers CVPR
【速读】:该论文旨在解决多模态模型在闭世界和开世界分类任务中性能差异的问题,尤其是传统对比视觉语言模型(CLIP-like contrastive Vision-Language Models, VLMs)在零样本分类上表现优异,而大型多模态模型(Large Multimodal Models, LMMs)虽在复杂任务中更具潜力,却因缺乏有效利用上下文信息的能力而被低估。其解决方案的关键在于揭示LMM的“上下文学习”(in-context learning)能力,并提出一种无需训练的迭代伪标签优化方法——CIRCLE,通过利用上下文自身信息逐步修正示例标签,从而显著提升LMM在开世界场景下的鲁棒性与分类性能,验证了LMM作为统一且灵活分类器的潜力。
链接: https://arxiv.org/abs/2602.23229
作者: Marco Garosi,Matteo Farina,Alessandro Conti,Massimiliano Mancini,Elisa Ricci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Findings 2026. Project website at this https URL
Abstract:Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMM) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP’s, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their “in-context” equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.
[CV-17] MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
【速读】:该论文旨在解决长时视频(如电影和电视剧)自动摘要生成中现有视觉语言模型(Vision-Language Models, VLMs)面临的两大核心问题:一是角色身份不一致(ID-consistent character identification)缺失,二是叙事连贯性断裂。解决方案的关键在于提出了一种无需训练的工具增强型渐进抽象框架——MovieTeller,其核心创新是通过调用现成的人脸识别模型作为外部“工具”建立事实锚定(Factual Groundings),即精确的角色身份及其边界框信息,并将这些可验证的事实注入提示(prompt)以引导VLM推理;同时,采用多阶段渐进抽象流程分解整部电影的摘要任务,从而缓解当前VLM上下文长度限制,显著提升摘要的事实准确性、角色一致性和整体叙事连贯性。
链接: https://arxiv.org/abs/2602.23228
作者: Yizhi Li,Xiaohan Chen,Miao Jiang,Wentao Tang,Gaoang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, CSCWD 2026
Abstract:With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external “tool” to establish Factual Groundings–precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM’s reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
[CV-18] UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception
【速读】:该论文旨在解决机器人应用中基于视觉的环境结构准确重建问题,即如何从多视角图像序列中联合估计相机内参与外参、尺度不变深度图和点云,并恢复场景的度量尺度(metric scale)。其解决方案的关键在于提出了一种统一且具备尺度感知能力的多视图3D重建框架UniScale,该框架通过模块化、语义驱动的设计灵活整合几何先验信息,结合全局上下文推理与相机感知特征表示,在单个前向网络中实现多任务协同优化,从而在无需从头训练的情况下利用预训练模型中的世界先验,显著提升资源受限机器人团队中的鲁棒性与精度。
链接: https://arxiv.org/abs/2602.23224
作者: Mohammad Mahdavian,Gordon Tan,Binbin Xu,Yuan Ren,Dongfeng Bai,Bingbing Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric-scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.
[CV-19] Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks
【速读】:该论文旨在解决当前计算机视觉任务在数学建模上受限于矩阵思维的问题,即现有架构依赖于矩阵权重和向量偏置,需通过结构扁平化(flattening)处理输入数据,从而限制了可表达任务的空间。其解决方案的关键在于提出多维任务学习(Multidimensional Task Learning, MTL),基于广义爱因斯坦多层感知机(Generalized Einstein MLPs, GE-MLPs),该模型直接在张量(tensor)上通过爱因斯坦积(Einstein product)进行运算,允许显式控制哪些维度被保留或压缩,且不造成信息损失。这一框架揭示了分类、分割和检测等任务本质上是MTL在不同维度配置下的特例,并证明该任务空间严格大于传统矩阵方法所能表达的范围,从而为设计如时空联合预测或跨模态预测等复杂任务提供了数学基础。
链接: https://arxiv.org/abs/2602.23217
作者: Alaa El Ichi,Khalide Jbilou
机构: Université du Littoral Cote d’Opale (滨海大学); LMPA (实验室名称,无需翻译)
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:
Abstract:This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vectorvalued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.
[CV-20] Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction
【速读】:该论文旨在解决基于插件式扩散先验(Plug-and-Play Diffusion Prior, PnPDP)的成像逆问题求解中存在的重要缺陷:现有PnP求解器(如基于HQS或近似梯度的方法)作为无记忆算子,仅依赖瞬时梯度更新估计值,导致在严重噪声污染下无法消除稳态偏差(non-vanishing steady-state bias),即重建结果不能严格满足物理测量约束。解决方案的关键在于提出双耦合PnP扩散方法(Dual-Coupled PnP Diffusion),通过恢复经典对偶变量以提供积分反馈机制,理论上保证收敛至精确的数据流形(data manifold)。然而,这种严格的几何耦合引入了次级挑战——累积对偶残差呈现频谱相关结构伪影,违反扩散先验所假设的加性白高斯噪声(Additive White Gaussian Noise, AWGN)特性,引发严重幻觉现象。为此,作者进一步设计频谱同质化(Spectral Homogenization, SH)机制,在频域内将结构化残差调制为统计上符合AWGN特性的伪噪声输入,从而实现优化轨迹与去噪器有效统计流形之间的精准对齐,最终突破偏置-幻觉权衡瓶颈,显著提升重建保真度并加速收敛。
链接: https://arxiv.org/abs/2602.23214
作者: Chenhe Du,Xuanyu Tian,Qing Wu,Muyu Liu,Jingyi Yu,Hongjiang Wei,Yuyao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, theoretically guaranteeing asymptotic convergence to the exact data manifold. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver’s rigorous optimization trajectory with the denoiser’s valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence.
[CV-21] hrough BrokenEyes: How Eye Disorders Impact Face Detection?
【速读】:该论文旨在解决视觉障碍对深度学习模型中神经样特征表示影响的机制问题,特别是量化不同眼疾如何改变模型从图像中提取的特征映射。其解决方案的关键在于构建了一个基于BrokenEyes系统的计算框架,通过模拟五种常见眼疾(包括年龄相关性黄斑变性、白内障、青光眼、屈光不正和糖尿病视网膜病变)来生成受干扰的视觉输入,并结合人类与非人类数据集训练模型,在正常和特定疾病条件下对比特征表示的变化;利用激活能量(activation energy)和余弦相似度(cosine similarity)等指标定量分析了白内障和青光眼引起的特征畸变最显著,从而揭示了退化视觉输入与模型学习表征之间的内在关联。
链接: https://arxiv.org/abs/2602.23212
作者: Prottay Kumar Adhikary
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision disorders significantly impact millions of lives, altering how visual information is processed and perceived. In this work, a computational framework was developed using the BrokenEyes system to simulate five common eye disorders: Age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy and analyze their effects on neural-like feature representations in deep learning models. Leveraging a combination of human and non-human datasets, models trained under normal and disorder-specific conditions revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics such as activation energy and cosine similarity quantified the severity of these distortions, providing insights into the interplay between degraded visual inputs and learned representations.
[CV-22] EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
【速读】:该论文旨在解决现有捕捉系统在自然环境中大规模采集场景条件下的高质量人体运动数据受限的问题,尤其是传统方法依赖昂贵的演播室设备或可穿戴装置,难以实现无标记、低成本且具度量尺度一致性的数据采集。其解决方案的关键在于提出了一种名为EmbodMocap的便携式数据采集流程,利用两个移动的iPhone协同校准双视角RGB-D序列,在统一的度量世界坐标系中重建人类与场景几何信息,从而无需静态相机或标记即可实现日常环境中的高精度人体-场景联合捕捉。该方法通过双视角设置有效缓解了单目深度歧义问题,显著提升了重建与对齐性能,为具身人工智能(Embodied AI)任务提供了可靠的数据基础。
链接: https://arxiv.org/abs/2602.23205
作者: Wenjia Wang,Liang Pan,Huaijin Pi,Yuke Lou,Xuqian Ren,Yifan Wu,Zhouyingcheng Liao,Lei Yang,Rishabh Dabral,Christian Theobalt,Taku Komura
机构: The University of Hong Kong (香港大学); Tampere University (坦佩雷大学); The Chinese University of Hong Kong (香港中文大学); Max-Planck Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over single iphone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene-reconstruction, where we fine-tune on feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we prove our data could be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
[CV-23] Motion-aware Event Suppression for Event Cameras
【速读】:该论文旨在解决事件相机(Event Camera)在动态场景中因自身运动(ego-motion)和移动物体(IMOs, Moving Objects)引发的冗余事件干扰问题,从而提升下游任务如视觉Transformer推理效率与事件视觉里程计精度。解决方案的关键在于提出首个面向运动感知的事件抑制框架(Motion-aware Event Suppression),通过轻量级模型实时联合分割当前事件流中的IMO并预测其未来运动轨迹,实现对动态事件的前瞻性抑制;该方法在消费级GPU上达到173 Hz推理速度且内存占用低于1 GB,在EVIMO基准测试中 segmentation accuracy 提升67%,同时通过事件令牌剪枝使Vision Transformer推理加速83%,并将绝对轨迹误差(ATE)降低13%。
链接: https://arxiv.org/abs/2602.23204
作者: Roberto Pellerito,Nico Messikommer,Giovanni Cioffi,Marco Cannici,Davide Scaramuzza
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by IMOs and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.
[CV-24] ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation
【速读】:该论文旨在解决结肠镜视频生成中因数据稀缺导致的临床分析困难问题,尤其在保持时间一致性与精准控制临床属性(如病变特征、成像模态等)方面面临挑战。其关键解决方案是提出了一种基于扩散模型的框架 ColoDiff,其中 TimeStream 模块通过跨帧标记化机制解耦时间依赖性,实现对不规则肠道结构下的复杂动态建模;Content-Aware 模块引入噪声注入嵌入与可学习原型,实现对临床属性的精细调控,突破传统扩散模型粗粒度引导的局限;此外,采用非马尔可夫采样策略将生成步数减少超 90%,支持实时生成。
链接: https://arxiv.org/abs/2602.23203
作者: Junhu Fu,Shuyu Liang,Wutong Li,Chen Ma,Peng Huang,Kehao Wang,Ke Chen,Shengli Lin,Pinghong Zhou,Zeju Li,Yuanyuan Wang,Yi Guo
机构: Fudan University (复旦大学); Fudan University Shanghai Cancer Center (复旦大学附属肿瘤医院); Zhongshan Hospital, Fudan University (复旦大学附属中山医院); Shanghai Collaborative Innovation Center of Endoscopy (上海内镜协同创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we propose ColoDiff, a diffusion-based framework that generates dynamic-consistent and content-aware colonoscopy videos, aiming to alleviate data shortage and assist clinical analysis. At the inter-frame level, our TimeStream module decouples temporal dependency from video sequences through a cross-frame tokenization mechanism, enabling intricate dynamic modeling despite irregular intestinal structures. At the intra-frame level, our Content-Aware module incorporates noise-injected embeddings and learnable prototypes to realize precise control over clinical attributes, breaking through the coarse guidance of diffusion models. Additionally, ColoDiff employs a non-Markovian sampling strategy that cuts steps by over 90% for real-time generation. ColoDiff is evaluated across three public datasets and one hospital database, based on both generation metrics and downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation. Extensive experiments show ColoDiff generates videos with smooth transitions and rich dynamics. ColoDiff presents an effort in controllable colonoscopy video generation, revealing the potential of synthetic videos in complementing authentic representation and mitigating data scarcity in clinical settings.
[CV-25] FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification
【速读】:该论文旨在解决神经网络量化压缩过程中对算法公平性(algorithmic fairness)关注不足的问题,尤其是在医疗图像分类任务中,如何在有限的比特预算下实现性能与公平性的协同优化。其解决方案的关键在于提出FairQuant框架,该框架融合了群体感知的重要性分析、受预算约束的混合精度分配策略,以及一种可学习的位感知量化(Bit-Aware Quantization, BAQ)机制,通过联合优化权重和每个单元的比特分配,在保持模型整体精度的同时显著提升最差群体的表现,从而在相同比特预算下实现优于均匀量化基线的公平性指标。
链接: https://arxiv.org/abs/2602.23192
作者: Thomas Woergaard,Raghavendra Selvan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Source code available at this https URL
Abstract:Compressing neural networks by quantizing model parameters offers useful trade-off between performance and efficiency. Methods like quantization-aware training and post-training quantization strive to maintain the downstream performance of compressed models compared to the full precision models. However, these techniques do not explicitly consider the impact on algorithmic fairness. In this work, we study fairness-aware mixed-precision quantization schemes for medical image classification under explicit bit budgets. We introduce FairQuant, a framework that combines group-aware importance analysis, budgeted mixed-precision allocation, and a learnable Bit-Aware Quantization (BAQ) mode that jointly optimizes weights and per-unit bit allocations under bitrate and fairness regularization. We evaluate the method on Fitzpatrick17k and ISIC2019 across ResNet18/50, DeiT-Tiny, and TinyViT. Results show that FairQuant configurations with average precision near 4-6 bits recover much of the Uniform 8-bit accuracy while improving worst-group performance relative to Uniform 4- and 8-bit baselines, with comparable fairness metrics under shared budgets.
[CV-26] Uni-Animator: Towards Unified Visual Colorization CVPR2026
【速读】:该论文旨在解决现有图像与视频草图着色方法在统一建模时面临的三大问题:颜色传递不准确(尤其在单/多参考场景下)、高频物理细节保留不足,以及大运动场景中时间一致性差、出现运动伪影。其核心解决方案包括三个关键模块:一是通过实例块嵌入(instance patch embedding)增强视觉参考,实现参考颜色信息的精准对齐与融合;二是设计基于物理特征的细节强化机制,有效捕捉并保留高频纹理;三是提出基于草图的动态旋转位置编码(dynamic RoPE encoding),自适应建模运动感知的空间-时间依赖关系,从而提升时间一致性。整体框架名为Uni-Animator,基于扩散Transformer(DiT)架构,在保持任务专用方法性能的同时,实现了跨域统一着色能力,并具备高保真细节和强鲁棒性的时间一致性。
链接: https://arxiv.org/abs/2602.23191
作者: Xinyuan Chen,Yao Xu,Shaowen Wang,Pengjie Song,Bowen Deng
机构: Mississippi State University (密西西比州立大学); UIUC (伊利诺伊大学厄巴纳-香槟分校); Hunan University (湖南大学); HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures. Submitted to CVPR 2026
Abstract:We propose Uni-Animator, a novel Diffusion Transformer (DiT)-based framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with single or multiple references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts in large-motion scenes. To tackle imprecise color transfer, we introduce visual reference enhancement via instance patch embedding, enabling precise alignment and fusion of reference color information. To resolve insufficient physical detail preservation, we design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures. To mitigate motion-induced temporal inconsistency, we propose sketch-based dynamic RoPE encoding that adaptively models motion-aware spatial-temporal dependencies. Extensive experimental results demonstrate that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.
[CV-27] Phys-3D: Physics-Constrained Real-Time Crowd Tracking and Counting on Railway Platforms
【速读】:该论文旨在解决铁路站台场景下基于单目相机的实时人群计数问题,尤其针对列车行驶过程中因密集遮挡、相机运动及透视畸变导致的传统跟踪-检测方法失效的问题。其关键解决方案在于提出一种物理约束的跟踪框架(Phys-3D),通过将YOLOv11m目标检测器与EfficientNet-B0外观编码结合到DeepSORT中,并引入基于针孔相机模型的物理约束卡尔曼滤波器(Phys-3D),强制实现符合真实三维运动规律的轨迹预测,同时设计虚拟计数带(virtual counting band)以增强遮挡下的计数鲁棒性。该方法在MOT-RailwayPlatformCrowdHead Dataset(MOT-RPCH)上将计数误差降至2.97%,验证了先验几何与运动知识对复杂动态环境下可靠人群计数的有效性。
链接: https://arxiv.org/abs/2602.23177
作者: Bin Zeng,Johannes Künzel,Anna Hilsmann,Peter Eisert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: published at VISAPP 2026
Abstract:Accurate, real-time crowd counting on railway platforms is essential for safety and capacity management. We propose to use a single camera mounted in a train, scanning the platform while arriving. While hardware constraints are simple, counting remains challenging due to dense occlusions, camera motion, and perspective distortions during train arrivals. Most existing tracking-by-detection approaches assume static cameras or ignore physical consistency in motion modeling, leading to unreliable counting under dynamic conditions. We propose a physics-constrained tracking framework that unifies detection, appearance, and 3D motion reasoning in a real-time pipeline. Our approach integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT, while introducing a physics-constrained Kalman model (Phys-3D) that enforces physically plausible 3D motion dynamics through pinhole geometry. To address counting brittleness under occlusions, we implement a virtual counting band with persistence. On our platform benchmark, MOT-RailwayPlatformCrowdHead Dataset(MOT-RPCH), our method reduces counting error to 2.97%, demonstrating robust performance despite motion and occlusions. Our results show that incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios, facilitating effective train scheduling and platform safety management.
[CV-28] Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking
【速读】:该论文旨在解决动态环境中机器人安全可靠运行所需的4D(三维空间+时间)场景理解问题,现有方法通常仅关注几何跟踪(如边界框)或静态3D结构(如体素占用),缺乏显式的时序关联。其解决方案的关键在于提出Latent Gaussian Splatting (LaGS),通过将多视角观测融合为稀疏的点中心潜在表示——3D高斯分布,并利用新颖的潜在高斯投射(latent Gaussian splatting)机制高效地将特征聚合到3D体素网格中,从而实现端到端的基于掩码的多视角全景占用预测与时序一致性建模,显著提升了4D全景占用跟踪的精度与完整性。
链接: https://arxiv.org/abs/2602.23172
作者: Maximilian Luz,Rohit Mohan,Thomas Nürnberg,Yakov Miron,Daniele Cattaneo,Abhinav Valada
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Capturing 4D spatiotemporal surroundings is crucial for the safe and reliable operation of robots in dynamic environments. However, most existing methods address only one side of the problem: they either provide coarse geometric tracking via bounding boxes, or detailed 3D structures like voxel-based occupancy that lack explicit temporal association. In this work, we present Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS) that advances spatiotemporal scene understanding in a holistic direction. Our approach incorporates camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction, and addresses the key challenge of efficiently aggregating multi-view information into 3D voxel grids via a novel latent Gaussian splatting approach. Specifically, we first fuse observations into 3D Gaussians that serve as a sparse point-centric latent representation of the 3D scene, and then splat the aggregated features onto a 3D voxel grid that is decoded by a mask-based segmentation head. We evaluate LaGS on the Occ3D nuScenes and Waymo datasets, achieving state-of-the-art performance for 4D panoptic occupancy tracking. We make our code available at this https URL.
[CV-29] Learning Continuous Wasserstein Barycenter Space for Generalized All-in-One Image Restoration
【速读】:该论文旨在解决当前统一图像复原方法在面对分布外(out-of-distribution)退化类型时泛化能力不足的问题,从而限制了其在真实场景中的应用效果。解决方案的关键在于提出BaryIR框架,通过在Wasserstein barycenter(WB)空间中对多源退化特征进行对齐,建模一个退化无关的共享分布;同时引入残差子空间,与WB嵌入正交且相互对比,从而显式解耦出两个正交空间:一是编码跨退化不变内容的WB空间,二是自适应保留退化特异性知识的残差子空间。这种解耦机制有效缓解了对训练内退化的过拟合,并实现了基于退化无关共享不变性的自适应复原,显著提升了模型在未见退化类型和混合真实退化场景下的鲁棒性与泛化能力。
链接: https://arxiv.org/abs/2602.23169
作者: Xiaole Tang,Xiaoyi He,Jiayi Xu,Xiang Gu,Jian Sun
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite substantial advances in all-in-one image restoration for addressing diverse degradations within a unified model, existing methods remain vulnerable to out-of-distribution degradations, thereby limiting their generalization in real-world scenarios. To tackle the challenge, this work is motivated by the intuition that multisource degraded feature distributions are induced by different degradation-specific shifts from an underlying degradation-agnostic distribution, and recovering such a shared distribution is thus crucial for achieving generalization across degradations. With this insight, we propose BaryIR, a representation learning framework that aligns multisource degraded features in the Wasserstein barycenter (WB) space, which models a degradation-agnostic distribution by minimizing the average of Wasserstein distances to multisource degraded distributions. We further introduce residual subspaces, whose embeddings are mutually contrasted while remaining orthogonal to the WB embeddings. Consequently, BaryIR explicitly decouples two orthogonal spaces: a WB space that encodes the degradation-agnostic invariant contents shared across degradations, and residual subspaces that adaptively preserve the degradation-specific knowledge. This disentanglement mitigates overfitting to in-distribution degradations and enables adaptive restoration grounded on the degradation-agnostic shared invariance. Extensive experiments demonstrate that BaryIR performs competitively against state-of-the-art all-in-one methods. Notably, BaryIR generalizes well to unseen degradations (\textite.g., types and levels) and shows remarkable robustness in learning generalized features, even when trained on limited degradation types and evaluated on real-world data with mixed degradations.
[CV-30] Agent Vista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
【速读】:该论文旨在解决当前多模态基准测试难以全面评估通用多模态代理在真实场景中执行长程、多步骤任务的能力问题,特别是现有评测体系缺乏对视觉细节敏感性、跨模态工具协同使用以及长时间跨度任务执行能力的充分刻画。解决方案的关键在于提出AgentVista——一个涵盖7大类25个子领域的多模态基准,其核心特征是将高保真视觉场景与自然混合工具链(包括网页搜索、图像搜索、页面导航及基于代码的操作)相结合,从而系统性地评估模型在复杂多步流程中的长期多模态工具调用能力。实验表明,即使是最先进的模型(如Gemini-3-Pro with tools)在该基准上仅达到27.3%的整体准确率,且部分难题需超过25次工具调用才能完成,凸显了当前技术在现实世界多模态智能代理构建上的显著差距。
链接: https://arxiv.org/abs/2602.23166
作者: Zhaochen Su,Jincheng Gao,Hangyu Guo,Zhenhua Liu,Lueyang Zhang,Xinyu Geng,Shijue Huang,Peng Xia,Guanyu Jiang,Cheng Wang,Yue Zhang,Yi R. Fung,Junxian He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project website is available at \url{ this https URL }, and the code is available at \url{ this https URL }
Abstract:Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.
[CV-31] DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
【速读】:该论文旨在解决现有生成式AI方法在数字人类交互中缺乏社会情境建模与双人互动动态捕捉的问题,即当前方法通常仅将单一音频流映射为单个说话者的动作,忽略了对话双方之间的相互影响。其解决方案的关键在于提出DyaDiT——一种多模态扩散Transformer架构,能够从双人对话音频(dyadic audio)中生成符合社交语境的人体动作;该模型通过融合双人信息以捕捉交互动力学、引入运动词典编码运动先验,并可选择性利用对话伙伴的姿势信息来生成更具响应性的动作,从而实现更自然、社会感知的数字人类交互。
链接: https://arxiv.org/abs/2602.23165
作者: Yichen Peng,Jyun-Ting Song,Siyeol Jung,Ruofan Liu,Haiyang Liu,Xuangeng Chu,Ruicong Liu,Erwin Wu,Hideki Koike,Kris Kitani
机构: Institute of Science Tokyo (东京科学研究所); Carnegie Mellon Unversity (卡内基梅隆大学); UNIST (韩国科学技术院); Shanda AI Research Tokyo (Shanda AI 研究所东京); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 9 figures
Abstract:Generating realistic conversational gestures are essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker’s motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner’s gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.
[CV-32] Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
【速读】:该论文旨在解决当前3D大模型(Large Multimodal Models, LMMs)在处理点云数据时依赖复杂预训练视觉编码器所带来的计算效率低和扩展性差的问题。传统方法通常采用重型编码器提取几何特征,但难以适应点云无序性和大规模特性,限制了模型的实用性。解决方案的关键在于提出Fase3D——首个无需编码器的基于傅里叶变换的3D场景LMM,其核心创新包括:1)通过结构化超点(superpoints)紧凑表示大规模场景;2)利用空间填充曲线对点云进行序列化并结合快速傅里叶变换(Fast Fourier Transform, FFT),实现高效全局上下文建模与图结构令牌合并;3)引入傅里叶增强型LoRA适配器,在极低计算成本下向语言模型注入全局频率感知交互机制。这一设计使Fase3D在性能上媲美编码器驱动的3D LMMs,同时显著降低计算开销与参数量。
链接: https://arxiv.org/abs/2602.23153
作者: Guofeng Mei,Wei Lin,Luigi Riz,Yujiao Wu,Yiming Wang,Fabio Poiesi
机构: Fondazione Bruno Kessler, Italy; JKU Linz, Austria; CSIRO, Australia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: this https URL.
[CV-33] Partial recovery of meter-scale surface weather
【速读】:该论文旨在解决当前天气分析与预报中缺乏对近地表气象要素(如风速、温度和湿度)在数十至数百米尺度上的空间变异性描述的问题。传统模型通常无法捕捉由地表覆盖类型和地形引起的微尺度气象差异,导致预测精度受限。解决方案的关键在于利用稀疏地面观测站数据与高分辨率地球观测数据,对粗分辨率大气状态进行条件约束,从而推断出美国本土范围内10米分辨率的连续近地表气象场。该方法显著降低了风速误差29%、温度与露点误差6%,并能解释更多空间方差,同时展现出可物理解释的结构特征,例如城市热岛效应、蒸散发驱动的湿度梯度以及不同地表覆盖类型间的风速差异。
链接: https://arxiv.org/abs/2602.23146
作者: Jonathan Giezendanner,Qidong Yang,Eric Schmitt,Anirban Chandra,Daniel Salles Civitarese,Johannes Jakubik,Jeremy Vila,Detlef Hohl,Campbell Watson,Sherrie Wang
机构: MIT (麻省理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Near-surface atmospheric conditions can differ sharply over tens to hundreds of meters due to land cover and topography, yet this variability is absent from current weather analyses and forecasts. It is unclear whether such meter-scale variability reflects irreducibly chaotic dynamics or contains a component predictable from surface characteristics and large-scale atmospheric forcing. Here we show that a substantial, physically coherent component of meter-scale near-surface weather is statistically recoverable from existing observations. By conditioning coarse atmospheric state on sparse surface station measurements and high-resolution Earth observation data, we infer spatially continuous fields of near-surface wind, temperature, and humidity at 10 m resolution across the contiguous United States. Relative to ERA5, the inferred fields reduce wind error by 29% and temperature and dewpoint error by 6%, while explaining substantially more spatial variance at fixed time steps. They also exhibit physically interpretable structure, including urban heat islands, evapotranspiration-driven humidity contrasts, and wind speed differences across land cover types. Our findings expand the frontier of weather modeling by demonstrating a computationally feasible approach to continental-scale meter-resolution inference. More broadly, they illustrate how conditioning coarse dynamical models on static fine-scale features can reveal previously unresolved components of the Earth system.
[CV-34] No Labels No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors CVPR2026
【速读】:该论文旨在解决在线视频稳定化(online video stabilization)中的三大长期挑战:数据受限、控制性差以及在资源受限硬件上的效率低下。传统基于深度学习的方法依赖于成对的稳定与不稳定视频数据集,而这类数据难以获取且泛化能力有限。为应对这些问题,作者提出了一种新的无监督框架,其核心在于将经典的稳定化流程分解为三个阶段,并引入多线程缓冲机制,从而在不依赖大规模标注数据的前提下实现高效、可控的实时稳定化处理。此外,为拓展应用场景至无人机夜间遥感等新领域,研究者还构建了首个多模态无人机航拍视频数据集(UAV-Test),实验表明该方法在定量指标和视觉质量上均优于现有在线稳定器,且性能接近离线方法。
链接: https://arxiv.org/abs/2602.23141
作者: Tao Liu,Gang Wan,Kan Ren,Shibo Wen
机构: Nanjing University of Science and Technology (南京理工大学); Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026
Abstract:We propose a new unsupervised framework for online video stabilization. Unlike methods based on deep learning that require paired stable and unstable datasets, our approach instantiates the classical stabilization pipeline with three stages and incorporates a multithreaded buffering mechanism. This design addresses three longstanding challenges in end-to-end learning: limited data, poor controllability, and inefficiency on hardware with constrained resources. Existing benchmarks focus mainly on handheld videos with a forward view in visible light, which restricts the applicability of stabilization to domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our method consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to offline methods.
[CV-35] From Calibration to Refinement: Seeking Certainty via Probabilistic Evidence Propagation for Noisy-Label Person Re-Identification
【速读】:该论文旨在解决无约束环境下行人重识别(Person Re-ID)中因标签噪声和每类样本稀疏导致的模型鲁棒性不足问题。现有方法依赖基于Softmax输出的损失修正或样本选择策略,但存在两大局限:一是Softmax具有平移不变性,对错误标签产生过度自信的预测;二是基于小损失阈值的样本选择常误删对特征学习至关重要的难正例。为此,作者提出两阶段框架CARE(Calibration-to-Refinement),其核心在于通过概率证据传播实现从校准到精炼的渐进式优化:第一阶段引入概率证据校准(PEC),通过在相似度函数中嵌入自适应可学习参数并设计证据校准损失,打破Softmax平移不变性并缓解误标样本的过自信问题;第二阶段设计证据传播精炼(EPR),包含复合角度边界(CAM)度量与确定性导向球面加权(COSW),前者在超球空间中精准区分干净难正例与误标样本,后者依据CAM动态分配样本权重,确保清洁样本主导模型更新。
链接: https://arxiv.org/abs/2602.23133
作者: Xin Yuan,Zhiyong Zhang,Xin Xu,Zheng Wang,Chia-Wen Lin
机构: Wuhan University of Science and Technology (武汉科技大学); Wuhan University (武汉大学); National Tsing Hua University (国立清华大学); Industrial Technology Research Institute (工业技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TMM 2026
Abstract:With the increasing demand for robust person Re-ID in unconstrained environments, learning from datasets with noisy labels and sparse per-identity samples remains a critical challenge. Existing noise-robust person Re-ID methods primarily rely on loss-correction or sample-selection strategies using softmax outputs. However, these methods suffer from two key limitations: 1) Softmax exhibits translation invariance, leading to over-confident and unreliable predictions on corrupted labels. 2) Conventional sample selection based on small-loss criteria often discards valuable hard positives that are crucial for learning discriminative features. To overcome these issues, we propose the CAlibration-to-REfinement (CARE) method, a two-stage framework that seeks certainty through probabilistic evidence propagation from calibration to refinement. In the calibration stage, we propose the probabilistic evidence calibration (PEC) that dismantles softmax translation invariance by injecting adaptive learnable parameters into the similarity function, and employs an evidential calibration loss to mitigate overconfidence on mislabeled samples. In the refinement stage, we design the evidence propagation refinement (EPR) that can more accurately distinguish between clean and noisy samples. Specifically, the EPR contains two steps: Firstly, the composite angular margin (CAM) metric is proposed to precisely distinguish clean but hard-to-learn positive samples from mislabeled ones in a hyperspherical space; Secondly, the certainty-oriented sphere weighting (COSW) is developed to dynamically allocate the importance of samples according to CAM, ensuring clean instances drive model updates. Extensive experimental results on Market1501, DukeMTMC-ReID, and CUHK03 datasets under both random and patterned noises show that CARE achieves competitive performance.
[CV-36] riLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement CVPR2026
【速读】:该论文旨在解决弱监督目标定位(Weakly Supervised Object Localization, WSOL)中两个核心挑战:一是现有方法多依赖多阶段流水线或对大型骨干网络进行全量微调,导致训练成本高;二是普遍存在的局部对象覆盖不足问题。解决方案的关键在于提出一种单阶段框架TriLite,其核心创新为TriHead模块,该模块将视觉Transformer(Vision Transformer)提取的图像块特征解耦为前景、背景和模糊区域三类,从而提升目标覆盖度并抑制伪激活响应。通过分离分类与定位任务的目标函数,TriLite充分利用自监督预训练ViT(Dinov2)所学习的通用表征能力,仅引入少于800K可训练参数即可实现高性能定位,显著优于此前方法且更易训练。
链接: https://arxiv.org/abs/2602.23120
作者: Arian Sabaghi,José Oramas
机构: University of Antwerp, sqIRL/IDLab, imec
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper consists of 8 pages including 6 figures. Accepted at CVPR 2026
Abstract:Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with Dinov2 pre-training in a self-supervised manner, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be released soon.
[CV-37] Devling into Adversarial Transferability on Image Classification: Review Benchmark and Evaluation
【速读】:该论文旨在解决当前关于基于迁移的对抗攻击(transfer-based attacks)评估缺乏标准化框架和评判标准的问题,这导致对现有方法的评估可能存在偏差。其解决方案的关键在于:首先系统性地梳理了数百篇相关研究,将各类迁移攻击归纳为六类;其次提出一个全面的评估框架,用作衡量不同攻击方法性能的基准;同时识别出提升对抗迁移性的常用策略,并指出可能导致比较不公平的常见问题,从而为该领域提供更客观、一致的评价体系。
链接: https://arxiv.org/abs/2602.23117
作者: Xiaosen Wang,Zhijin Ge,Bohan Liu,Zheng Fang,Fengfan Zhou,Ruixuan Zhang,Shaokang Wang,Yuyang Luo
机构: HUST(华中科技大学); Xidian University(西安电子科技大学); Wuhan University(武汉大学); HFUT(合肥工业大学); SJTU(上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code is available at this https URL
Abstract:Adversarial transferability refers to the capacity of adversarial examples generated on the surrogate model to deceive alternate, unexposed victim models. This property eliminates the need for direct access to the victim model during an attack, thereby raising considerable security concerns in practical applications and attracting substantial research attention recently. In this work, we discern a lack of a standardized framework and criteria for evaluating transfer-based attacks, leading to potentially biased assessments of existing approaches. To rectify this gap, we have conducted an exhaustive review of hundreds of related works, organizing various transfer-based attacks into six distinct categories. Subsequently, we propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that could lead to unfair comparisons. Finally, we provide a brief review of transfer-based attacks beyond image classification.
[CV-38] FLIGHT: Fibonacci Lattice-based Inference for Geometric Heading in real-Time
【速读】:该论文旨在解决单目视频中相机位姿估计(camera motion estimation)的问题,尤其针对在高噪声和高异常值条件下现有方法精度下降或计算复杂度上升的局限性。其解决方案的关键在于提出了一种在单位球面(S²)上对霍夫变换(Hough transform)的新型泛化方法:通过提取两帧间的特征对应关系生成大圆(great circle),并利用斐波那契格点(Fibonacci lattice)离散化单位球面作为投票单元,使得不受噪声或动态物体干扰的特征能一致地投票给正确的相机航向(heading)。该策略显著提升了在复杂场景下的鲁棒性和效率,实验表明该方法位于准确率与计算效率的帕累托前沿(Pareto frontier),并在SLAM系统中通过初始位姿估计阶段的航向校正有效降低了均方根误差(RMSE)。
链接: https://arxiv.org/abs/2602.23115
作者: David Dirnfeld,Fabien Delattre,Pedro Miraldo,Erik Learned-Miller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Robotics (cs.RO)
备注:
Abstract:Estimating camera motion from monocular video is a fundamental problem in computer vision, central to tasks such as SLAM, visual odometry, and structure-from-motion. Existing methods that recover the camera’s heading under known rotation, whether from an IMU or an optimization algorithm, tend to perform well in low-noise, low-outlier conditions, but often decrease in accuracy or become computationally expensive as noise and outlier levels increase. To address these limitations, we propose a novel generalization of the Hough transform on the unit sphere (S(2)) to estimate the camera’s heading. First, the method extracts correspondences between two frames and generates a great circle of directions compatible with each pair of correspondences. Then, by discretizing the unit sphere using a Fibonacci lattice as bin centers, each great circle casts votes for a range of directions, ensuring that features unaffected by noise or dynamic objects vote consistently for the correct motion direction. Experimental results on three datasets demonstrate that the proposed method is on the Pareto frontier of accuracy versus efficiency. Additionally, experiments on SLAM show that the proposed method reduces RMSE by correcting the heading during camera pose initialization.
[CV-39] WARM-CAT: : Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning
【速读】:该论文旨在解决组合零样本学习(Compositional Zero-Shot Learning, CZSL)中因测试时标签空间分布偏移导致的性能下降问题,该偏移源于将已见属性与物体重新组合形成的未见组合。解决方案的关键在于:1)利用无监督数据在文本和视觉模态上累积知识,并在测试时更新多模态原型;2)设计自适应更新权重以灵活控制原型调整程度,从而应对测试阶段的分布变化;3)引入动态优先队列存储高置信度图像,用于历史图像中获取视觉原型,同时通过预热策略初始化队列并利用已见与未见文本原型间的映射生成未见视觉原型;4)通过多模态协同表示学习对齐文本与视觉原型,确保跨模态语义一致性。
链接: https://arxiv.org/abs/2602.23114
作者: Xudong Yan,Songhe Feng,Jiaxin Wang,Xin Su,Yi Jin
机构: Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual prototypes from historical images for inference. Since the model tends to favor compositions already stored in the queue during testing, we warm-start the queue by initializing it with training images for visual prototypes of seen compositions and generating unseen visual prototypes using the mapping learned between seen and unseen textual prototypes. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. To provide a more reliable evaluation for CZSL, we introduce a new benchmark dataset, C-Fashion, and refine the widely used but noisy MIT-States dataset. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. The source code and datasets are available at this https URL .
[CV-40] SpectralMamba-UNet: Frequency-Disentangled State Space Modeling for Texture-Structure Consistent Medical Image Segmentation
【速读】:该论文旨在解决医学图像分割中如何有效建模全局解剖结构与细粒度边界细节的问题。现有状态空间模型(如Vision Mamba)虽能高效捕捉长距离依赖关系,但其一维序列化处理方式削弱了局部空间连续性和高频特征表示能力。解决方案的关键在于提出SpectralMamba-UNet框架,通过频域解耦机制分离结构信息与纹理信息:其核心是频谱分解与建模(SDM)模块,利用离散余弦变换(DCT)将特征分解为低频和高频成分——低频部分由频域Mamba建模全局上下文,高频部分保留边界敏感细节;同时引入频谱通道重加权(SCR)机制实现通道级频率感知注意力,并设计频谱引导融合(SGF)模块在解码器中自适应地进行多尺度融合,从而提升分割精度与泛化能力。
链接: https://arxiv.org/abs/2602.23103
作者: Fuhao Zhang,Lei Liu,Jialin Zhang,Ya-Nan Zhang,Nan Mu
机构: Sichuan Normal University (四川师范大学); Zhejiang University (浙江大学); Ant Group (蚂蚁集团); Guizhou University (贵州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate medical image segmentation requires effective modeling of both global anatomical structures and fine-grained boundary details. Recent state space models (e.g., Vision Mamba) offer efficient long-range dependency modeling. However, their one-dimensional serialization weakens local spatial continuity and high-frequency representation. To this end, we propose SpectralMamba-UNet, a novel frequency-disentangled framework to decouple the learning of structural and textural information in the spectral domain. Our Spectral Decomposition and Modeling (SDM) module applies discrete cosine transform to decompose low- and high-frequency features, where low frequency contributes to global contextual modeling via a frequency-domain Mamba and high frequency preserves boundary-sensitive details. To balance spectral contributions, we introduce a Spectral Channel Reweighting (SCR) mechanism to form channel-wise frequency-aware attention, and a Spectral-Guided Fusion (SGF) module to achieve adaptively multi-scale fusion in the decoder. Experiments on five public benchmarks demonstrate consistent improvements across diverse modalities and segmentation targets, validating the effectiveness and generalizability of our approach.
[CV-41] Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras
【速读】:该论文旨在解决事件相机(event camera)输出稀疏、异步的事件数据难以转化为神经网络可有效利用的密集张量表示的问题。传统的时间表面(time-surface)或直方图方法采用全局固定的衰减参数,在静止区域易保留空间结构,但在快速运动区域则导致边缘模糊,存在时空分辨率的权衡困境。解决方案的关键在于提出局部自适应衰减表面(Locally Adaptive Decay Surfaces, LADS),其核心思想是根据每个像素位置的局部信号动态(如事件速率、拉普拉斯高斯响应或高频频谱能量)自适应调节时间衰减常数,从而在静态区域保持细节清晰,在活跃区域减少模糊。实验表明,LADS在30 Hz和240 Hz频率下均显著提升人脸检测与关键点定位精度,且支持轻量化网络架构实现实时性能,为高频率事件视觉分析树立了新基准。
链接: https://arxiv.org/abs/2602.23101
作者: Paul Kielty,Timothy Hanley,Peter Corcoran
机构: University of Galway (爱尔兰戈尔韦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on the public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44 % normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
[CV-42] Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy
【速读】:该论文旨在解决生物医学领域中图像与语言跨模态耦合数据稀缺的问题,尤其是在显微镜下细胞体染色的脑组织切片分析场景中,缺乏足够的配对图像-文本数据来训练能够实现自然语言描述的视觉-语言模型。解决方案的关键在于提出一种标签媒介(label-mediated)方法:通过已有的解剖学标签(如脑区名称)自动从相关文献中挖掘区域描述文本作为合成标注,从而无需人工标注的图像-文本对即可构建有效的跨模态学习信号。该方法将现有的细胞构筑视觉基础模型(CytoNet)与大语言模型(LLM)通过图像到文本的训练目标进行耦合,实现了在57个脑区上的合理区域级描述生成,并具备开放集识别能力,展现出弱监督条件下连接视觉基础模型与自然语言的有效性。
链接: https://arxiv.org/abs/2602.23088
作者: Matthew Sutton,Katrin Amunts,Timo Dickscheid,Christian Schiffer
机构: Institute of Neuroscience and Medicine (INM-1), Research Centre Jülich, Jülich, Germany; Helmholtz AI, Research Centre Jülich, Jülich, Germany; Cécile Oskar Vogt Institute for Brain Research, University Hospital Düsseldorf, Düsseldorf, Germany; Computer Vision, Institute for Computational Visualistics, University of Koblenz, Koblenz, Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures, submitted for inclusion at a conference
Abstract:Foundation models increasingly offer potential to support interactive, agentic workflows that assist researchers during analysis and interpretation of image data. Such workflows often require coupling vision to language to provide a natural-language interface. However, paired image-text data needed to learn this coupling are scarce and difficult to obtain in many research and clinical settings. One such setting is microscopic analysis of cell-body-stained histological human brain sections, which enables the study of cytoarchitecture: cell density and morphology and their laminar and areal organization. Here, we propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label, without requiring curated paired image-text data. Given the label, we automatically mine area descriptions from related literature and use them as synthetic captions reflecting canonical cytoarchitectonic attributes. An existing cytoarchitectonic vision foundation model (CytoNet) is then coupled to a large language model via an image-to-text training objective, enabling microscopy regions to be described in natural language. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. It matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy and, with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy. These results suggest that weak, label-mediated pairing can suffice to connect existing biomedical vision foundation models to language, providing a practical recipe for integrating natural-language in domains where fine-grained paired annotations are scarce.
[CV-43] Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception
【速读】:该论文旨在解决4D点云视频理解中因4D数据集稀缺导致自监督模型可扩展性受限的问题,以及将3D预训练模型迁移至4D感知任务时面临的过拟合与模态差距(modality gap)两大挑战。解决方案的关键在于提出一种“先对齐再适配”(Align then Adapt, PointATA)范式,将参数高效迁移学习分解为两个阶段:第一阶段利用最优传输理论量化3D与4D数据分布差异,通过点对齐嵌入器(point align embedder)缓解模态差距;第二阶段在冻结的3D骨干网络基础上引入高效的点视频适配器(point-video adapter)和空间上下文编码器(spatial-context encoder),增强时间建模能力以抑制过拟合。该设计使无时序知识的3D预训练模型能在较低参数开销下有效推理动态视频内容,并在多个4D任务上达到或超越全量微调模型性能。
链接: https://arxiv.org/abs/2602.23069
作者: Yiding Sun,Jihua Zhu,Haozhe Cheng,Chaoyi Lu,Zhichuan Yang,Lin Chen,Yaonan Wang
机构: Xi’an Jiaotong University (西安交通大学); Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel “Align then Adapt” (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21 % accuracy on 3D action recognition, +8.7 % on 4 D action segmentation, and 84.06% on 4D semantic segmentation.
[CV-44] GeoWorld: Geometric World Models CVPR2026
【速读】:该论文旨在解决能量基预测世界模型在多步视觉规划中面临的两个核心问题:一是现有方法通常在欧几里得空间(Euclidean space)中学习潜在表示,忽略了状态间的几何结构和层次关系;二是长期预测性能差,导致在长时程轨迹中快速退化。解决方案的关键在于提出GeoWorld,其核心创新是通过一种超球面JEPA(Hyperbolic JEPA)将潜在表示从欧几里得空间映射到双曲流形(hyperbolic manifold),从而保留状态间的几何结构与层次关系,并结合几何强化学习(Geometric Reinforcement Learning)实现基于能量的优化,在双曲潜空间中实现稳定的多步规划。实验表明,该方法在CrossTask和COIN数据集上相较于最先进的V-JEPA 2,在3步和4步规划任务中分别提升了约3%和2%的成功率(Success Rate, SR)。
链接: https://arxiv.org/abs/2602.23058
作者: Zeyu Zhang,Danning Li,Ian Reid,Richard Hartley
机构: ANU; MBZUAI
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to CVPR 2026
Abstract:Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: this https URL.
[CV-45] D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment
【速读】:该论文旨在解决实时实例分割(instance segmentation)中基于Transformer的模型性能不足的问题,尤其是在保持低延迟的同时提升分割精度。其核心解决方案是提出D-FINE-seg,作为D-FINE检测器的扩展架构,关键创新包括:轻量级掩码头(mask head)、引入分割感知训练策略(包含框裁剪二值交叉熵损失和Dice掩码损失)、辅助与去噪掩码监督机制,以及适配匈牙利匹配代价函数。这些改进显著提升了TACO数据集上的F1分数,优于Ultralytics YOLO26在统一TensorRT FP16端到端基准测试下的表现,同时维持了竞争性推理延迟。此外,论文还构建了一个支持ONNX、TensorRT和OpenVINO的端到端训练-导出-优化推理框架,为多平台部署提供标准化流程。
链接: https://arxiv.org/abs/2602.23043
作者: Argo Saakyan,Dmitry Solntsev
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures, 5 tables
Abstract:Transformer-based real-time object detectors achieve strong accuracy-latency trade-offs, and D-FINE is among the top-performing recent architectures. However, real-time instance segmentation with transformers is still less common. We present D-FINE-seg, an instance segmentation extension of D-FINE that adds: a lightweight mask head, segmentation-aware training, including box cropped BCE and dice mask losses, auxiliary and denoising mask supervision, and adapted Hungarian matching cost. On the TACO dataset, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency. Second contribution is an end-to-end pipeline for training, exporting, and optimized inference across ONNX, TensorRT, OpenVINO for both object detection and instance segmentation tasks. This framework is released as open-source under the Apache-2.0 license. GitHub repository - this https URL.
[CV-46] PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
【速读】:该论文旨在解决体积视频(volumetric video)在大规模重建、存储与流媒体传输中面临的挑战,尤其是现有基于高斯点绘(Gaussian Splatting)的方法在长序列下出现的时间不一致性、大运动和遮挡(disocclusion)场景下的失效问题,以及其输出难以适配传统视频编码管道的问题。解决方案的关键在于提出一种名为PackUV的新颖4D高斯表示方法,它将所有高斯属性映射到一系列结构化、多尺度的UV贴图(UV atlas)中,实现图像原生的紧凑存储;并进一步设计PackUV-GS优化方法,在UV域内直接优化高斯参数,结合光流引导的高斯标签划分与关键帧机制,有效识别动态区域、稳定静态背景,并在复杂运动和遮挡条件下保持时间一致性。最终生成的UV贴图格式首次实现了与标准视频编码器(如FFV1)完全兼容且无质量损失,从而支持基于现有多媒体基础设施的高效流媒体传输。
链接: https://arxiv.org/abs/2602.23040
作者: Aashish Rai,Angela Xing,Anushka Agarwal,Xiaoyan Cong,Zekun Li,Tao Lu,Aayush Prakash,Srinath Sridhar
机构: Brown University (布朗大学); UMass Amherst (马萨诸塞大学阿默斯特分校); Meta (Meta)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlas, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., FFV1) without losing quality, enabling efficient streaming within existing multimedia infrastructure. To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view video dataset to date, featuring more than 50 synchronized cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames. Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality. Comments: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.23040 [cs.CV] (or arXiv:2602.23040v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.23040 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: CVPR 2026
[CV-47] Small Object Detection Model with Spatial Laplacian Pyramid Attention and Multi-Scale Features Enhancement in Aerial Images
【速读】:该论文旨在解决航空图像中小目标检测效率低的问题,主要挑战包括目标尺寸小、分布密集且不均匀,导致传统检测算法特征表达能力不足。解决方案的关键在于三个方面:首先,提出空间拉普拉斯金字塔注意力(Spatial Laplacian Pyramid Attention, SLPA)模块,嵌入ResNet-50每个阶段后以增强对重要局部区域的识别与强调;其次,设计多尺度特征增强模块(Multi-Scale Feature Enhancement Module, MSFEM),引入C5层侧连接构建特征金字塔网络(Feature Pyramid Network, FPN),提升语义理解和特征表示能力;最后,采用可变形卷积(deformable convolutions)对FPN中上下层特征进行对齐融合,缓解因特征未对齐导致的表示质量下降问题,从而显著提升模型对小目标的检测性能。
链接: https://arxiv.org/abs/2602.23031
作者: Zhangjian Ji,Huijia Yan,Shaotong Qiao,Kai Feng,Wei Wei
机构: Shanxi University (山西大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Detecting objects in aerial images confronts some significant challenges, including small size, dense and non-uniform distribution of objects over high-resolution images, which makes detection inefficient. Thus, in this paper, we proposed a small object detection algorithm based on a Spatial Laplacian Pyramid Attention and Multi-Scale Feature Enhancement in aerial images. Firstly, in order to improve the feature representation of ResNet-50 on small objects, we presented a novel Spatial Laplacian Pyramid Attention (SLPA) module, which is integrated after each stage of ResNet-50 to identify and emphasize important local regions. Secondly, to enhance the model’s semantic understanding and features representation, we designed a Multi-Scale Feature Enhancement Module (MSFEM), which is incorporated into the lateral connections of C5 layer for building Feature Pyramid Network (FPN). Finally, the features representation quality of traditional feature pyramid network will be affected because the features are not aligned when the upper and lower layers are fused. In order to handle it, we utilized deformable convolutions to align the features in the fusion processing of the upper and lower levels of the Feature Pyramid Network, which can help enhance the model’s ability to detect and recognize small objects. The extensive experimental results on two benchmark datasets: VisDrone and DOTA demonstrate that our improved model performs better for small object detection in aerial images compared to the original algorithm.
[CV-48] WISER: Wider Search Deeper Thinking and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
【速读】:该论文旨在解决零样本组合图像检索(Zero-Shot Composed Image Retrieval, ZS-CIR)任务中,如何在不依赖标注三元组训练数据的前提下,有效融合文本到图像(Text-to-Image, T2I)与图像到图像(Image-to-Image, I2I)两种检索范式的互补优势问题。现有方法通常将多模态查询(参考图像+修改文本)统一转换为单一模态进行检索,导致T2I丢失细粒度视觉细节或I2I难以处理复杂语义变化。解决方案的关键在于提出一个无需训练的WISER框架,其核心是通过“检索-验证-精炼”(retrieve-verify-refine)流水线,显式建模意图感知(intent awareness)与不确定性感知(uncertainty awareness),首先并行生成编辑后的文本和图像以扩大候选池(Wider Search),随后利用验证器动态融合双路径结果:对高置信度检索结果直接融合,对低置信度则触发结构化自我反思引导下一轮深度思考(Deeper Thinking),从而实现跨模态意图的精准匹配与鲁棒性提升。
链接: https://arxiv.org/abs/2602.23029
作者: Tianyue Wang,Leigang Qu,Tianyu Yang,Xiangzhao Hao,Yifan Xu,Haiyun Guo,Jinqiao Wang
机构: SAIS, UCAS; Institute of Automation, CAS; National University of Singapore; Minzu University of China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality-either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a “retrieve-verify-refine” pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at this https URL.
[CV-49] DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis CVPR2026
【速读】:该论文旨在解决传统基于光流(optical flow)的图像对齐方法在面对遮挡(occlusion)和光照变化等挑战时,导致对齐视觉质量下降及下游任务准确性受限的问题。其解决方案的关键在于提出一种基于扩散模型(diffusion-based)的图像对齐框架DMAligner,通过面向对齐的视图合成(alignment-oriented view synthesis)实现生成式对齐。核心创新包括:一是设计动态感知扩散训练策略(Dynamics-aware Diffusion Training),以学习条件图像生成;二是引入动态感知掩码生成模块(Dynamics-aware Mask Producing, DMP),自适应区分动态前景与静态背景,从而提升模型对复杂场景的鲁棒性。该方法从根本上规避了光流估计的局限性,在DSIA数据集及多个视频基准上均展现出优越性能。
链接: https://arxiv.org/abs/2602.23022
作者: Xinglong Luo,Ao Luo,Zhengning Wang,Yueqi Yang,Chaoyu Feng,Lei Lei,Bing Zeng,Shuaicheng Liu
机构: University of Electronic Science and Technology of China (电子科技大学); Southwest Jiaotong University (西南交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Our code is available at this https URL.
[CV-50] SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling CVPR2026
【速读】:该论文旨在解决工业视觉检测中少样本(few-shot)异常检测问题,即在每类正常图像仅提供少量样本的情况下实现高精度异常识别。传统方法通常依赖记忆库、辅助数据集或对视觉-语言模型进行多模态微调,引入了复杂性与额外开销。其解决方案的关键在于利用预训练视觉基础模型(如DINOv2)提取的冻结特征,通过两个简单步骤完成检测:首先从少量正常图像中提取patch级特征,随后使用主成分分析(PCA)建模正常变化的低维子空间;推理时基于该子空间计算重构残差以生成可解释且统计学合理的异常分数。此方法无需训练、提示调优或内存存储,在一shot和few-shot场景下均达到当前最优性能。
链接: https://arxiv.org/abs/2602.23013
作者: Camile Lendering,Erkut Akdag,Egor Bondarev
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026
Abstract:Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method, that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at this https URL.
[CV-51] HELMLAB: An Analytical Data-Driven Color Space for Perceptual Distance in UI Design Systems
【速读】:该论文旨在解决用户界面(UI)设计系统中颜色空间感知一致性不足的问题,现有色差公式如CIEDE2000在跨数据集表现不稳定且难以直接用于设计工具链。其解决方案的关键在于提出一个72参数的解析色彩空间HELMLAB,通过学习得到的矩阵变换、通道独立的幂压缩、傅里叶色调校正以及嵌入式的赫尔姆霍兹-科尔劳施(Helmholtz-Kohlrausch)亮度调整,实现更符合人类视觉感知的颜色表示;同时引入中性色校正和色度平面刚性旋转,确保灰度色精确映射至a=b=0,并提升色相角对齐性而不改变距离度量,从而在COMBVD数据集上将STRESS指标降低至23.22(相较CIEDE2000的29.18下降20.4%),具备良好的跨数据集泛化能力与可逆性(往返误差低于10^-14)。
链接: https://arxiv.org/abs/2602.23010
作者: Gorkem Yildiz
机构: Independent Researcher
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures. Code and demo available at: this https URL
Abstract:We present HELMLAB, a 72-parameter analytical color space for UI design systems. The forward transform maps CIE XYZ to a perceptually-organized Lab representation through learned matrices, per-channel power compression, Fourier hue correction, and embedded Helmholtz-Kohlrausch lightness adjustment. A post-pipeline neutral correction guarantees that achromatic colors map to a=b=0 (chroma 10^-6), and a rigid rotation of the chromatic plane improves hue-angle alignment without affecting the distance metric, which is invariant under isometries. On the COMBVD dataset (3,813 color pairs), HELMLAB achieves a STRESS of 23.22, a 20.4% reduction from CIEDE2000 (29.18). Cross-validation on He et al. 2022 and MacAdam 1974 shows competitive cross-dataset performance. The transform is invertible with round-trip errors below 10^-14. Gamut mapping, design-token export, and dark/light mode adaptation utilities are included for use in web and mobile design systems.
[CV-52] An automatic counting algorithm for the quantification and uncertainty analysis of the number of microglial cells trainable in small and heterogeneous datasets
【速读】:该论文旨在解决在大鼠腰椎脊髓横切面图像中对小胶质细胞(microglial cells)进行精确计数的问题,传统方法依赖人工标注耗时且需大量培训,而经典基于颜色的自动方法仅能提供标记区域和强度信息,无法准确统计细胞数量。解决方案的关键在于摒弃复杂的细胞检测步骤,直接聚焦于计数任务,并提出一种非参数、非线性的核计数器(kernel counter),其优势在于:1)仅需一个超参数即可在小样本数据集上有效训练;2)具备灵活性,可充分表达复杂多样的数据特征;3)提供预测不确定性估计,并能直接处理同一图像由多位专家给出的不同标注结果。通过预处理生成多幅滤波图像以提升特征提取效率,结合该核计数器,在真实与人工数据集上均展现出优异性能。
链接: https://arxiv.org/abs/2602.22974
作者: L. Martino,M. M. Garcia,P. S. Paradas,E. Curbelo
机构: Universitá degli studi di Catania (卡塔尼亚大学); Universidad Rey Juan Carlos (雷伊·胡安·卡洛斯大学); Universidad Carlos III de Madrid (卡洛斯三世大学)
类目: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Machine Learning (stat.ML)
备注:
Abstract:Counting immunopositive cells on biological tissues generally requires either manual annotation or (when available) automatic rough systems, for scanning signal surface and intensity in whole slide imaging. In this work, we tackle the problem of counting microglial cells in lumbar spinal cord cross-sections of rats by omitting cell detection and focusing only on the counting task. Manual cell counting is, however, a time-consuming task and additionally entails extensive personnel training. The classic automatic color-based methods roughly inform about the total labeled area and intensity (protein quantification) but do not specifically provide information on cell number. Since the images to be analyzed have a high resolution but a huge amount of pixels contain just noise or artifacts, we first perform a pre-processing generating several filtered images (providing a tailored, efficient feature extraction). Then, we design an automatic kernel counter that is a non-parametric and non-linear method. The proposed scheme can be easily trained in small datasets since, in its basic version, it relies only on one hyper-parameter. However, being non-parametric and non-linear, the proposed algorithm is flexible enough to express all the information contained in rich and heterogeneous datasets as well (providing the maximum overfit if required). Furthermore, the proposed kernel counter also provides uncertainty estimation of the given prediction, and can directly tackle the case of receiving several expert opinions over the same image. Different numerical experiments with artificial and real datasets show very promising results. Related Matlab code is also provided.
[CV-53] Certified Circuits: Stability Guarantees for Mechanistic Circuits
【速读】:该论文旨在解决现有机制可解释性(mechanistic interpretability)方法中电路发现(circuit discovery)的不稳定性问题,即当前发现的神经网络电路高度依赖于特定的概念数据集(concept dataset),且在分布外(out-of-distribution, OOD)场景下难以迁移,可能捕捉的是数据集特异性伪影而非真实概念表征。解决方案的关键在于提出“认证电路”(Certified Circuits)框架,通过在任意黑盒电路发现算法外部引入随机数据子采样机制,对电路组件的包含决策提供可证明的稳定性保障——即保证其在概念数据集的有界编辑距离扰动下保持不变。该方法自动排除不稳定神经元,从而生成更紧凑、更准确且在OOD场景下依然可靠的电路,使机制解释具备形式化稳定性和更强的概念对齐能力。
链接: https://arxiv.org/abs/2602.22968
作者: Alaa Anani,Tobias Lorenz,Bernt Schiele,Mario Fritz,Jonas Fischer
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
Abstract:Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits - minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts whether they capture concept or dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that circuit component inclusion decisions are invariant to bounded edit-distance perturbations of the concept dataset. Unstable neurons are abstained from, yielding circuits that are more compact and more accurate. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy while using 45% fewer neurons, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code will be released soon!
[CV-54] UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models
【速读】:该论文旨在解决基于视频生成的世界模型在长期场景一致性(long-term content consistency)和精确相机控制(precise camera control)方面的关键挑战。现有方法要么依赖显式三维重建而牺牲了无界场景下的灵活性与细粒度结构表达,要么直接使用先前生成的帧而缺乏明确的空间对应关系,从而限制了可控性和一致性。解决方案的关键在于提出一种名为UCM的新框架,其核心创新是通过时间感知的位置编码扭曲机制(time-aware positional encoding warping mechanism),统一建模长期记忆与精确相机控制;同时设计了一个高效的双流扩散Transformer以提升生成质量并降低计算开销,并结合基于点云渲染的数据增强策略,实现对超过50万段单目视频的有效训练,显著提升了真实世界与合成基准上的长期一致性与高保真视频生成中的相机可控性。
链接: https://arxiv.org/abs/2602.22960
作者: Tianxing Xu,Zixuan Wang,Guangyuan Wang,Li Hu,Zhongyi Zhang,Peng Zhang,Bang Zhang,Song-Hai Zhang
机构: Tsinghua University (清华大学); Alibaba (阿里巴巴); Tongyi Lab (通义实验室); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.
[CV-55] Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study
【速读】:该论文旨在解决医学影像中视觉特征高度混淆的疾病区分问题,尤其是在零样本(zero-shot)条件下难以准确诊断的场景,例如黑色素瘤与非典型痣、肺水肿与肺炎之间的鉴别。此类疾病的临床管理差异显著,但因其外观相似性导致传统方法难以可靠区分。解决方案的关键在于提出一种基于对比判决(contrastive adjudication)的多智能体框架,通过多个代理(agent)协同推理并相互校验,提升诊断准确性与可解释性,实验结果显示在皮肤镜数据上准确率提升11个百分点,并减少不支持的诊断声明,尽管整体性能仍不足以直接应用于临床部署。
链接: https://arxiv.org/abs/2602.22959
作者: Zihao Zhao,Frederik Hauke,Juliana De Castilhos,Sven Nebelung,Daniel Truhn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code available at this https URL
Abstract:The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.
[CV-56] MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis
【速读】:该论文旨在解决当前脑肿瘤诊断模型在多模态医学影像理解中缺乏临床可解释性推理能力的问题,尤其针对现有公开数据集在标注丰富度和诊断语义层面的不足。其关键解决方案在于构建了一个大规模、多模态的基准数据集MM-NeuroOnco,包含24,726张MRI切片及约20万条语义增强的多模态指令,覆盖多种肿瘤亚型和成像模式;同时设计了一种多模型协同的自动化医疗信息补全与质量控制流程,以克服诊断语义标注稀缺和高成本问题;此外,进一步开发了人工标注的评估基准MM-NeuroOnco-Bench,采用拒绝感知设置减少封闭式问答带来的偏差。实验表明,基于该数据集微调的NeuroOnco-GPT模型在诊断类问题上相比最强基线(Gemini 3 Flash)实现了27%的绝对准确率提升,验证了所提方案对推动临床导向的多模态诊断推理的有效性。
链接: https://arxiv.org/abs/2602.22955
作者: Feng Guo,Jiaxiang Liu,Yang Li,Qianqian Shi,Mingkun Xu
机构: Guangdong Institute of Intelligence Science and Technology (广东智能科学与技术研究院); Center for Brain-Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University (清华大学精密仪器系脑启发计算研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, consisting of 24,726 MRI slices from 20 data sources paired with approximately 200,000 semantically enriched multimodal instructions spanning diverse tumor subtypes and imaging modalities. To mitigate the scarcity and high cost of diagnostic semantic annotations, we develop a multi-model collaborative pipeline for automated medical information completion and quality control, enabling the generation of diagnosis-related semantics beyond mask-only annotations. Building upon this dataset, we further construct MM-NeuroOnco-Bench, a manually annotated evaluation benchmark with a rejection-aware setting to reduce biases inherent in closed-ended question formats. Evaluation across ten representative models shows that even the strongest baseline, Gemini 3 Flash, achieves only 41.88% accuracy on diagnosis-related questions, highlighting the substantial challenges of multimodal brain tumor diagnostic understanding. Leveraging MM-NeuroOnco, we further propose NeuroOnco-GPT, which achieves a 27% absolute accuracy improvement on diagnostic questions following fine-tuning. This result demonstrates the effectiveness of our dataset and benchmark in advancing clinically grounded multimodal diagnostic reasoning. Code and dataset are publicly available at: this https URL
[CV-57] OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis CVPR2026
【速读】:该论文旨在解决手指拼写(fingerspelling)识别中的三大挑战:签名手歧义问题(signing-hand ambiguity)、训练损失函数设计不当导致的峰值行为问题(peaky behavior problem),以及词汇外(out-of-vocabulary, OOV)词识别困难。其核心解决方案包括:提出一种支持单手与双手输入的多手感知识别器,通过双层位置编码(dual-level positional encoding)和签名手聚焦损失(signing-hand focus, SF loss)实现隐式签名手检测;摒弃传统的连接主义时序分类(CTC)损失,引入单调对齐损失(monotonic alignment, MA loss)以通过交叉注意力正则化确保输出字母序列遵循输入姿态序列的时间顺序;同时设计帧级字母条件生成器(frame-wise letter-conditioned generator),用于合成OOV词的真实手指拼写姿态序列,从而构建新的合成基准FSNeo。这些方法共同提升了识别准确率并有效应对了实际应用中的关键瓶颈。
链接: https://arxiv.org/abs/2602.22949
作者: Junuk Cha,Jihyeon Kim,Han-Mu Park
机构: KAIST; KT; KETI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that enforces the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. Codes and data are available in: this https URL.
[CV-58] oProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization ICLR2026
【速读】:该论文旨在解决视觉自回归(Visual Autoregressive, VAR)模型在生成后期阶段存在的效率瓶颈问题。传统方法如FastVAR和SkipVAR依赖启发式跳过策略,难以在保持语义保真度的前提下实现高效加速。本文提出了一种基于注意力熵(attention entropy)的新优化框架,通过量化不同维度下模型参数的动态变化,精确识别出在不同标记粒度、语义范围和生成规模下的稀疏模式,从而在token、层和尺度三个关键维度上设计细粒度优化策略。其核心创新在于以注意力熵为分析工具,取代了以往依赖经验规则的方法,实现了在显著加速生成过程的同时最小化质量损失,实验表明该方法在Infinity-2B和Infinity-8B模型上可实现最高达3.4倍的加速比。
链接: https://arxiv.org/abs/2602.22948
作者: Jiayu Chen,Ruoyu Lin,Zihao Zheng,Jingxin Li,Maoliang Li,Guojie Luo,Xiang chen
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ToProVAR is honored to be accepted by ICLR 2026
Abstract:Visual Autoregressive(VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover sparsity patterns along three critical dimensions-token, layer, and scale-and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the generation process while significantly preserving semantic fidelity and fine details, outperforming traditional methods in both efficiency and quality. Experiments on Infinity-2B and Infinity-8B models demonstrate that ToProVAR achieves up to 3.4x acceleration with minimal quality loss, effectively mitigating the issues found in prior work. Our code will be made publicly available.
[CV-59] Cross-Task Benchmarking of CNN Architectures
【速读】:该论文旨在解决传统卷积神经网络(Convolutional Neural Networks, CNNs)在处理多任务、多模态数据时特征表示能力有限、计算效率不高以及跨任务泛化性能不足的问题。其解决方案的关键在于引入动态卷积机制(dynamic convolution)与注意力机制(attention mechanisms),通过自适应地调整卷积核权重来增强模型对复杂空间模式的响应能力,尤其在形态复杂的图像中表现优异;同时,基于ResNet-18架构的五种变体对比实验表明,软注意力机制(包括局部像素级和全局图像级特征注意力)及全向卷积(omni-directional CNN, ODConv)能够显著提升模型在图像分类、分割和时间序列分析等任务上的准确率、计算效率和跨任务迁移能力。
链接: https://arxiv.org/abs/2602.22945
作者: Kamal Sherawat,Vikrant Bhati
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This project provides a comparative study of dynamic convolutional neural networks (CNNs) for various tasks, including image classification, segmentation, and time series analysis. Based on the ResNet-18 architecture, we compare five variants of CNNs: the vanilla CNN, the hard attention-based CNN, the soft attention-based CNN with local (pixel-wise) and global (image-wise) feature attention, and the omni-directional CNN (ODConv). Experiments on Tiny ImageNet, Pascal VOC, and the UCR Time Series Classification Archive illustrate that attention mechanisms and dynamic convolution methods consistently exceed conventional CNNs in accuracy, efficiency, and computational performance. ODConv was especially effective on morphologically complex images by being able to dynamically adjust to varying spatial patterns. Dynamic CNNs enhanced feature representation and cross-task generalization through adaptive kernel modulation. This project provides perspectives on advanced CNN design architecture for multiplexed data modalities and indicates promising directions in neural network engineering.
[CV-60] Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings
【速读】:该论文旨在解决划艇竞速项目中缺乏高精度、自动化性能分析工具的问题,尤其是在无法使用GPS设备的情况下,如何通过视频数据重建运动员的速度和划频(stroke rate)等关键性能指标。其解决方案的关键在于:首先利用YOLOv8模型检测浮标和运动员,结合已知的浮标网格估计图像到真实世界之间的单应性变换(homography);其次,采用基于U-net的船头校准方法学习特定船只的运动员偏移量,从而更准确地定位船体位置;再次,引入基于光流的鲁棒跟踪策略以适应多运动员划艇类型;最后,通过姿态估计或运动员边界框直接提取划频信息。该方法在精英赛事GPS数据上的验证结果表明,速度和划频的相对均方根误差(RRMSE)分别达到0.020 ± 0.011和0.022 ± 0.024,相关系数分别为0.956和0.932,实现了无需船上传感器或人工标注的自动化、高精度性能反馈。
链接: https://arxiv.org/abs/2602.22941
作者: Julian Ziegler,Daniel Matthes,Finn Gerdts,Patrick Frenzel,Torsten Warnke,Matthias Englert,Tina Koevari,Mirco Fuchs
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalized the estimation of the boat position by means of learning a boat-specific athlete offset using a U-net based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity RRMSE of 0.020 ± 0.011 (rho = 0.956) and a stroke rate RRMSE of 0.022 ± 0.024 (rho = 0.932). The methods provide coaches with highly accurate, automated feedback without requiring on-boat sensors or manual annotation.
[CV-61] pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation
【速读】:该论文旨在解决当前参数高效微调(Parameter-efficient fine-tuning)方法在视觉适应任务中通常仅依赖单一预训练模型知识、忽视多领域知识协同效应的问题。其解决方案的关键在于提出一种新型的专家混合提示微调方法(pMoE),通过引入专家专用提示标记(expert-specific prompt tokens)和可学习的调度器(learnable dispatcher),在不同提示层动态分配提示 token,从而实现多个领域专家知识的有效融合与优化利用,显著提升模型在多样化任务中的泛化能力与适应效率。
链接: https://arxiv.org/abs/2602.22938
作者: Shentong Mo,Xufang Luo,Dongsheng Li
机构: Carnegie Mellon University (卡内基梅隆大学); Microsoft Research (微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Parameter-efficient fine-tuning has demonstrated promising results across various visual adaptation tasks, such as classification and segmentation. Typically, prompt tuning techniques have harnessed knowledge from a single pre-trained model, whether from a general or a specialized medical domain. However, this approach typically overlooks the potential synergies that could arise from integrating diverse domain knowledge within the same tuning process. In this work, we propose a novel Mixture-of-Experts prompt tuning method called pMoE, which leverages the strengths of multiple expert domains through expert-specialized prompt tokens and the learnable dispatcher, effectively combining their expertise in a unified model framework. Our pMoE introduces expert-specific prompt tokens and utilizes a dynamic token dispatching mechanism at various prompt layers to optimize the contribution of each domain expert during the adaptation phase. By incorporating both domain knowledge from diverse experts, the proposed pMoE significantly enhances the model’s versatility and applicability to a broad spectrum of tasks. We conduct extensive experiments across 47 adaptation tasks, including both classification and segmentation in general and medical domains. The results demonstrate that our pMoE not only achieves superior performance with a large margin of improvements but also offers an optimal trade-off between computational efficiency and adaptation effectiveness compared to existing methods.
[CV-62] MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding CVPR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长视频理解任务中效率低下的问题。其核心挑战在于如何从冗长的视频流中高效提取关键信息以支持准确问答。解决方案的关键在于提出一种联合进化框架——MLLM-Sampler Joint Evolution (MSJoE),该框架通过协同优化MLLM与一个轻量级关键帧采样器,实现对视频内容的智能筛选与理解。具体而言,MSJoE首先生成多个描述不同视觉视角的问题查询,利用冻结的CLIP模型计算查询与帧之间的相似性矩阵,并基于此矩阵由轻量采样器预测关键帧权重,从而选择紧凑且信息丰富的帧集供MLLM推理。整个过程通过强化学习进行端到端联合训练,使查询推理、帧采样和关键帧理解模块实现协同适应,显著提升了长视频问答的准确性与效率。
链接: https://arxiv.org/abs/2602.22932
作者: Wenhui Tan,Xiaoyi Yu,Jiaze Li,Yijing Chen,Jianzhong Ju,Zhenbo Luo,Ruihua Song,Jian Luan
机构: Gaoling School of Artificial Intelligence, Renmin University of China; Tongji University; MiLM Plus, Xiaomi Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question to a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves 8.0% accuracy gain upon the base MLLM, and 1.1% higher accuracy than strongest baseline method.
[CV-63] WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents
【速读】:该论文旨在解决自主水面航行器(Autonomous Surface Vessels, ASVs)在复杂动态水域环境中,从被动感知(如目标检测与分割)向知识驱动、交互式环境认知跃迁的瓶颈问题。当前ASV在面对多变光照、天气等现实因素时,难以实现符合航行规则且可解释的决策推理。解决方案的关键在于提出WaterVideoQA基准和NaviMind多智能体神经符号系统:前者构建了涵盖六类水道场景的3029段视频问答数据集,用于系统性评估ASV的认知能力;后者通过自适应语义路由(Adaptive Semantic Routing)、情境感知分层推理(Situation-Aware Hierarchical Reasoning)与自主自我验证(Autonomous Self-Reflective Verification)三大机制,使ASV能够完成从视觉感知到合规、可解释决策的跃升,从而推动智能航海向可信交互新范式演进。
链接: https://arxiv.org/abs/2602.22923
作者: Runwei Guan,Shaofeng Liang,Ningwei Ouyang,Weichen Fei,Shanliang Yao,Wei Dai,Chenhao Ge,Penglei Sun,Xiaohui Zhu,Tao Huang,Ryan Wen Liu,Hui Xiong
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Wuhan University of Technology (武汉理工大学); Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); Nanjing University (南京大学); Yancheng Institute of Technology (盐城工学院); Stanford University (斯坦福大学); James Cook University (詹姆斯库克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 11 pages,8 figures
Abstract:While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly transcends existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.
[CV-64] OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality
【速读】:该论文旨在解决铁路智能感知系统中因高质量标注数据稀缺而导致的安全关键任务(如障碍物检测)性能受限的问题,尤其针对现有光栅化渲染器存在“仿真到现实”(sim-to-real)差距以及简单图像掩蔽技术缺乏时空一致性的问题。解决方案的关键在于提出了一种多模态增强现实(Augmented Reality, AR)框架,通过整合基于虚幻引擎5(Unreal Engine 5)的物理逼真虚拟对象至真实铁路场景(来自OSDaR23数据集),并利用LiDAR点云与惯性导航系统/全球导航卫星系统(INS/GNSS)数据实现对象在RGB帧间的精确空间定位与时间稳定性;此外,论文还提出一种基于分割的INS/GNSS数据精修策略,显著提升了增强序列的真实性,最终构建了公开可用的OSDaR-AR数据集,以支持下一代铁路感知系统的开发。
链接: https://arxiv.org/abs/2602.22920
作者: Federico Nesti,Gianluca D’Amico,Mauro Marinoni,Giorgio Buttazzo
机构: Scuola Superiore Sant’Anna (圣安娜高等研究院); Simulatrix MV srl
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Although deep learning has significantly advanced the perception capabilities of intelligent transportation systems, railway applications continue to suffer from a scarcity of high-quality, annotated data for safety-critical tasks like obstacle detection. While photorealistic simulators offer a solution, they often struggle with the ``sim-to-real" gap; conversely, simple image-masking techniques lack the spatio-temporal coherence required to obtain augmented single- and multi-frame scenes with the correct appearance and dimensions. This paper introduces a multi-modal augmented reality framework designed to bridge this gap by integrating photorealistic virtual objects into real-world railway sequences from the OSDaR23 dataset. Utilizing Unreal Engine 5 features, our pipeline leverages LiDAR point-clouds and INS/GNSS data to ensure accurate object placement and temporal stability across RGB frames. This paper also proposes a segmentation-based refinement strategy for INS/GNSS data to significantly improve the realism of the augmented sequences, as confirmed by the comparative study presented in the paper. Carefully designed augmented sequences are collected to produce OSDaR-AR, a public dataset designed to support the development of next-generation railway perception systems. The dataset is available at the following page: this https URL
[CV-65] Chain of Flow: A Foundational Generative Framework for ECG-to-4D Cardiac Digital Twins
【速读】:该论文旨在解决当前心脏数字孪生(Cardiac Digital Twin, CDT)框架普遍存在的局限性问题,即现有方法多为任务特定的预测模型,缺乏对个体化心脏解剖结构与生理功能的完整建模能力,难以支持多样化下游仿真任务。其解决方案的关键在于提出Chain of Flow (COF)——一个以心电图(ECG)驱动的生成式基础框架,通过融合心脏磁共振成像(cine-CMR)与12导联ECG信号,在训练阶段学习统一的心脏几何形态、电生理特性与运动动力学表示,从而实现从单个心动周期中重建完整的4D心脏结构与动态运动。此方法使CDT从窄域预测模型转变为可操作、患者特异的虚拟心脏,显著提升其在容积测量、区域功能分析及虚拟电影合成等下游任务中的通用性和实用性。
链接: https://arxiv.org/abs/2602.22919
作者: Haofan Wu,Nay Aung,Theodoros N. Arvanitis,Joao A. C. Lima,Steffen E. Petersen,Le Zhang
机构: University of Birmingham (伯明翰大学); Queen Mary University London (伦敦玛丽女王大学); Johns Hopkins University School of Medicine (约翰霍普金斯大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures. Submitted to IEEE Transactions on Medical Imaging (TMI). Code will be released after review
Abstract:A clinically actionable Cardiac Digital Twin (CDT) should reconstruct individualised cardiac anatomy and physiology, update its internal state from multimodal signals, and enable a broad range of downstream simulations beyond isolated tasks. However, existing CDT frameworks remain limited to task-specific predictors rather than building a patient-specific, manipulable virtual heart. In this work, we introduce Chain of Flow (COF), a foundational ECG-driven generative framework that reconstructs full 4D cardiac structure and motion from a single cardiac cycle. The method integrates cine-CMR and 12-lead ECG during training to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics. We evaluate Chain of Flow on diverse cohorts and demonstrate accurate recovery of cardiac anatomy, chamber-wise function, and dynamic motion patterns. The reconstructed 4D hearts further support downstream CDT tasks such as volumetry, regional function analysis, and virtual cine synthesis. By enabling full 4D organ reconstruction directly from ECG, COF transforms cardiac digital twins from narrow predictive models into fully generative, patient-specific virtual hearts. Code will be released after review.
[CV-66] owards Multimodal Domain Generalization with Few Labels CVPR2026
【速读】:该论文旨在解决半监督多模态域泛化(Semi-Supervised Multimodal Domain Generalization, SSMDG)问题,即在仅有少量标注样本的情况下,从多源数据中学习具有鲁棒性的多模态模型,同时应对未见域中的分布偏移。现有方法存在明显局限:多模态域泛化方法无法利用未标注数据,半监督多模态学习忽略域差异,而半监督域泛化方法仅适用于单模态输入。为克服这些不足,作者提出一个统一框架,其关键创新在于三个核心组件:基于一致性的伪标签生成机制(Consensus-Driven Consistency Regularization),通过融合双模态预测的一致性获取可靠伪标签;争议感知正则化(Disagreement-Aware Regularization),有效利用不一致的模糊样本以提升模型判别力;跨模态原型对齐(Cross-Modal Prototype Alignment),通过跨模态翻译实现域不变和模态不变表示,增强缺失模态下的鲁棒性。
链接: https://arxiv.org/abs/2602.22917
作者: Hongzhao Li,Hao Dong,Hualei Wan,Shupan Li,Mingliang Xu,Muhammad Haris Khan
机构: Zhengzhou University (郑州大学); ETH Zürich (苏黎世联邦理工学院); MBZUAI (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code are available at this https URL.
[CV-67] SO3UFormer: Learning Intrinsic Spherical Features for Rotation-Robust Panoramic Segmentation
【速读】:该论文旨在解决全景语义分割模型在面对任意3D旋转时性能显著下降的问题,特别是由于标准球面Transformer对全球纬度线索的过拟合导致其在非重力对齐场景下失效。解决方案的关键在于提出SO3UFormer架构,其核心创新包括:(1)基于内在特征表述,通过移除绝对纬度编码来解耦表示与重力向量;(2)采用一致采样密度的球面注意力机制以应对非均匀采样问题;(3)引入感知规范的相对位置机制,利用切平面投影角度和离散规范池化编码局部角几何信息,避免依赖全局坐标轴。这些设计共同提升了模型对任意SO(3)旋转的鲁棒性。
链接: https://arxiv.org/abs/2602.22867
作者: Qinfeng Zhu,Yunxi Jiang,Lei Fan
机构: XJTLU(西交利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Panoramic semantic segmentation models are typically trained under a strict gravity-aligned assumption. However, real-world captures often deviate from this canonical orientation due to unconstrained camera motions, such as the rotational jitter of handheld devices or the dynamic attitude shifts of aerial platforms. This discrepancy causes standard spherical Transformers to overfit global latitude cues, leading to performance collapse under 3D reorientations. To address this, we introduce SO3UFormer, a rotation-robust architecture designed to learn intrinsic spherical features that are less sensitive to the underlying coordinate frame. Our approach rests on three geometric pillars: (1) an intrinsic feature formulation that decouples the representation from the gravity vector by removing absolute latitude encoding; (2) quadrature-consistent spherical attention that accounts for non-uniform sampling densities; and (3) a gauge-aware relative positional mechanism that encodes local angular geometry using tangent-plane projected angles and discrete gauge pooling, avoiding reliance on global axes. We further use index-based spherical resampling together with a logit-level SO(3)-consistency regularizer during training. To rigorously benchmark robustness, we introduce Pose35, a dataset variant of Stanford2D3D perturbed by random rotations within \pm 35^\circ . Under the extreme test of arbitrary full SO(3) rotations, existing SOTAs fail catastrophically: the baseline SphereUFormer drops from 67.53 mIoU to 25.26 mIoU. In contrast, SO3UFormer demonstrates remarkable stability, achieving 72.03 mIoU on Pose35 and retaining 70.67 mIoU under full SO(3) rotations.
[CV-68] GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion CVPR2026
【速读】:该论文旨在解决基于模仿学习(imitation learning)的机器人操作策略在抓取任务中普遍存在抓取精度低、空间泛化能力弱以及对象泛化性能差的问题。其核心解决方案在于将抓取先验知识(grasp prior knowledge)融入扩散策略框架(diffusion policy framework),关键创新点包括:1)采用潜在扩散策略(latent diffusion policy)通过抓取位姿先验引导动作块解码,确保生成的动作轨迹紧密贴合可行的抓取配置;2)在扩散过程中引入自监督重建目标,通过在每个逆向扩散步骤中重构从中间表示反投影得到的抓取度(graspness)图像,从而嵌入抓取特性先验,显著提升抓取的精确性和泛化能力。
链接: https://arxiv.org/abs/2602.22862
作者: Enda Xiang,Haoxiang Ma,Xinzhu Ma,Zicheng Liu,Di Huang
机构: Beihang University (北京航空航天大学); School of Computer Science and Engineering, Beihang University (北京航空航天大学计算机学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:This paper focuses on enhancing the grasping precision and generalization of manipulation policies learned via imitation learning. Diffusion-based policy learning methods have recently become the mainstream approach for robotic manipulation tasks. As grasping is a critical subtask in manipulation, the ability of imitation-learned policies to execute precise and generalizable grasps merits particular attention. Existing imitation learning techniques for grasping often suffer from imprecise grasp executions, limited spatial generalization, and poor object generalization. To address these challenges, we incorporate grasp prior knowledge into the diffusion policy framework. In particular, we employ a latent diffusion policy to guide action chunk decoding with grasp pose prior, ensuring that generated motion trajectories adhere closely to feasible grasp configurations. Furthermore, we introduce a self-supervised reconstruction objective during diffusion to embed the graspness prior: at each reverse diffusion step, we reconstruct wrist-camera images back-projected the graspness from the intermediate representations. Both simulation and real robot experiments demonstrate that our approach significantly outperforms baseline methods and exhibits strong dynamic grasping capabilities.
[CV-69] From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在训练过程中依赖静态数据和固定策略所带来的局限性,特别是难以诊断能力盲区以及无法实现动态、针对性强化的问题。其核心解决方案是提出一种诊断驱动的渐进式进化框架(Diagnostic-driven Progressive Evolution, DPE),该框架通过一个螺旋迭代循环实现:首先由多个智能体对海量未标注多模态数据进行标注与质量控制,并利用网络搜索、图像编辑等工具生成多样化且真实的样本;其次,DPE能将失败归因于特定薄弱环节,动态调整数据混合比例,并引导智能体生成聚焦于这些弱点的数据以实施精准强化。实验表明,该方法在Qwen3-VL-8B-Instruct和Qwen2.5-VL-7B-Instruct模型上持续稳定提升,在11个基准测试中均取得进步,验证了DPE作为开放任务分布下可扩展的持续训练范式潜力。
链接: https://arxiv.org/abs/2602.22859
作者: Hongrui Jia,Chaoya Jiang,Shikun Zhang,Wei Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test driven error exposure and feedback based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at this https URL.
[CV-70] A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling
【速读】:该论文旨在解决医学影像基础模型预训练中因盲目扩大数据规模所引发的两大问题:一是大规模医疗数据集普遍存在的冗余性和严重类别不平衡,导致表征学习偏向于高频模式;二是忽略数据质量异质性所带来的计算效率低下。解决方案的关键在于引入一种主动且有原则的数据筛选机制——CheXficient,该模型仅使用1,235,004张胸片(CXR)中的22.7%样本进行预训练,同时消耗不到27.3%的计算资源,却在20个跨任务基准测试中表现相当或优于全量数据训练的模型,尤其通过优先选择低频样本提升了对长尾或罕见病的泛化能力。
链接: https://arxiv.org/abs/2602.22843
作者: Chong Wang,Yabin Zhang,Yunhe Gao,Maya Varma,Clemence Mottez,Faidra Patsatzi,Jiaming Liu,Jin Long,Jean-Benoit Delbrouck,Sergios Gatidis,Akshay S. Chaudhari,Curtis P. Langlotz
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a “scale-at-all-costs” paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieving comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.
[CV-71] Reflectance Multispectral Imaging for Soil Composition Estimation and USDA Texture Classification
【速读】:该论文旨在解决土壤质地(soil texture)传统实验室测定方法耗时、劳动强度大,而现有传感技术要么成本过高,要么空间分辨率不足,难以在田间尺度上实现常规部署的问题。其解决方案的关键在于提出了一种基于多光谱成像(multispectral imaging, MSI)与机器学习相结合的低成本、可现场部署的土壤质地预测框架:利用自研的覆盖365–940 nm波段的13个光谱通道MSI设备获取土壤光谱特性,通过回归模型估算黏粒、粉粒和砂粒含量,并采用直接分类器预测美国农业部(USDA)定义的12类土壤质地;同时通过USDA土壤质地三角图间接映射成分结果至质地类别。实验表明,该方法在成分预测中R²达0.99,在质地分类中准确率超过99%,验证了该方案在地质工程筛查和精准农业中的可行性与高精度优势。
链接: https://arxiv.org/abs/2602.22829
作者: G.A.S.L Ranasinghe,J.A.S.T. Jayakody,M.C.L. De Silva,G. Thilakarathne,G.M.R.I. Godaliyadda,H.M.V.R. Herath,M.P.B. Ekanayake,S.K. Navaratnarajah
机构: University of Peradeniya (佩拉德尼亚大学); University of Ruhuna (鲁胡纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Under Review at IEEE Access. 17 pages, 15 figures
Abstract:Soil texture is a foundational attribute that governs water availability and erosion in agriculture, as well as load bearing capacity, deformation response, and shrink-swell risk in geotechnical engineering. Yet texture is still typically determined by slow and labour intensive laboratory particle size tests, while many sensing alternatives are either costly or too coarse to support routine field scale deployment. This paper proposes a robust and field deployable multispectral imaging (MSI) system and machine learning framework for predicting soil composition and the United States Department of Agriculture (USDA) texture classes. The proposed system uses a cost effective in-house MSI device operating from 365 nm to 940 nm to capture thirteen spectral bands, which effectively capture the spectral properties of soil texture. Regression models use the captured spectral properties to estimate clay, silt, and sand percentages, while a direct classifier predicts one of the twelve USDA textural classes. Indirect classification is obtained by mapping the regressed compositions to texture classes via the USDA soil texture triangle. The framework is evaluated on mixture data by mixing clay, silt, and sand in varying proportions, using the USDA classification triangle as a basis. Experimental results show that the proposed approach achieves a coefficient of determination R^2 up to 0.99 for composition prediction and over 99% accuracy for texture classification. These findings indicate that MSI combined with data-driven modeling can provide accurate, non-destructive, and field deployable soil texture characterization suitable for geotechnical screening and precision agriculture.
[CV-72] CMSA-Net: Causal Multi-scale Aggregation with Adaptive Multi-source Reference for Video Polyp Segmentation
【速读】:该论文旨在解决视频息肉分割(Video Polyp Segmentation, VPS)中的两大挑战:一是息肉与周围黏膜在视觉上相似,导致语义区分度弱;二是息肉在视频帧间位置和尺度变化大,影响分割的稳定性与准确性。解决方案的关键在于提出一种名为CMSA-Net的鲁棒框架,其核心创新包括:1)引入因果多尺度聚合(Causal Multi-scale Aggregation, CMA)模块,通过因果注意力机制从多个历史帧中跨尺度聚合语义信息,确保时间特征传播遵循严格时序,提升特征可靠性;2)设计动态多源参考(Dynamic Multi-source Reference, DMR)策略,基于语义可分性和预测置信度自适应选择可靠参考帧,实现高效且强指导性的多帧信息融合,从而在保证实时性的同时显著提升分割精度。
链接: https://arxiv.org/abs/2602.22821
作者: Tong Wang,Yaolei Qi,Siwen Wang,Imran Razzak,Guanyu Yang,Yutong Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video polyp segmentation (VPS) is an important task in computer-aided colonoscopy, as it helps doctors accurately locate and track polyps during examinations. However, VPS remains challenging because polyps often look similar to surrounding mucosa, leading to weak semantic discrimination. In addition, large changes in polyp position and scale across video frames make stable and accurate segmentation difficult. To address these challenges, we propose a robust VPS framework named CMSA-Net. The proposed network introduces a Causal Multi-scale Aggregation (CMA) module to effectively gather semantic information from multiple historical frames at different scales. By using causal attention, CMA ensures that temporal feature propagation follows strict time order, which helps reduce noise and improve feature reliability. Furthermore, we design a Dynamic Multi-source Reference (DMR) strategy that adaptively selects informative and reliable reference frames based on semantic separability and prediction confidence. This strategy provides strong multi-frame guidance while keeping the model efficient for real-time inference. Extensive experiments on the SUN-SEG dataset demonstrate that CMSA-Net achieves state-of-the-art performance, offering a favorable balance between segmentation accuracy and real-time clinical applicability.
[CV-73] Face Time Traveller : Travel Through Ages Without Losing Identity CVPR2026
【速读】:该论文旨在解决人脸老化(Face Aging)任务中长期存在的挑战,即在宽范围年龄变换下难以同时保持身份一致性与视觉真实感的问题。现有方法通常依赖数值化的年龄表示,忽视了生物因素(如遗传)与环境因素(如生活方式)的交互作用;此外,扩散模型中存在的静态注意力机制和优化密集的反演过程限制了其适应性、细粒度控制能力及背景一致性。为应对这些局限,作者提出了一种基于扩散模型的框架 Face Time Traveller (FaceTT),其核心创新在于:1)引入面向面部属性的提示优化策略(Face-Attribute-Aware Prompt Refinement),显式编码内在(生物)与外在(环境)老化线索以实现上下文感知条件生成;2)设计无需微调的角度反演方法(Tuning-Free Angular Inversion),高效地将真实人脸映射至扩散潜在空间以提升重建速度与精度;3)提出自适应注意力控制机制(Adaptive Attention Control),动态平衡跨注意力(用于语义老化特征提取)与自注意力(用于结构与身份保留),从而实现高保真、身份一致的年龄转换效果。
链接: https://arxiv.org/abs/2602.22819
作者: Purbayan Kar,Ayush Ghadiya,Vishal Chudasama,Pankaj Wasnik,C.V. Jawahar
机构: Sony Research India(索尼研究印度); IIIT Hyderabad(印度国际信息技术学院海得拉巴)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 (Findings Track)
Abstract:Face aging, an ill-posed problem shaped by environmental and genetic factors, is vital in entertainment, forensics, and digital archiving, where realistic age transformations must preserve both identity and visual realism. However, existing works relying on numerical age representations overlook the interplay of biological and contextual cues. Despite progress in recent face aging models, they struggle with identity preservation in wide age transformations, also static attention and optimization-heavy inversion in diffusion limit adaptability, fine-grained control and background consistency. To address these challenges, we propose Face Time Traveller (FaceTT), a diffusion-based framework that achieves high-fidelity, identity-consistent age transformation. Here, we introduce a Face-Attribute-Aware Prompt Refinement strategy that encodes intrinsic (biological) and extrinsic (environmental) aging cues for context-aware conditioning. A tuning-free Angular Inversion method is proposed that efficiently maps real faces into the diffusion latent space for fast and accurate reconstruction. Moreover, an Adaptive Attention Control mechanism is introduced that dynamically balances cross-attention for semantic aging cues and self-attention for structural and identity preservation. Extensive experiments on benchmark datasets and in-the-wild testset demonstrate that FaceTT achieves superior identity retention, background preservation and aging realism over state-of-the-art (SOTA) methods.
[CV-74] PhotoAgent : Agent ic Photo Editing with Exploratory Visual Aesthetic Planning
【速读】:该论文旨在解决当前基于指令的图像编辑方法中,用户需手动设计和排序编辑步骤的问题,从而导致编辑质量高度依赖于用户的任务分解能力。为实现自主图像编辑,论文提出 PhotoAgent 系统,其关键在于将自主图像编辑建模为一个长周期决策问题:通过显式美学规划(explicit aesthetic planning),利用树搜索(tree search)推理用户美学意图并规划多步编辑动作,并结合记忆机制与视觉反馈进行闭环执行与迭代优化,无需用户逐步骤提示即可完成高质量图像编辑。
链接: https://arxiv.org/abs/2602.22809
作者: Mingde Yao,Zhiyuan You,Tam-King Man,Menglu Wang,Tianfan Xue
机构: The Chinese University of Hong Kong (香港中文大学); Shanghai AI Laboratory (上海人工智能实验室); University of Science and Technology of China (中国科学技术大学); Institute of Science Tokyo (东京科学研究所); CPII under InnoHK (InnoHK创新香港研发平台的CPII); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: A fully automated, intelligent photo-editing agent that autonomously plans multi-step aesthetic enhancements, smartly chooses diverse editing tools, and enables everyday users to achieve professional-looking results without crafting complex prompts. Project page: this https URL
Abstract:With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is this https URL.
[CV-75] GSTurb: Gaussian Splatting for Atmospheric Turbulence Mitigation
【速读】:该论文旨在解决大气湍流导致的图像退化问题,特别是长距离成像中因像素位移(tilt)和非各向同性模糊(non-isoplanatic blur)引起的显著质量下降。解决方案的关键在于提出了一种名为GSTurb的新框架,其核心创新是将光流引导的倾斜校正与高斯点绘(Gaussian splatting)相结合:前者用于建模和校正由湍流引起的像素位移,后者用于高效表示和优化非各向同性模糊;通过在多帧图像上联合优化高斯参数来同时估计并补偿tilt和blur,从而实现更高质量的图像恢复。
链接: https://arxiv.org/abs/2602.22800
作者: Hanliang Du,Zhangji Lu,Zewei Cai,Qijian Tang,Qifeng Yu,Xiaoli Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Atmospheric turbulence causes significant image degradation due to pixel displacement (tilt) and blur, particularly in long-range imaging applications. In this paper, we propose a novel framework for atmospheric turbulence mitigation, GSTurb, which integrates optical flow-guided tilt correction and Gaussian splatting for modeling non-isoplanatic blur. The framework employs Gaussian parameters to represent tilt and blur, and optimizes them across multiple frames to enhance restoration. Experimental results on the ATSyn-static dataset demonstrate the effectiveness of our method, achieving a peak PSNR of 27.67 dB and SSIM of 0.8735. Compared to the state-of-the-art method, GSTurb improves PSNR by 1.3 dB (a 4.5% increase) and SSIM by 0.048 (a 5.8% increase). Additionally, on real datasets, including the TSRWGAN Real-World and CLEAR datasets, GSTurb outperforms existing methods, showing significant improvements in both qualitative and quantitative performance. These results highlight that combining optical flow-guided tilt correction with Gaussian splatting effectively enhances image restoration under both synthetic and real-world turbulence conditions. The code for this method will be available at this https URL.
[CV-76] Robust Human Trajectory Prediction via Self-Supervised Skeleton Representation Learning
【速读】:该论文旨在解决真实场景中由于遮挡导致人体骨骼数据缺失对轨迹预测精度造成显著下降的问题。解决方案的关键在于引入一种基于掩码自编码(masked autoencoding)预训练的自监督骨骼表征模型,通过学习更鲁棒的骨骼特征表示,提升模型在存在缺失关节数据时的轨迹预测性能,同时在无缺失或轻度缺失情况下保持原有预测精度。
链接: https://arxiv.org/abs/2602.22791
作者: Taishu Arashima,Hiroshi Kera,Kazuhiko Kawamoto
机构: Chiba University (千叶大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages main, 5 pages supplementary material
Abstract:Human trajectory prediction plays a crucial role in applications such as autonomous navigation and video surveillance. While recent works have explored the integration of human skeleton sequences to complement trajectory information, skeleton data in real-world environments often suffer from missing joints caused by occlusions. These disturbances significantly degrade prediction accuracy, indicating the need for more robust skeleton representations. We propose a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. Experimental results in occlusion-prone scenarios show that our method improves robustness to missing skeletal data without sacrificing prediction accuracy, and consistently outperforms baseline models in clean-to-moderate missingness regimes.
[CV-77] SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation ICLR2026
【速读】:该论文旨在解决开放世界场景中3D结构生成的问题,即现有方法虽能生成部件级别的3D物体,却难以将这些部件组织为具有明确实例区分的完整场景。其核心问题在于模型内部分配机制缺乏结构约束,导致部件间出现混杂和碎片化。解决方案的关键在于提出SceneTransporter框架,通过引入去偏置聚类探针揭示问题本质,并将任务重新建模为全局相关性分配问题;具体地,在组合式DiT(Diffusion Transformer)模型的去噪循环中,构建并求解一个熵正则化的最优传输(Optimal Transport, OT)目标,从而施加两种关键结构约束:一是运输方案通过门控交叉注意力实现图像块到部件级3D潜变量的一对一专属路由,避免信息纠缠;二是运输过程的竞争特性促使相似图像块聚类形成连贯对象,同时由基于边缘的成本项进一步正则化,防止对象碎片化。此方法显著提升了实例级一致性与几何保真度。
链接: https://arxiv.org/abs/2602.22785
作者: Ling Wang,Hao-Xiang Guo,Xinzhou Wang,Fuchun Sun,Kai Sun,Pengkun Liu,Hang Xiao,Zhong Wang,Guangyuan Fu,Eric Li,Yang Liu,Yikai Wang
机构: Xi’an Research Institute of Hi-Tech (西安高技术研究院); Tsinghua University (清华大学); SkyWork AI (天工智能); Tongji University (同济大学); Fudan University (复旦大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: published at iclr 2026
Abstract:We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model’s internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at this https URL.
[CV-78] rajTok: Learning Trajectory Tokens enables better Video Understanding CVPR2026
【速读】:该论文旨在解决视频模型中因patchification导致的token冗余问题,该问题严重限制了视频处理的效率与可扩展性。现有基于轨迹的分词方法虽能解耦视频时长与token数量,但依赖复杂的外部分割和跟踪流水线,存在速度慢且任务无关的问题。其解决方案的关键在于提出TrajTok——一个端到端训练的视频分词模块,能够与下游视频模型联合优化,并根据语义复杂度动态调整token粒度,无需依赖视频时长。TrajTok通过统一的隐式时空聚类机制,在单次前向传播中直接生成物体轨迹,兼顾轻量高效与性能提升,显著改善视频理解效果。
链接: https://arxiv.org/abs/2602.22779
作者: Chenhao Zheng,Jieyu Zhang,Jianing Zhang,Weikai Huang,Ashutosh Kumar,Quan Kong,Oncel Tuzel,Chun-Liang Li,Ranjay Krishna
机构: University of Washington (华盛顿大学); Allen Institute for Artificial Intelligence (艾伦人工智能研究所); Apple (苹果公司); Woven by Toyota, Inc (丰田织物公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.
[CV-79] Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval
【速读】:该论文旨在解决图像真实性研究中长期被忽视的问题——即在深度伪造(deepfake)检测与定位之外,如何从后处理(post-hoc)和生成过程中(in-generation)的水印范式中恢复篡改内容以支持事实检索(factual retrieval)。其解决方案的关键在于提出了一种统一的隐码恢复框架(unified hidden-code recovery framework),通过多尺度向量量化(multi-scale vector quantization)将语义与感知信息编码为紧凑的隐码表示,并利用条件Transformer模块增强上下文推理能力,从而实现高效的内容检索与重建。该方法在构建的ImageNet-S基准上验证了其在多种水印流程下的兼容性与优越性能,为超越检测与定位的通用图像恢复奠定了基础。
链接: https://arxiv.org/abs/2602.22759
作者: Yuan-Chih Chen,Chun-Shien Lu
机构: Model call failure
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in image authenticity have primarily focused on deepfake detection and localization, leaving recovery of tampered contents for factual retrieval relatively underexplored. We propose a unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. Our method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization, and enhances contextual reasoning via conditional Transformer modules. To enable systematic evaluation for natural images, we construct ImageNet-S, a benchmark that provides paired image-label factual retrieval tasks. Extensive experiments on ImageNet-S demonstrate that our method exhibits promising retrieval and reconstruction performance while remaining fully compatible with diverse watermarking pipelines. This framework establishes a foundation for general-purpose image recovery beyond detection and localization.
[CV-80] SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation
【速读】:该论文旨在解决当前文本到视频(Text-to-Video, T2V)生成模型在生成过程中往往忽视文本提示中指定的动态空间关系(Dynamic Spatial Relationships, DSR)的问题,导致生成视频在空间一致性上表现不佳。解决方案的关键在于提出一种自提升框架 SPATIALALIGN,其核心是采用零阶正则化直接偏好优化(zeroth-order regularized Direct Preference Optimization, DPO)对 T2V 模型进行微调,并设计了一个基于几何的量化评估指标 DSR-SCORE,用于精确衡量生成视频与文本提示中 DSR 的对齐程度,从而替代以往依赖视觉语言模型(Vision-Language Model, VLM)的主观评价方式,显著提升了模型在动态空间关系建模上的能力。
链接: https://arxiv.org/abs/2602.22745
作者: Fengming Liu,Tat-Jen Cham,Chuanxia Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most text-to-video (T2V) generators prioritize aesthetic quality, but often ignoring the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models capabilities to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the specified DSRs in prompts, which is a step forward from prior works that rely on VLM for evaluation. We also conduct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly out performs the baseline in spatial relationships. The code will be released in Link.
[CV-81] ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
【速读】:该论文旨在解决生成式人体运动时实现精确空间控制的问题,现有方法通常依赖特定任务的训练或缓慢优化,且硬性约束常破坏动作自然性。解决方案的关键在于提出ProjFlow——一种无需训练的采样器,通过引入一种基于骨骼拓扑结构的运动学感知度量(kinematics-aware metric),使约束修正能协同分布于整个骨骼系统中,从而避免传统投影方法导致的不自然伪影;此外,针对稀疏输入(如关键帧间长间隔补全),还设计了随时间变化的伪观测机制,使约束在采样过程中渐进弱化,有效提升运动连贯性与真实性。
链接: https://arxiv.org/abs/2602.22742
作者: Akihisa Watanabe,Qing Yu,Edgar Simo-Serra,Kent Fujiwara
机构: Waseda University (早稻田大学); LY Corporation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on representative applications, motion inpainting, and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.
[CV-82] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation ICLR2026
【速读】:该论文旨在解决指代表达图像分割(Referring Image Segmentation, RIS)任务中因视觉-语言对齐不充分导致的分割精度不足问题。其核心解决方案是提出了一种对齐感知的掩码学习策略(Alignment-Aware Masked Learning, AML),通过显式估计像素级视觉-语言对齐程度,在优化过程中过滤掉低置信度的对齐区域,从而聚焦于可靠的跨模态线索,显著提升了模型在RefCOCO数据集上的性能,并增强了对多样化描述和复杂场景的鲁棒性。
链接: https://arxiv.org/abs/2602.22740
作者: Tongfei Chen,Shuo Yang,Yuguang Yang,Linlin Yang,Runtang Guo,Changbai Li,He Long,Chunyu Xie,Dawei Leng,Baochang Zhang
机构: Beihang University (北京航空航天大学); Qihoo 360 (奇虎360); Communication University of China (中国传媒大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICLR 2026 conference paper
Abstract:Referring Image Segmentation (RIS) aims to segment an object in an image identified by a natural language expression. The paper introduces Alignment-Aware Masked Learning (AML), a training strategy to enhance RIS by explicitly estimating pixel-level vision-language alignment, filtering out poorly aligned regions during optimization, and focusing on trustworthy cues. This approach results in state-of-the-art performance on RefCOCO datasets and also enhances robustness to diverse descriptions and scenarios
[CV-83] Asymmetric Idiosyncrasies in Multimodal Models
【速读】:该论文旨在解决生成式 AI 中文本到图像(text-to-image)模型对原始文本描述(caption)风格特征的保留能力不足的问题,即不同 caption 模型所携带的独特语义与风格差异在生成图像中未能有效传递。其解决方案的关键在于构建一个基于分类的系统性分析框架:通过训练神经网络来预测给定图像或文本源自哪个 caption 模型,从而量化 caption 模型的风格指纹(stylistic idiosyncrasies)以及 text-to-image 系统的 prompt-following 能力。实验表明,文本分类准确率达 99.70%,而图像分类准确率仅约 50%,说明图像难以保留 caption 中的关键细节差异(如细节层次、颜色纹理强调及物体分布),揭示了当前 text-to-image 模型在忠实还原文本语义特征方面的局限性。
链接: https://arxiv.org/abs/2602.22734
作者: Muzi Tao,Chufan Shi,Huijuan Wang,Shengbang Tong,Xuezhe Ma
机构: University of Southern California (南加州大学); New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:In this work, we study idiosyncrasies in the caption models and their downstream impact on text-to-image models. We design a systematic analysis: given either a generated caption or the corresponding image, we train neural networks to predict the originating caption model. Our results show that text classification yields very high accuracy (99.70%), indicating that captioning models embed distinctive stylistic signatures. In contrast, these signatures largely disappear in the generated images, with classification accuracy dropping to at most 50% even for the state-of-the-art Flux model. To better understand this cross-modal discrepancy, we further analyze the data and find that the generated images fail to preserve key variations present in captions, such as differences in the level of detail, emphasis on color and texture, and the distribution of objects within a scene. Overall, our classification-based framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.
[CV-84] Sapling-NeRF: Geo-Localised Sapling Reconstruction in Forests for Ecological Monitoring
【速读】:该论文旨在解决现有三维传感技术在森林幼苗(saplings)精细结构特征量化评估中的局限性问题,特别是传统地面激光扫描(TLS)、移动激光扫描(MLS)或摄影测量方法难以重建细枝、密集叶幕且缺乏长期监测所需的尺度一致性。其解决方案的关键在于提出一种融合神经辐射场(NeRF)、LiDAR同步定位与建图(SLAM)及全球导航卫星系统(GNSS)的多层级三维重建流程:首先利用GNSS实现粗粒度地球坐标系定位,继而通过LiDAR SLAM获得厘米级精度局部定位与重建,最终基于NeRF实现个体幼苗为中心的高密度、对象感知重建。该方案实现了可重复、地理精确化的生态监测能力,显著提升了幼苗茎高、分枝模式和叶木比等关键指标的测量精度,为森林动态分析提供了更丰富的结构化定量数据。
链接: https://arxiv.org/abs/2602.22731
作者: Miguel Ángel Muñoz-Bañón,Nived Chebrolu,Sruthi M. Krishna Moorthy,Yifu Tao,Fernando Torres,Roberto Salguero-Gómez,Maurice Fallon
机构: Oxford Robotics Institute, Department of Engineering Science, University of Oxford, Oxford, UK; Group of Automation, Robotics and Computer Vision, University of Alicante, Alicante, Spain; Department of Biology, University of Oxford, Oxford, UK
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Saplings are key indicators of forest regeneration and overall forest health. However, their fine-scale architectural traits are difficult to capture with existing 3D sensing methods, which make quantitative evaluation difficult. Terrestrial Laser Scanners (TLS), Mobile Laser Scanners (MLS), or traditional photogrammetry approaches poorly reconstruct thin branches, dense foliage, and lack the scale consistency needed for long-term monitoring. Implicit 3D reconstruction methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) are promising alternatives, but cannot recover the true scale of a scene and lack any means to be accurately geo-localised. In this paper, we present a pipeline which fuses NeRF, LiDAR SLAM, and GNSS to enable repeatable, geo-localised ecological monitoring of saplings. Our system proposes a three-level representation: (i) coarse Earth-frame localisation using GNSS, (ii) LiDAR-based SLAM for centimetre-accurate localisation and reconstruction, and (iii) NeRF-derived object-centric dense reconstruction of individual saplings. This approach enables repeatable quantitative evaluation and long-term monitoring of sapling traits. Our experiments in forest plots in Wytham Woods (Oxford, UK) and Evo (Finland) show that stem height, branching patterns, and leaf-to-wood ratios can be captured with increased accuracy as compared to TLS. We demonstrate that accurate stem skeletons and leaf distributions can be measured for saplings with heights between 0.5m and 2m in situ, giving ecologists access to richer structural and quantitative data for analysing forest dynamics.
[CV-85] HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models CVPR2026
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的图像幻觉(object hallucination)问题,即模型在生成文本时错误地引入不存在于输入图像中的对象或属性,从而影响其可靠部署。现有方法通常难以在效率与准确性之间取得平衡,要么依赖昂贵的参考模型和多次前向传播,要么采用静态编辑策略,可能误删真实的视觉证据。本文提出的解决方案是HulluEdit,其核心创新在于正交子空间编辑(orthogonal subspace editing):通过将模型隐藏状态分解为三个正交子空间——视觉证据、冲突先验(conflicting priors)和残差不确定性,实现对幻觉模式的选择性抑制,同时确保视觉成分不受干扰。该方法在数学上保证了仅对先验子空间进行编辑不会影响视觉信息的完整性,从而在多个基准测试(如POPE和CHAIR)中实现了最优的幻觉减少效果,并保持了模型的通用能力(如MME基准)和高效推理性能。
链接: https://arxiv.org/abs/2602.22727
作者: Yangguang Lin,Quan Fang,Yufei Li,Jiachen Sun,Junyu Gao,Jitao Sang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted at CVPR 2026
Abstract:Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces - visual evidence, conflicting priors, and residual uncertainty - enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.
[CV-86] IRSDE-Despeckle: A Physics-Grounded Diffusion Model for Generalizable Ultrasound Despeckling
【速读】:该论文旨在解决超声成像中斑点噪声(speckle)及其相关伪影对图像质量的负面影响,这些问题会降低诊断准确性并阻碍临床解读。解决方案的关键在于提出一种基于扩散模型(diffusion-based)的去斑点方法,构建于图像恢复随机微分方程(Image Restoration Stochastic Differential Equations)框架之上;该方法通过在MATLAB UltraSound Toolbox中模拟从无斑点磁共振图像生成超声图像的方式,构建大规模配对训练数据集以实现监督学习,并在保持解剖学有意义边缘和对比度的同时重建去斑点图像。实验表明,该方法在模拟测试集上优于传统滤波器及近期基于学习的去斑点基线模型,同时利用跨模型方差量化预测不确定性,发现高不确定性区域与高重建误差相关,从而为识别困难或易失败区域提供实用指标。
链接: https://arxiv.org/abs/2602.22717
作者: Shuoqi Chen,Yujia Wu,Geoffrey P. Luke
机构: Thayer School of Engineering, Dartmouth College (达特茅斯学院工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages main text + 6 pages appendix, 7 figures main + 3 figures appendix, 3 tables main + 1 table appendix. Preprint
Abstract:Ultrasound imaging is widely used for real-time, noninvasive diagnosis, but speckle and related artifacts reduce image quality and can hinder interpretation. We present a diffusion-based ultrasound despeckling method built on the Image Restoration Stochastic Differential Equations framework. To enable supervised training, we curate large paired datasets by simulating ultrasound images from speckle-free magnetic resonance images using the Matlab UltraSound Toolbox. The proposed model reconstructs speckle-suppressed images while preserving anatomically meaningful edges and contrast. On a held-out simulated test set, our approach consistently outperforms classical filters and recent learning-based despeckling baselines. We quantify prediction uncertainty via cross-model variance and show that higher uncertainty correlates with higher reconstruction error, providing a practical indicator of difficult or failure-prone regions. Finally, we evaluate sensitivity to simulation probe settings and observe domain shift, motivating diversified training and adaptation for robust clinical deployment.
[CV-87] SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs CVPR2026
【速读】:该论文旨在解决3D大视觉语言模型(3D Large Vision-Language Models, 3D LVLMs)在处理点云数据时因继承自大语言模型(Large Language Models, LLMs)的位置编码机制——旋转位置嵌入(Rotary Position Embedding, RoPE)——所导致的空间结构失真与方向感知不足的问题。RoPE在编码3D token时无法保持点云数据的三维空间结构,且其相对距离计算忽略了角度依赖性,从而削弱了模型对视觉表征中方向变化的捕捉能力。解决方案的关键在于提出球坐标位置嵌入(Spherical Coordinate-based Positional Embedding, SoPE),将点云token索引映射至三维球坐标空间,统一建模空间位置与方向角信息,从而保留点云数据的几何结构、增强空间感知能力,并生成更一致且丰富的几何表示;此外,引入多尺度频率混合策略以融合不同频域特征,进一步提升模型性能。
链接: https://arxiv.org/abs/2602.22716
作者: Guanting Ye,Qiyan Zhao,Wenhao Yu,Liangyu Yuan,Mingkai Li,Xiaofeng Zhang,Jianmin Ji,Yanyong Zhang,Qing Jiang,Ka-Veng Yuen
机构: University of Macau(澳门大学); University of Science and Technology of China(中国科学技术大学); Shanghai Jiaotong University(上海交通大学); Hefei University of Technology(合肥工业大学); National University of Singapore(新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026
Abstract:3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model’s ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
[CV-88] UFO-DETR: Frequency-Guided End-to-End Detector for UAV Tiny Objects
【速读】:该论文旨在解决无人机(UAV)影像中小型目标检测面临的挑战,包括尺度变化大、目标密集分布以及小目标占主导地位等问题。现有算法依赖人工设计组件,通用目标检测器未针对UAV图像优化,难以在精度与复杂度之间取得平衡。解决方案的关键在于提出一种端到端的目标检测框架UFO-DETR,其核心创新包括:基于LSKNet的主干网络以优化感受野并减少参数量;引入DAttention和AIFI模块以灵活建模多尺度空间关系,提升多尺度目标检测性能;以及设计DynFreq-C3模块,通过跨空间频率特征增强机制显著强化小目标检测能力。实验表明,该方法在检测性能和计算效率上均优于RT-DETR-L,为无人机边缘计算提供了高效解决方案。
链接: https://arxiv.org/abs/2602.22712
作者: Yuankai Chen,Kai Lin,Qihong Wu,Xinxuan Yang,Jiashuo Lai,Ruoen Chen,Haonan Shi,Minfan He,Meihua Wang
机构: South China Agricultural University (华南农业大学); Southeast University (东南大学); Foshan University (佛山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 6 figures, published to 2026 International Conference on Computer Supported Cooperative Work in Design
Abstract:Small target detection in UAV imagery faces significant challenges such as scale variations, dense distribution, and the dominance of small targets. Existing algorithms rely on manually designed components, and general-purpose detectors are not optimized for UAV images, making it difficult to balance accuracy and complexity. To address these challenges, this paper proposes an end-to-end object detection framework, UFO-DETR, which integrates an LSKNet-based backbone network to optimize the receptive field and reduce the number of parameters. By combining the DAttention and AIFI modules, the model flexibly models multi-scale spatial relationships, improving multi-scale target detection performance. Additionally, the DynFreq-C3 module is proposed to enhance small target detection capability through cross-space frequency feature enhancement. Experimental results show that, compared to RT-DETR-L, the proposed method offers significant advantages in both detection performance and computational efficiency, providing an efficient solution for UAV edge computing.
[CV-89] GFRRN: Explore the Gaps in Single Image Reflection Removal CVPR26
【速读】:该论文旨在解决单图像去反射(Single Image Reflection Removal, SIRR)任务中两大关键问题:一是预训练模型特征与去反射模型特征之间的语义理解鸿沟,二是合成数据与真实世界数据在反射标签上的不一致性。解决方案的核心在于提出一种无间隙去反射网络(Gap-Free Reflection Removal Network, GFRRN),其关键技术包括:(1) 采用参数高效微调(Parameter Efficient Fine-Tuning, PEFT)策略,通过引入可学习的Mona层对齐训练方向;(2) 设计标签生成器统一合成与真实数据的反射标签;(3) 提出基于高斯的自适应频率学习模块(Gaussian-based Adaptive Frequency Learning Block, G-AFLB)以自适应融合频域先验信息;(4) 引入动态代理注意力机制(Dynamic Agent Attention, DAA)替代传统的窗口注意力,动态建模窗口间(inter-)与窗口内(intra-)的重要性分布。这些组件协同作用,显著提升了SIRR方法在复杂场景下的性能表现。
链接: https://arxiv.org/abs/2602.22695
作者: Yu Chen,Zewei He,Xingyu Liu,Zixuan Chen,Zheming Lu
机构: Zhejiang University (浙江大学); Huanjiang Laboratory (环江实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR26
Abstract:Prior dual-stream methods with the feature interaction mechanism have achieved remarkable performance in single image reflection removal (SIRR). However, they often struggle with (1) semantic understanding gap between the features of pre-trained models and those of reflection removal models, and (2) reflection label inconsistencies between synthetic and real-world training data. In this work, we first adopt the parameter efficient fine-tuning (PEFT) strategy by integrating several learnable Mona layers into the pre-trained model to align the training directions. Then, a label generator is designed to unify the reflection labels for both synthetic and real-world data. In addition, a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) is proposed to adaptively learn and fuse the frequency priors, and a Dynamic Agent Attention (DAA) is employed as an alternative to window-based attention by dynamically modeling the significance levels across windows (inter-) and within an individual window (intra-). These components constitute our proposed Gap-Free Reflection Removal Network (GFRRN). Extensive experiments demonstrate the effectiveness of our GFRRN, achieving superior performance against state-of-the-art SIRR methods.
[CV-90] No Caption No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings ICLR2026
【速读】:该论文旨在解决潜在扩散模型(Latent Diffusion Models)在文本到图像生成中因训练数据记忆(memorization)引发的隐私与知识产权问题,尤其是当缺乏真实标注文本(ground-truth captions)时,如何有效实施成员推断攻击(Membership Inference Attacks, MIAs)。传统MIA方法依赖于精确的文本条件输入,但在实际场景中常仅能获取图像本身,而其对应的文本描述不可得,导致现有方法失效。解决方案的关键在于提出MoFit框架——通过两阶段优化构建合成条件输入:首先利用对目标模型生成流形(generative manifold)的拟合,在无条件先验空间中生成一个“模型适配”的代理样本(surrogate),其次基于该代理提取出一个“模型适配嵌入”(model-fitted embedding),将其作为不匹配条件用于查询图像,从而放大成员样本的条件损失响应,增强分类可分性,实现无需真实文本即可高效执行MIA。
链接: https://arxiv.org/abs/2602.22689
作者: Joonsung Jeon,Woo Jae Kim,Suhyeon Ha,Sooel Son,Sung-Eui Yoon
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted to ICLR 2026
Abstract:Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model’s generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model’s unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.
[CV-91] SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
【速读】:该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在智能眼镜场景下进行视觉问答(Visual Question Answering, VQA)时存在的性能不足问题,其核心挑战在于:现有模型多基于传统多模态数据集训练与评估,缺乏真实智能眼镜使用场景中的多样性与复杂性,尤其是对象识别准确性对后续外部知识检索的决定性影响未被充分考虑。为填补这一空白,作者提出了SUPERGLASSES——首个完全由智能眼镜设备采集的真实世界VQA基准数据集,包含2,422个第一人称视角图像-问题对及完整搜索轨迹和推理标注。在此基础上,进一步提出SUPERLENS,一种面向智能眼镜的多模态代理系统,其关键创新在于融合自动目标检测、查询解耦与多模态网络搜索,实现增强型答案生成,最终在SUPERGLASSES上达到优于GPT-4o 2.19%的性能,验证了任务特定设计对提升智能眼镜VQA效果的重要性。
链接: https://arxiv.org/abs/2602.22683
作者: Zhuohang Jiang,Xu Yuan,Haohao Qu,Shanru Lin,Kanglong Liu,Wenqi Fan,Qing Li
机构: The Hong Kong Polytechnic University (香港理工大学); City University of Hong Kong (城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.
[CV-92] ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport
【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models)在低资源语言(如越南语)中图像-文本检索性能不佳的问题。现有模型多针对高资源语言优化,难以有效处理越南语等语言的跨模态对齐与语义一致性。解决方案的关键在于提出ViCLIP-OT框架,其核心创新是将CLIP风格的对比学习与相似性图正则化的最优传输(Similarity-Graph Regularized Optimal Transport, SIGROT)损失相结合,从而增强跨模态全局一致性并缓解模态间隙问题。实验表明,该方法在三个越南语基准数据集上均显著优于CLIP和SigLIP基线模型,尤其在零样本场景下提升明显,验证了SIGROT机制在低资源语言跨模态检索中的有效性与可扩展性。
链接: https://arxiv.org/abs/2602.22678
作者: Quoc-Khang Tran,Minh-Thien Nguyen,Nguyen-Khang Pham
机构: Ho Chi Minh City University of Technology (胡志明市科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint submitted to Expert Systems with Applications
Abstract:Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for highresource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UITOpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIPOT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
[CV-93] SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling
【速读】:该论文旨在解决水下目标检测中因光衰减、颜色失真、背景杂乱及目标尺度小等问题导致的检测精度低和鲁棒性差的挑战。其解决方案的关键在于提出一种融合多尺度特征增强与全局上下文建模的新网络结构SPMamba-YOLO:首先引入空间金字塔池化增强的层聚合网络(SPPELAN)模块以强化多尺度特征聚合并扩展感受野;其次采用金字塔分割注意力(PSA)机制提升特征判别力,抑制背景干扰;最后嵌入基于Mamba的状态空间建模模块,高效捕获长距离依赖关系与全局上下文信息,从而显著提升复杂水下环境中目标检测的准确性与稳定性。
链接: https://arxiv.org/abs/2602.22674
作者: Guanghao Liao,Zhen Liu,Liyuan Cao,Yonghui Yang,Qi Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 10 figures, 6 tables. This paper presents SPMamba-YOLO, an underwater object detection framework integrating multi-scale feature enhancement and global context modeling. The work is under review
Abstract:Underwater object detection is a critical yet challenging research problem owing to severe light attenuation, color distortion, background clutter, and the small scale of underwater targets. To address these challenges, we propose SPMamba-YOLO, a novel underwater object detection network that integrates multi-scale feature enhancement with global context modeling. Specifically, a Spatial Pyramid Pooling Enhanced Layer Aggregation Network (SPPELAN) module is introduced to strengthen multi-scale feature aggregation and expand the receptive field, while a Pyramid Split Attention (PSA) mechanism enhances feature discrimination by emphasizing informative regions and suppressing background interference. In addition, a Mamba-based state space modeling module is incorporated to efficiently capture long-range dependencies and global contextual information, thereby improving detection robustness in complex underwater environments. Extensive experiments on the URPC2022 dataset demonstrate that SPMamba-YOLO outperforms the YOLOv8n baseline by more than 4.9% in mAP@0.5, particularly for small and densely distributed underwater objects, while maintaining a favorable balance between detection accuracy and computational cost.
[CV-94] Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes CVPR2026
【速读】:该论文旨在解决开放词汇场景下3D占据表示(open-vocabulary 3D occupancy)在室内复杂环境中的建模难题,尤其是针对几何密度高、布局复杂且语义细粒度强的特点,现有基于室外驾驶场景的方法难以有效迁移。解决方案的关键在于提出一种仅依赖二值占据标签(occupied vs free)的几何监督范式,并构建以3D语言嵌入高斯(3D Language-Embedded Gaussians)为统一中间表征的框架:一方面设计了基于泊松(Poisson-based)的透明度感知聚合方法,稳定了弱监督下的体积聚合过程;另一方面引入渐进式温度衰减(Progressive Temperature Decay)策略,在splating过程中逐步 sharpen 高斯的不透明度,增强高斯与语言嵌入之间的对齐效果,从而实现更精确的开放词汇语义占据预测。
链接: https://arxiv.org/abs/2602.22667
作者: Changqing Zhou,Yueru Luo,Han Zhang,Zeyu Jiang,Changhao Chen
机构: The Hong Kong University of Science and Technology (Guangzhou); The Chinese University of Hong Kong, Shenzhen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at this https URL.
[CV-95] ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals
【速读】:该论文旨在解决复杂多部件物体在重建为高保真数字孪生(digital twin)过程中,现有自监督方法对初始部件分割高度敏感、易陷入局部最优解的问题。其解决方案的关键在于提出一种名为ArtPro的新颖自监督框架,通过引入**自适应融合运动假设(adaptive integration of mobility proposals)**实现鲁棒重建:首先基于几何特征和运动先验进行过分割初始化,生成具有合理运动假设的部件提案;随后在优化过程中,依据空间邻域间的运动一致性动态合并提案,并结合碰撞感知的运动剪枝机制避免错误的运动学估计,从而显著提升复杂多部件物体重建的准确性和稳定性。
链接: https://arxiv.org/abs/2602.22666
作者: Xuelu Li,Zhaonan Wang,Xiaogang Wang,Lei Wu,Manyi Li,Changhe Tu
机构: Shandong University (山东大学); Southwest University (西南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects. To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while a collision-aware motion pruning mechanism prevents erroneous kinematic estimation. Extensive experiments on both synthetic and real-world objects demonstrate that ArtPro achieves robust reconstruction of complex multi-part objects, significantly outperforming existing methods in accuracy and stability.
[CV-96] Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing ICASSP2026
【速读】:该论文旨在解决现有音视频质量评估(Audio-visual Quality Assessment, AVQA)数据集在规模、内容多样性及标注粒度上的局限性问题,这些不足限制了模型开发和多模态感知机制的研究进展。其解决方案的关键在于提出一种实用的AVQA数据集构建方法:首先设计基于众包的主观实验框架,突破实验室环境限制,实现跨场景可靠标注;其次采用系统性的数据准备策略,确保质量等级与语义场景的广泛覆盖;最后扩展额外标注信息,支持多模态感知机制及其与内容关联性的研究。通过该方法构建的YT-NTU-AVQ数据集成为迄今规模最大、最多样化的AVQA数据集,包含1620个用户生成的音视频序列。
链接: https://arxiv.org/abs/2602.22659
作者: Renyu Yang,Jian Jin,Lili Meng,Meiqin Liu,Yilin Wang,Balu Adsumilli,Weisi Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted to ICASSP 2026. 5 pages (main paper) + 8 pages (supplementary material)
Abstract:Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA, breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is further employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at this https URL
[CV-97] Denoising as Path Planning : Training-Free Acceleration of Diffusion Models with DPCache CVPR2026
【速读】:该论文旨在解决扩散模型(Diffusion Models)在实际部署中因多步迭代采样带来的高计算开销问题。现有基于缓存(caching-based)的加速方法虽无需训练且效果良好,但依赖固定或局部自适应的时间步调度策略,未考虑去噪轨迹的全局结构,易导致误差累积和视觉伪影。其解决方案的关键在于提出一种名为 DPCache 的全新无训练加速框架,将采样加速建模为全局路径规划问题:通过小规模校准集构建路径感知代价张量(Path-Aware Cost Tensor),量化跳过时间步时对前一关键时间步的路径依赖误差;进而利用动态规划算法选取最优关键时间步序列,在最小化总路径代价的同时保持轨迹保真度;推理阶段仅在关键时间步执行完整计算,其余中间输出则基于缓存特征高效预测,从而实现显著加速与高质量生成的平衡。
链接: https://arxiv.org/abs/2602.22654
作者: Bowen Cui,Yuanbin Wang,Huajiang Xu,Biaolong Chen,Aixi Zhang,Hao Jiang,Zhengzheng Jin,Xu Liu,Pipei Huang
机构: Alibaba Group(阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by + 0.031 ImageReward at 4.87 \times speedup and even surpassing the full-step baseline by + 0.028 ImageReward at 3.54 \times speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at this https URL.
[CV-98] Interactive Medical-SAM2 GUI: A Napari-based semi-automatic annotation tool for medical images
【速读】:该论文旨在解决3D医学图像中手动标注效率低、成本高,且现有工具缺乏统一、面向队列的交互式标注工作流的问题。核心挑战在于如何在保持体素级标注精度的同时,实现跨切片的高效传播与交互修正,并支持多研究数据的本地化处理与定量输出。解决方案的关键在于构建一个以Napari为基础的本地优先(local-first)图形用户界面(GUI),将SAM2的传播机制扩展至3D体积(通过将其视为切片序列),并集成框/点提示(box/point prompting)与医学专用模型Medical-SAM2,从而实现从稀疏提示出发的自动掩膜传播;同时提供逐病例导航、对象初始化(包括首/末切片引导)、点提示修正、以及基于SimpleITK的几何保真导出功能,形成端到端的3D标注流程。
链接: https://arxiv.org/abs/2602.22649
作者: Woojae Hong,Jong Ha Hwang,Jiyong Chung,Joongyeon Choi,Hyunngun Kim,Yong Hwy Kim
机构: Sungkyunkwan University (成均馆大学); Seoul National University Hospital (首尔国立大学医院); Seoul National University College of Medicine (首尔国立大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures, Planning to submit JOSS (Journal of Open Source Software)
Abstract:Interactive Medical-SAM2 GUI is an open-source desktop application for semi-automatic annotation of 2D and 3D medical images. Built on the Napari multi-dimensional viewer, box/point prompting is integrated with SAM2-style propagation by treating a 3D volume as a slice sequence, enabling mask propagation from sparse prompts using Medical-SAM2 on top of SAM2. Voxel-level annotation remains essential for developing and validating medical imaging algorithms, yet manual labeling is slow and expensive for 3D scans, and existing integrations frequently emphasize per-slice interaction without providing a unified, cohort-oriented workflow for navigation, propagation, interactive correction, and quantitative export in a single local pipeline. To address this practical limitation, a local-first Napari workflow is provided for efficient 3D annotation across multiple studies using standard DICOM series and/or NIfTI volumes. Users can annotate cases sequentially under a single root folder with explicit proceed/skip actions, initialize objects via box-first prompting (including first/last-slice initialization for single-object propagation), refine predictions with point prompts, and finalize labels through prompt-first correction prior to saving. During export, per-object volumetry and 3D volume rendering are supported, and image geometry is preserved via SimpleITK. The GUI is implemented in Python using Napari and PyTorch, with optional N4 bias-field correction, and is intended exclusively for research annotation workflows. The code is released on the project page: this https URL.
[CV-99] Plug Play and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models
【速读】:该论文旨在解决多模态模型中因模态缺失(missing modalities)而导致的性能严重下降问题,其根本原因在于模型在训练过程中存在模态偏好失衡,即对某些模态过度优化而忽视其他模态。解决方案的关键在于提出一种基于频域分析的频率比指标(Frequency Ratio Metric, FRM),用于量化不同模态之间的主导关系;并进一步设计了一个可插拔的多模态权重分配模块(Multimodal Weight Allocation Module, MWAM),该模块根据FRM动态调整各模态分支的贡献权重,从而实现训练过程中的自适应平衡,提升模型对缺失模态的鲁棒性。
链接: https://arxiv.org/abs/2602.22644
作者: Siqi Lu,Wanying Xu,Yongbin Zheng,Wenting Luan,Peng Sun,Jianhang Yao
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, where the model develops an implicit preference for certain modalities, leading to the under-optimization of others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a Frequency Ratio Metric (FRM) to quantify modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose a Multimodal Weight Allocation Module, a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, such as those based on CNNs and ViTs. Furthermore, MWAM delivers consistent performance gains across a wide range of tasks and modality combinations. This advancement extends beyond merely optimizing the performance of the base model; it also manifests as further performance improvements to state-of-the-art methods addressing the missing modality problem.
[CV-100] QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition CVPR2026
【速读】:该论文旨在解决多视图几何中相机位姿同步(camera pose synchronization)的问题,特别是如何利用四焦点张量(quadrifocal tensor)这一高阶结构信息来提升从多个图像中恢复相机姿态的精度与鲁棒性。传统方法主要依赖成对的极线约束(如本质矩阵),但难以充分利用多视角间的高阶关联;本文提出了一种基于块四焦点张量(block quadrifocal tensor)的新框架,其关键在于发现该张量可进行Tucker分解,且其因子矩阵对应于堆叠的相机矩阵,具有固定的多线性秩(4,4,4,4),从而实现了与相机数量无关的结构化表示。进一步地,作者开发了首个针对四焦点张量的同步算法,结合Tucker分解、交替方向乘子法(ADMM)和迭代重加权最小二乘(IRLS),并揭示了块四焦点、三焦点(trifocal)和双焦点(bifocal)张量之间的内在联系,提出了联合同步三类张量的统一优化策略,实验验证了该方法在现代数据集上的有效性,证明了高阶张量信息在同步任务中的潜力与实用性。
链接: https://arxiv.org/abs/2602.22639
作者: Daniel Miao,Gilad Lerman,Joe Kileel
机构: University of Minnesota (明尼苏达大学); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Optimization and Control (math.OC)
备注: 30 pages, accepted to CVPR 2026
Abstract:In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover n cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4,~4,~4,~4) independent of n . We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.
[CV-101] CRAG : Can 3D Generative Models Help 3D Assembly?
【速读】:该论文旨在解决3D组装任务中现有方法仅依赖刚性变换进行姿态估计、无法处理缺失几何结构的问题。传统方法在面对部分缺失或复杂形状时,难以恢复完整的结构信息,导致组装结果不准确。解决方案的关键在于将3D组装重新建模为一个联合的生成与组装问题(joint assembly and generation problem),即通过协同优化实现互增强:组装过程提供局部结构先验以指导生成,而生成过程则注入整体形状上下文以消除组装中的歧义。该方法提出CRAG(Coupled Reasoning and Generation),能够同时生成合理的完整形状并预测输入部件的姿态,从而在具有多样几何形态、不同部件数量及缺失零件的自然场景对象上实现最优性能。
链接: https://arxiv.org/abs/2602.22629
作者: Zeyu Jiang,Sihang Li,Siqi Tan,Chenyang Xu,Juexiao Zhang,Julia Galway-Witham,Xue Wang,Scott A. Williams,Radu Iovita,Chen Feng,Jing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures
Abstract:Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Our code and models will be released.
[CV-102] DiffBMP: Differentiable Rendering with Bitmap Primitives CVPR2026
【速读】:该论文旨在解决传统可微分渲染器(differentiable renderer)仅限于矢量图形(vector graphics)而无法有效处理现实世界中广泛存在的位图图像(bitmap images)的问题。其关键解决方案是提出 DiffBMP,一个可扩展且高效的可微分渲染引擎,核心在于设计了一种高度并行化的渲染流水线,并采用定制的 CUDA 实现来高效计算梯度。该系统能够在消费级 GPU 上于不到 1 分钟内优化数千个位图图元的位置、旋转、缩放、颜色和透明度等参数,同时结合软光栅化(soft rasterization)、结构感知初始化、噪声画布以及针对视频或空间约束图像的专用损失函数/启发式策略,显著提升了优化稳定性和实用性。
链接: https://arxiv.org/abs/2602.22625
作者: Seongmin Hong,Junghun James Kim,Daehyeop Kim,Insoo Chung,Se Young Chun
机构: Seoul National University (首尔国立大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026, this https URL
Abstract:We introduce DiffBMP, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives all in under 1 min using a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate DiffBMP is not just an isolated tool, but a practical one designed to integrate into creative workflows. It supports exporting compositions to a native, layered file format, and the entire framework is publicly accessible via an easy-to-hack Python package.
[CV-103] Instruction-based Image Editing with Planning Reasoning and Generation
【速读】:该论文旨在解决基于指令的图像编辑(instruction-based image editing)中因场景理解能力不足而导致的编辑质量受限问题。现有方法依赖于大型语言模型、目标分割模型和编辑模型的链式结构,但这些方法仅具备单一模态的理解能力,难以应对复杂场景。其解决方案的关键在于引入一种多模态链式思维(Chain-of-Thought, CoT)框架,将整个编辑任务拆解为三个阶段:CoT规划、编辑区域推理和图像编辑。其中,通过多模态大语言模型训练指令驱动的编辑区域生成网络,并结合提示引导的图像编辑网络(基于大规模文本到图像扩散模型),实现对复杂真实图像的高质量编辑,显著提升了多模态理解与生成之间的协同能力。
链接: https://arxiv.org/abs/2602.22624
作者: Liya Ji,Chenyang Qi,Qifeng Chen
机构: HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures
Abstract:Editing images via instruction provides a natural way to generate interactive content, but it is a big challenge due to the higher requirement of scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a single modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi-modality model that provides the intelligent abilities to instruction-based image editing models for more complex cases. To achieve this goal, we individually separate the instruction editing task with the multi-modality chain of thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model could reason the appropriate sub-prompts considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, a hint-guided instruction-based editing network is proposed for editing image generations based on the sizeable text-to-image diffusion model to accept the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.
[CV-104] CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection ICLR2026
【速读】:该论文旨在解决源域无数据的域适应目标检测(Source-Free Domain Adaptive Object Detection, SF-DAOD)问题,即在不保留任何源域数据的前提下,将训练于标注源域的检测器适配到未标注的目标域。现有方法多集中于伪标签阈值调整或教师-学生框架优化,忽视了跨域数据中物体级别的结构线索。其解决方案的关键在于提出CGSA框架,首次将基于对象中心学习(Object-Centric Learning, OCL)引入SF-DAOD,通过在DETR架构中集成层级槽位感知(Hierarchical Slot Awareness, HSA)模块,逐步解耦图像为具有视觉先验意义的槽位表示,并利用类引导槽位对比(Class-Guided Slot Contrast, CGSC)模块引导槽位向类别语义对齐,从而保持语义一致性并促进域不变性适应。
链接: https://arxiv.org/abs/2602.22621
作者: Boyang Dai,Zeng Fan,Zihao Qi,Meng Lou,Yizhou Yu
机构: The University of Hong Kong (香港大学); National University of Singapore (新加坡国立大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The paper has been accepted by the conference ICLR 2026
Abstract:Source-Free Domain Adaptive Object Detection (SF-DAOD) aims to adapt a detector trained on a labeled source domain to an unlabeled target domain without retaining any source data. Despite recent progress, most popular approaches focus on tuning pseudo-label thresholds or refining the teacher-student framework, while overlooking object-level structural cues within cross-domain data. In this work, we present CGSA, the first framework that brings Object-Centric Learning (OCL) into SF-DAOD by integrating slot-aware adaptation into the DETR-based detector. Specifically, our approach integrates a Hierarchical Slot Awareness (HSA) module into the detector to progressively disentangle images into slot representations that act as visual priors. These slots are then guided toward class semantics via a Class-Guided Slot Contrast (CGSC) module, maintaining semantic consistency and prompting domain-invariant adaptation. Extensive experiments on multiple cross-domain datasets demonstrate that our approach outperforms previous SF-DAOD methods, with theoretical derivations and experimental analysis further demonstrating the effectiveness of the proposed components and the framework, thereby indicating the promise of object-centric design in privacy-sensitive adaptation scenarios. Code is released at this https URL.
[CV-105] Coded-E2LF: Coded Aperture Light Field Imaging from Events CVPR2026
【速读】:该论文旨在解决如何仅通过事件数据(event data)实现4-D光场(light field)的高精度重建问题,传统方法依赖于同时采集事件和强度图像,限制了硬件实现的灵活性。其解决方案的关键在于提出了一种基于编码孔径(coded aperture)与静态事件相机相结合的计算成像方法——Coded-E2LF,该方法完全基于事件信息进行光场重建,并通过引入黑斑图案(black pattern)在编码模式中的关键作用,理论上支持且实践上显著提升了仅从事件中恢复光场的能力。实验表明,该方法可在真实场景中实现像素级精度的4-D光场重建,是首个证明纯事件驱动光场重建可行性的工作。
链接: https://arxiv.org/abs/2602.22620
作者: Tomoya Tsuchida,Keita Takahashi,Chihiro Tsutake,Toshiaki Fujii,Hajime Nagahara
机构: Nagoya University (名古屋大学); Osaka University (大阪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to CVPR 2026
Abstract:We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. In a previous work, an imaging system similar to ours was adopted, but both events and intensity images were captured and used for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements from the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. We finally implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software and supplementary video are available from our project website.
[CV-106] Spectrally Distilled Representations Aligned with Instruction-Augmented LLM s for Satellite Imagery
【速读】:该论文旨在解决当前视觉-语言基础模型(Vision-Language Foundation Models, VLFMs)在地球观测(Earth Observation, EO)领域应用中的两大瓶颈问题:一是多光谱输入虽具信息量但因波段冗余和配准不一致难以稳定利用;二是CLIP-style文本编码器语义表达能力有限,削弱了细粒度的图文对齐效果。其解决方案的关键在于提出SATtxt框架,包含两个核心阶段:首先通过“光谱表示蒸馏”(Spectral Representation Distillation),将冻结的多光谱教师模型中的光谱先验知识以轻量投影方式迁移至仅使用RGB输入的学生模型;其次采用“光谱引导对齐”(Spectrally Grounded Alignment)结合指令增强的大语言模型(Instruction-Augmented LLMs),实现蒸馏后的视觉空间与高表达力的语言嵌入空间的有效衔接。此方法使模型在推理时仅需RGB图像即可保持训练中学习到的光谱感知能力,显著提升了零样本分类、检索和线性探测性能。
链接: https://arxiv.org/abs/2602.22613
作者: Minh Kha Do,Wei Xiang,Kang Han,Di Wu,Khoa Phan,Yi-Ping Phoebe Chen,Gaowen Liu,Ramana Rao Kompella
机构: La Trobe University (拉特罗布大学); Cisco Research (思科研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: this https URL
[CV-107] DP-aware AdaLN-Zero: Taming Conditioning-Induced Heavy-Tailed Gradients in Differentially Private Diffusion
【速读】:该论文旨在解决差分隐私随机梯度下降(Differentially Private Stochastic Gradient Descent, DP-SGD)在条件扩散模型(conditional diffusion models)训练中因异质性条件上下文(如观测历史、缺失模式或异常协变量)引发的重尾梯度问题。此类梯度会导致全局裁剪(global clipping)过度响应稀有事件,从而引入显著的裁剪偏差并降低隐私预算下的模型效用。解决方案的关键在于提出一种即插即用的敏感度感知条件机制——DP-aware AdaLN-Zero,通过边界参数化联合约束条件表示幅度与AdaLN调制参数,提前抑制由条件引起的极端梯度尾部事件,从而减少裁剪畸变并提升隐私保护下的插值、填补和预测性能,同时保持非私有训练中的表达能力。
链接: https://arxiv.org/abs/2602.22610
作者: Tao Huang,Jiayang Meng,Xu Yang,Chen Hou,Hong Chen
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Condition injection enables diffusion models to generate context-aware outputs, which is essential for many time-series tasks. However, heterogeneous conditional contexts (e.g., observed history, missingness patterns or outlier covariates) can induce heavy-tailed per-example gradients. Under Differentially Private Stochastic Gradient Descent (DP-SGD), these rare conditioning-driven heavy-tailed gradients disproportionately trigger global clipping, resulting in outlier-dominated updates, larger clipping bias, and degraded utility under a fixed privacy budget. In this paper, we propose DP-aware AdaLN-Zero, a drop-in sensitivity-aware conditioning mechanism for conditional diffusion transformers that limits conditioning-induced gain without modifying the DP-SGD mechanism. DP-aware AdaLN-Zero jointly constrains conditioning representation magnitude and AdaLN modulation parameters via bounded re-parameterization, suppressing extreme gradient tail events before gradient clipping and noise injection. Empirically, DP-SGD equipped with DP-aware AdaLN-Zero improves interpolation/imputation and forecasting under matched privacy settings. We observe consistent gains on a real-world power dataset and two public ETT benchmarks over vanilla DP-SGD. Moreover, gradient diagnostics attribute these improvements to conditioning-specific tail reshaping and reduced clipping distortion, while preserving expressiveness in non-private training. Overall, these results show that sensitivity-aware conditioning can substantially improve private conditional diffusion training without sacrificing standard performance.
[CV-108] LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals
【速读】:该论文旨在解决传统基于三维查找表(3D LUT)的图像增强与风格迁移方法中存在的参数冗余、可解释性差以及感知质量不足的问题。解决方案的关键在于提出一种统一的低秩(low-rank)建模框架LoR-LUT,其核心创新是将残差校正(residual corrections,本质上为低秩张量)与一组基础LUT联合使用,而非依赖密集的基础LUT融合策略。该方法在保持相同三线性插值复杂度的前提下,显著减少了网络参数、残差校正和LUT的总规模,同时提升了图像的感知质量,并通过交互式可视化工具LoR-LUT Viewer增强了结果的可解释性和用户信任度。
链接: https://arxiv.org/abs/2602.22607
作者: Ziqi Zhao,Abhijit Mishra,Shounak Roychowdhury
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present LoR-LUT, a unified low-rank formulation for compact and interpretable 3D lookup table (LUT) generation. Unlike conventional 3D-LUT-based techniques that rely on fusion of basis LUTs, which are usually dense tensors, our unified approach extends the current framework by jointly using residual corrections, which are in fact low-rank tensors, together with a set of basis LUTs. The approach described here improves the existing perceptual quality of an image, which is primarily due to the technique’s novel use of residual corrections. At the same time, we achieve the same level of trilinear interpolation complexity, using a significantly smaller number of network, residual corrections, and LUT parameters. The experimental results obtained from LoR-LUT, which is trained on the MIT-Adobe FiveK dataset, reproduce expert-level retouching characteristics with high perceptual fidelity and a sub-megabyte model size. Furthermore, we introduce an interactive visualization tool, termed LoR-LUT Viewer, which transforms an input image into the LUT-adjusted output image, via a number of slidebars that control different parameters. The tool provides an effective way to enhance interpretability and user confidence in the visual results. Overall, our proposed formulation offers a compact, interpretable, and efficient direction for future LUT-based image enhancement and style transfer.
[CV-109] ϕ-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models CVPR’26
【速读】:该论文旨在解决大规模多模态模型(Large Multimodal Models, LMMs)在持续学习(Continual Learning)过程中因数据分布不平衡导致的公平性问题,即模型在面对不同任务时可能因某些类别的样本稀缺而产生偏差更新,进而影响整体性能。其核心解决方案是提出一种新的公平性导向的直接偏好优化框架(Fairness Direct Preference Optimization, FaiDPO 或 φ-DPO),关键在于将传统的直接偏好优化(Direct Preference Optimization, DPO)扩展为能够显式处理数据分布偏倚的 φ-DPO 损失函数,从而在缓解灾难性遗忘的同时,提升模型在各类任务上的公平表现。理论分析与实验证明该方法在多个基准测试中均优于现有 LMM 持续学习方法。
链接: https://arxiv.org/abs/2602.22601
作者: Thanh-Dat Truong,Huu-Thien Tran,Jackson Cothren,Bhiksha Raj,Khoa Luu
机构: University of Arkansas (阿肯色大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR’26
Abstract:Fairness in Continual Learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress in addressing catastrophic forgetting, the problem of fairness caused the imbalanced data remains largely underexplored. This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or \phi -DPO) framework for continual learning in LMMs. In particular, we first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Then, we identify the limitations of conventional DPO in imbalanced data and present a new \phi -DPO loss that explicitly addresses distributional biases. We provide a comprehensive theoretical analysis demonstrating that our approach addresses both forgetting and data imbalance. Additionally, to enable \phi -DPO-based continual learning, we construct pairwise preference annotations for existing benchmarks in the context of continual learning. Extensive experiments and ablation studies show the proposed \phi -DPO achieves State-of-the-Art performance across multiple benchmarks, outperforming prior continual learning methods of LMMs.
[CV-110] BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model
【速读】:该论文旨在解决极稀疏、无约束真实场景图像下新视角合成(Novel View Synthesis, NVS)质量差的问题,特别是由传统方法在推理时产生的不一致细节和伪影。其解决方案的关键在于对预训练的Stable Video Diffusion(SVD)模型进行深度优化:首先分析扩散模型的潜在空间,引入两种新组件——时间等变正则化(temporal equivariance regularization)和视觉基础模型对齐表示(vision foundation model-aligned representation),均应用于SVD流水线中的变分自编码器(Variational Autoencoder, VAE)模块;同时结合前馈式3D高斯溅射(3D Gaussian Splatting, 3DGS)模型生成连续、无伪影且视图一致的新视角图像,从而显著提升NVS的保真度与一致性。
链接: https://arxiv.org/abs/2602.22596
作者: Yuci Han,Charles Toth,John E. Anderson,William J. Shuart,Alper Yilmaz
机构: The Ohio State University (俄亥俄州立大学); USACE ERDC GRL (美国陆军工程兵团垦务局研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.
[CV-111] Dont let the information slip away
【速读】:该论文旨在解决现有目标检测模型在处理前景物体时忽视背景上下文信息的问题,从而导致检测性能受限。尽管YOLO系列和DETR类模型(如RT-DETR及其改进版本RT-DETRv2)已在精度与速度上达到先进水平,但它们主要关注前景特征而忽略了背景提供的语义线索。解决方案的关键在于提出一种名为Association DETR的新模型,通过显式建模前景与背景之间的关联关系,有效利用背景上下文信息来增强目标检测的准确性,最终在COCO val2017数据集上实现了优于其他主流模型的性能表现。
链接: https://arxiv.org/abs/2602.22595
作者: Taozhe Li
机构: University of Oklahoma (俄克拉荷马大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10
Abstract:Real-time object detection has advanced rapidly in recent years. The YOLO series of detectors is among the most well-known CNN-based object detection models and cannot be overlooked. The latest version, YOLOv26, was recently released, while YOLOv12 achieved state-of-the-art (SOTA) performance with 55.2 mAP on the COCO val2017 dataset. Meanwhile, transformer-based object detection models, also known as DEtection TRansformer (DETR), have demonstrated impressive performance. RT-DETR is an outstanding model that outperformed the YOLO series in both speed and accuracy when it was released. Its successor, RT-DETRv2, achieved 53.4 mAP on the COCO val2017 dataset. However, despite their remarkable performance, all these models let information to slip away. They primarily focus on the features of foreground objects while neglecting the contextual information provided by the background. We believe that background information can significantly aid object detection tasks. For example, cars are more likely to appear on roads rather than in offices, while wild animals are more likely to be found in forests or remote areas rather than on busy streets. To address this gap, we propose an object detection model called Association DETR, which achieves state-of-the-art results compared to other object detection models on the COCO val2017 dataset.
[CV-112] Causal Motion Diffusion Models for Autoregressive Motion Generation CVPR2026 MDM
【速读】:该论文旨在解决当前人体运动生成中两大核心问题:一是基于全序列扩散模型的方法依赖双向生成,破坏了时间因果性且难以实现实时应用;二是自回归模型存在不稳定性和误差累积问题。解决方案的关键在于提出因果运动扩散模型(Causal Motion Diffusion Models, CMDM),其核心创新包括:首先构建一个运动-语言对齐的因果变分自编码器(Motion-Language-Aligned Causal VAE, MAC-VAE),将运动序列编码为具有时间因果性的潜在表示;在此基础上,设计一种因果扩散Transformer,通过因果扩散强制训练实现帧级有序去噪;同时引入帧级采样调度策略与因果不确定性机制,支持快速推理和流式合成。该框架在保持语义保真度和时序平滑性的同时显著降低推理延迟,适用于交互式场景下的长时序运动生成。
链接: https://arxiv.org/abs/2602.22594
作者: Qing Yu,Akihisa Watanabe,Kent Fujiwara
机构: LY Corporation(LY公司); Waseda University(早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026, Project website: this https URL
Abstract:Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
[CV-113] GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
【速读】:该论文旨在解决现有前馈式3D重建方法在稀疏无姿态视图下性能受限、难以适应域外数据以及引入生成先验后推理时间显著增加的问题。其核心挑战在于传统前馈流水线采用一次性预测范式,模型容量受限、缺乏推理时优化能力,且不适用于持续注入生成先验。解决方案的关键是提出GIFSplat框架,它是一种纯前馈的迭代精化机制,通过少量仅向前传播的残差更新逐步利用渲染证据优化3D场景,同时通过蒸馏冻结扩散先验生成高保真新视角图像,并将其转化为高斯级线索以实现每场景自适应,无需梯度反向传播或视图集扩展,从而在保持秒级推理时间的同时显著提升重建质量(PSNR最高提升2.1 dB)。
链接: https://arxiv.org/abs/2602.22571
作者: Tianyu Chen,Wei Xiang,Kang Han,Yu Lu,Di Wu,Gaowen Liu,Ramana Rao Kompella
机构: La Trobe University (拉特罗布大学); Cisco Research (思科研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipeline: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine current 3D scene using rendering evidence, achieve favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.
[CV-114] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation
【速读】:该论文旨在解决当前扩散模型(diffusion models)中指导方法(diffusion guidance methods)评估体系存在的偏差问题,尤其是现有基于人类偏好的评价指标对大引导尺度(guidance scale)存在显著倾向性,导致高评分可能仅源于语义对齐而非性能提升,甚至掩盖图像质量下降的问题。其解决方案的关键在于提出一种新的引导感知评估框架(Guidance-aware Evaluation, GA-Eval),通过有效的引导尺度校准,将不同指导方法的效果分解为与标准分类器自由引导(Classifier-free Guidance, CFG)正交和并行的成分,从而实现公平比较;同时揭示了部分先进方法在传统评估下表现优异但实际无效的现象,强调需重新审视该领域的评估范式与研究方向。
链接: https://arxiv.org/abs/2602.22570
作者: Dian Xie,Shitong Shao,Lichen Bai,Zikai Zhou,Bojun Cheng,Shuo Yang,Jun Wu,Zeke Xie
机构: The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology (Shenzhen); Cogniser Information Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall that common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design Transcendent Diffusion Guidance (TDG) method that can significantly improve human preference scores in the conventional evaluation framework but actually does not work in practice. Fourth, in extensive experiments, we empirically evaluate recent eight diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework. Notably, simply increasing the CFG scales can compete with most studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. Our work would strongly motivate the community to rethink the evaluation paradigm and future directions of this field.
[CV-115] Quality-Aware Robust Multi-View Clustering for Heterogeneous Observation Noise
【速读】:该论文旨在解决深度多视图聚类(Deep Multi-View Clustering)在真实场景中对复杂噪声敏感的问题,尤其是现有方法通常采用简化的二元假设(即数据要么完全干净要么完全污染),忽略了实际中普遍存在的异质性观测噪声(heterogeneous observation noise),即不同样本的污染强度呈连续变化。解决方案的关键在于提出一种名为质量感知鲁棒多视图聚类(Quality-Aware Robust Multi-View Clustering, QARMVC)的新框架:其核心机制是利用信息瓶颈(information bottleneck)提取各视图的内在语义用于重建,并通过重建差异量化细粒度的污染强度,进而生成实例级质量分数;这些质量分数被嵌入到分层学习策略中——在特征层面设计质量加权对比目标以自适应抑制噪声传播,在融合层面通过质量加权聚合构建高质量全局共识,并借助互信息最大化实现局部视图的对齐与修正。
链接: https://arxiv.org/abs/2602.22568
作者: Peihan Wu,Guanjie Cheng,Yufei Tong,Meng Xi,Shuiguang Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep multi-view clustering has achieved remarkable progress but remains vulnerable to complex noise in real-world applications. Existing noisy robust methods predominantly rely on a simplified binary assumption, treating data as either perfectly clean or completely corrupted. This overlooks the prevalent existence of heterogeneous observation noise, where contamination intensity varies continuously across data. To bridge this gap, we propose a novel framework termed Quality-Aware Robust Multi-View Clustering (QARMVC). Specifically, QARMVC employs an information bottleneck mechanism to extract intrinsic semantics for view reconstruction. Leveraging the insight that noise disrupts semantic integrity and impedes reconstruction, we utilize the resulting reconstruction discrepancy to precisely quantify fine-grained contamination intensity and derive instance-level quality scores. These scores are integrated into a hierarchical learning strategy: at the feature level, a quality-weighted contrastive objective is designed to adaptively suppress the propagation of noise; at the fusion level, a high-quality global consensus is constructed via quality-weighted aggregation, which is subsequently utilized to align and rectify local views via mutual information maximization. Extensive experiments on five benchmark datasets demonstrate that QARMVC consistently outperforms state-of-the-art baselines, particularly in scenarios with heterogeneous noise intensities.
[CV-116] SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction
【速读】:该论文旨在解决现有深度引导的3D重建方法中存在的尺度漂移(scale drift)、多视角不一致性以及高保真几何重建需大量优化后处理的问题。其解决方案的关键在于提出SwiftNDC框架,该框架基于神经深度校正场(Neural Depth Correction field)生成跨视角一致的深度图,并通过反投影与鲁棒重投影误差过滤获得稠密点云,从而为下游重建提供可靠且均匀分布的几何初始化;此初始化显著加速了3D高斯溅射(3D Gaussian Splatting, 3DGS)的收敛过程,减少优化迭代次数,同时提升网格重建质量和新视角合成的渲染保真度。
链接: https://arxiv.org/abs/2602.22565
作者: Kang Han,Wei Xiang,Lu Yu,Mathew Wyatt,Gaowen Liu,Ramana Rao Kompella
机构: La Trobe University (拉特罗布大学); Australian Institute of Marine Science (澳大利亚海洋科学研究所); Cisco Research (思科研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Depth-guided 3D reconstruction has gained popularity as a fast alternative to optimization-heavy approaches, yet existing methods still suffer from scale drift, multi-view inconsistencies, and the need for substantial refinement to achieve high-fidelity geometry. Here, we propose SwiftNDC, a fast and general framework built around a Neural Depth Correction field that produces cross-view consistent depth maps. From these refined depths, we generate a dense point cloud through back-projection and robust reprojection-error filtering, obtaining a clean and uniformly distributed geometric initialization for downstream reconstruction. This reliable dense geometry substantially accelerates 3D Gaussian Splatting (3DGS) for mesh reconstruction, enabling high-quality surfaces with significantly fewer optimization iterations. For novel-view synthesis, SwiftNDC can also improve 3DGS rendering quality, highlighting the benefits of strong geometric initialization. We conduct a comprehensive study across five datasets, including two for mesh reconstruction, as well as three for novel-view synthesis. SwiftNDC consistently reduces running time for accurate mesh reconstruction and boosts rendering fidelity for view synthesis, demonstrating the effectiveness of combining neural depth refinement with robust geometric initialization for high-fidelity and efficient 3D reconstruction.
[CV-117] DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
【速读】:该论文旨在解决当前基于扩散模型的自动驾驶场景生成方法中存在的两大核心问题:一是几何条件(如高精地图与3D边界框)之间的隐式依赖关系导致控制条件独立变化时生成失败;二是生成场景在语义和结构细节上表现不足,具体表现为语义上下文受限(因简短且视角不变的文本描述)以及前景结构失真(因标准去噪损失对空间权重均匀分配,忽略高频结构信息)。解决方案的关键在于提出DrivePTS框架,其创新性体现在三个方面:首先采用渐进式学习策略并引入显式互信息约束以缓解几何条件间的耦合;其次利用视觉-语言模型生成跨六类语义维度的多视角分层描述,提供细粒度文本引导;最后设计频域引导的结构损失函数,增强模型对高频元素的敏感性,从而提升前景结构保真度。
链接: https://arxiv.org/abs/2602.22549
作者: Zhechao Wang,Yiming Zeng,Lufan Ma,Zeqing Fu,Chen Bai,Ziyao Lin,Cheng Lu
机构: XPeng Motors (小鹏汽车)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model’s sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.
[CV-118] DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)病理标志物tau蛋白正电子发射断层成像(tau-PET)成本高、可及性有限的问题,提出一种基于磁共振成像(MRI)的替代方法,即从配对的T1加权和FLAIR MRI图像中合成tau-PET图像。其解决方案的关键在于引入DisQ-HNet(DQH)框架,该框架结合了两个核心组件:(i) 基于部分信息分解(Partial Information Decomposition, PID)引导的向量量化编码器,将潜在特征划分为冗余、独特和互补信息成分,从而明确各模态贡献;(ii) 一种基于结构边缘线索条件控制的伪跳跃连接(pseudo-skip connections)半UNet解码器,有效保留解剖细节而不直接复用编码器特征。此设计不仅保持了重建保真度,还更优地保留了与AD下游任务(如Braak分期、tau定位和分类)相关的疾病信号,并通过PID引导的Shapley分析实现对合成摄取模式的模态特异性归因。
链接: https://arxiv.org/abs/2602.22545
作者: Agamdeep S. Chopra,Caitlin Neher,Tianyi Ren,Juampablo E. Heras Rivera,Mehmet Kurt
机构: University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 8 figures, 8 tables; includes PID guided vector quantized latent factorization and sobel edge conditioned Half-UNet decoder
Abstract:Tau positron emission tomography (tau-PET) provides an in vivo marker of Alzheimer’s disease pathology, but cost and limited availability motivate MRI-based alternatives. We introduce DisQ-HNet (DQH), a framework that synthesizes tau-PET from paired T1-weighted and FLAIR MRI while exposing how each modality contributes to the prediction. The method combines (i) a Partial Information Decomposition (PID)-guided, vector-quantized encoder that partitions latent information into redundant, unique, and complementary components, and (ii) a Half-UNet decoder that preserves anatomical detail using pseudo-skip connections conditioned on structural edge cues rather than direct encoder feature reuse. Across multiple baselines (VAE, VQ-VAE, and UNet), DisQ-HNet maintains reconstruction fidelity and better preserves disease-relevant signal for downstream AD tasks, including Braak staging, tau localization, and classification. PID-based Shapley analysis provides modality-specific attribution of synthesized uptake patterns.
[CV-119] Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning
【速读】:该论文旨在解决**组合图像检索(Composed Image Retrieval, CIR)**中因传统融合流水线依赖监督三元组而丢失细粒度视觉线索,以及现有零样本方法通过图像描述与文本编辑拼接导致隐含用户意图缺失和结果重复的问题。其解决方案的关键在于提出Pix2Key框架,将查询和候选图像统一表示为开放词汇视觉词典(open-vocabulary visual dictionaries),从而在统一嵌入空间中实现意图感知的约束匹配(intent-aware constraint matching)与多样性感知的重排序(diversity-aware reranking)。此外,引入仅使用图像进行自监督预训练的V-Dict-AE组件,进一步增强视觉词典表达能力,无需CIR特定标注即可提升细粒度属性理解,显著改善检索性能与意图一致性。
链接: https://arxiv.org/abs/2602.22510
作者: Guoyizhe Wei,Yang Jiao,Nan Xi,Zhishen Huang,Jingjing Meng,Rama Chellappa,Yan Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.
[CV-120] Space Syntax-guided Post-training for Residential Floor Plan Generation
【速读】:该论文旨在解决预训练生成式模型在住宅平面图生成中对建筑学先验知识(如公共空间的配置主导性和连通性)关注不足的问题,导致生成结果功能层次模糊、公共空间主导性弱。其解决方案的关键在于提出一种后训练范式——Space Syntax-guided Post-training (SSPT),通过引入一个非微分化的空间句法“oracle”,将RPLAN风格布局转换为矩形空间图并计算基于整合度的测量值以量化公共空间主导性和功能层级;在此基础上,采用两种策略实现知识注入:一是基于空间句法过滤与扩散微调的迭代重训练,二是基于近端策略优化(PPO)的空间句法奖励强化学习,其中PPO策略在提升公共空间主导性和功能清晰度的同时展现出更高的计算效率和更低的方差,从而为数据驱动的平面生成提供了可扩展且兼容性强的理论融合路径。
链接: https://arxiv.org/abs/2602.22507
作者: Zhuoyang Jiang,Dongqing Zhang
机构: Tongji University (同济大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pre-trained generative models for residential floor plans are typically optimized to fit large-scale data distributions, which can under-emphasize critical architectural priors such as the configurational dominance and connectivity of domestic public spaces (e.g., living rooms and foyers). This paper proposes Space Syntax-guided Post-training (SSPT), a post-training paradigm that explicitly injects space syntax knowledge into floor plan generation via a non-differentiable oracle. The oracle converts RPLAN-style layouts into rectangle-space graphs through greedy maximal-rectangle decomposition and door-mediated adjacency construction, and then computes integration-based measurements to quantify public space dominance and functional hierarchy. To enable consistent evaluation and diagnosis, we further introduce SSPT-Bench (Eval-8), an out-of-distribution benchmark that post-trains models using conditions capped at \leq 7 rooms while evaluating on 8-room programs, together with a unified metric suite for dominance, stability, and profile alignment. SSPT is instantiated with two strategies: (i) iterative retraining via space-syntax filtering and diffusion fine-tuning, and (ii) reinforcement learning via PPO with space-syntax rewards. Experiments show that both strategies improve public-space dominance and restore clearer functional hierarchy compared to distribution-fitted baselines, while PPO achieves stronger gains with substantially higher compute efficiency and reduced variance. SSPT provides a scalable pathway for integrating architectural theory into data-driven plan generation and is compatible with other generative backbones given a post-hoc evaluation oracle. Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.22507 [cs.LG] (or arXiv:2602.22507v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.22507 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-121] Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在图像描述生成过程中频繁出现的幻觉问题,即模型错误地生成输入图像中不存在的对象。作者通过理论分析指出,这一现象源于“空间信用坍塌”(spatial credit collapse)——早期Transformer层中的激活信用过度集中于稀疏的视觉补丁,从而抑制了上下文信息的利用,加剧了对语言先验的依赖。解决方案的关键在于提出一种无需训练的推理阶段干预方法——空间信用重分配(Spatial Credit Redistribution, SCR),其核心机制是基于低熵输入引导,将高注意力源补丁的隐藏状态激活重新分配至其上下文区域,从而恢复被压制的视觉证据。实验表明,SCR在多个主流VLM架构和规模下显著降低幻觉率(POPE-Adversarial下降4.7–6.0个百分点,CHAIR基准下降达51%相对改善),同时保持生成质量(CIDEr变化小于0.8个百分点),且计算开销极低(仅增加43–56毫秒),优于现有方法如OPERA、VCD及OVCD。
链接: https://arxiv.org/abs/2602.22469
作者: Niamul Hassan Samin,Md Arifur Rahman,Abdullah Ibne Hanif,Juena Ahmed Noshin,Md Ashikur Rahman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in early transformer layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model families (Chameleon, LLaVA, and Qwen, including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on POPE and CHAIR benchmarks. SCR reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative), and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings. A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage points to only ~2.6-3.4 percentage points, pointing to credit-collapse as the key driver.
[CV-122] Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
【速读】:该论文旨在解决可穿戴助手在实时在线情景记忆问答(episodic memory question answering)中因依赖云端计算而引发的隐私泄露与高延迟问题。其核心解决方案是将多模态大语言模型(Multimodal Large Language Models, MLLMs)部署于边缘设备上,并设计了一个双线程异步处理架构:一个描述符线程(Descriptor Thread)持续将视频流转换为轻量级文本记忆,另一个问答线程(Question Answering Thread)基于该文本记忆进行推理以响应查询。通过在QAEgo4D-Closed基准上的实验验证,该方案在消费级8GB GPU上实现了51.76%准确率与0.41秒首字节时间(Time-To-First-Token, TTFT),在企业级服务器上进一步提升至54.40%准确率,性能接近云端方案(56.00%),证明了边缘部署在保障隐私的同时具备实用潜力。
链接: https://arxiv.org/abs/2602.22455
作者: Giuseppe Lando,Rosario Forte,Antonino Furnari
机构: University of Catania (卡塔尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, hence we investigate implementation on the edge. We integrated streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of Multimodal Large Language Models (MLLMs) within strict resource boundaries, showing promising results also when compared to clound-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
[CV-123] SimpleOCR: Rendering Visualized Questions to Teach MLLM s to Read
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉文本理解中的“模态懒惰”(modality laziness)问题,即模型是否真正具备从图像中提取和理解嵌入文本的能力,而非仅依赖提示词中的文本信息进行参数化捷径推理。为诊断此问题,作者提出视觉化问题(Visualized-Question, VQ)设置,将查询文本直接渲染到图像上以强制模型进行视觉交互。实验发现,即使模型拥有强大的光学字符识别(OCR)能力,在VQ设置下性能仍下降达12.7%,揭示其对视觉文本处理的显著依赖不足。解决方案的关键在于提出SimpleOCR训练策略——通过随机化风格将训练样本转换为VQ格式,从而人为破坏基于文本的捷径,迫使模型激活并优化其视觉文本提取路径。该方法无需架构改动即可实现显著性能提升,并在多个分布外(OOD)基准上优于基线模型与基于强化学习(RL)的方法,且数据效率极高,仅需30倍更少样本即可达到更优效果。
链接: https://arxiv.org/abs/2602.22426
作者: Yibo Peng,Peng Xia,Ding Zhong,Kaide Zeng,Siwei Han,Yiyang Zhou,Jiaqi Liu,Ruiyi Zhang,Huaxiu Yao
机构: UNC-Chapel Hill (北卡罗来纳大学教堂山分校); Carnegie Mellon University (卡内基梅隆大学); University of Michigan (密歇根大学); Adobe Research (Adobe研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely read'' text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated modality laziness.‘’ To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at this https URL.
[CV-124] CLIP Is Shortsighted: Paying Attention Beyond the First Sentence CVPR2026
【速读】:该论文旨在解决CLIP模型在预训练过程中因依赖短文本描述而产生的偏差问题,即模型倾向于捕捉图像中显著物体的简单描述,导致在复杂场景和密集文本描述上的对齐效果较差。其关键解决方案是提出DeBias-CLIP,通过在训练阶段移除长文本中的首句摘要,并引入句子子采样与文本标记填充策略,使监督信号均匀分布于所有token位置,从而打破“首句注意力集中”的捷径学习模式,提升长文本检索性能并增强对句子顺序变化的鲁棒性。
链接: https://arxiv.org/abs/2602.22419
作者: Marc-Antoine Lavoie,Anas Mahmoud,Aldo Zaimi,Arsene Fansi Tchango,Steven L. Waslander
机构: University of Toronto Robotics Institute (多伦多大学机器人研究所); Mila (Mila)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 13 figures, to be published in the CVPR 2026 proceedings
Abstract:CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP’s pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
[CV-125] MolFM-Lite: Multi-Modal Molecular Property Prediction with Conformer Ensemble Attention and Cross-Modal Fusion
【速读】:该论文旨在解决分子属性预测中单一分子表示(如序列、图或3D结构)无法充分捕捉分子多维特征的问题,特别是忽略了分子几何构象的动态变化对性质预测的影响。其核心解决方案是提出MolFM-Lite,一种多模态模型,通过交叉注意力融合机制联合编码SELFIES序列(1D)、分子图(2D)和构象集合(3D),并利用特征级线性调制(FiLM)引入实验上下文条件信息。关键创新在于:(1) 构象集合注意力机制,结合可学习注意力与基于RDKit生成的多个构象的玻尔兹曼加权先验,以捕获分子形状的热力学分布;(2) 跨模态融合层,允许各模态间相互关注,实现互补信息共享。实验证明,三模态融合相比单模态基线提升7–11% AUC,构象集合进一步带来约2%的性能增益。
链接: https://arxiv.org/abs/2602.22405
作者: Syed Omer Shah,Mohammed Maqsood Ahmed,Danish Mohiuddin Mohammed,Shahnawaz Alam,Mohd Vahaj ur Rahman
机构: University at Buffalo (纽约州立大学布法罗分校); Northeastern University (东北大学); Muffakham Jah College of Engineering and Technology (穆法哈姆·贾赫工程学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most machine learning models for molecular property prediction rely on a single molecular representation (either a sequence, a graph, or a 3D structure) and treat molecular geometry as static. We present MolFM-Lite, a multi-modal model that jointly encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D) through cross-attention fusion, while conditioning predictions on experimental context via Feature-wise Linear Modulation (FiLM). Our main methodological contributions are: (1) a conformer ensemble attention mechanism that combines learnable attention with Boltzmann-weighted priors over multiple RDKit-generated conformers, capturing the thermodynamic distribution of molecular shapes; and (2) a cross-modal fusion layer where each modality can attend to others, enabling complementary information sharing. We evaluate on four MoleculeNet scaffold-split benchmarks using our model’s own splits, and report all baselines re-evaluated under the same protocol. Comprehensive ablation studies across all four datasets confirm that each architectural component contributes independently, with tri-modal fusion providing 7-11% AUC improvement over single-modality baselines and conformer ensembles adding approximately 2% over single-conformer variants. Pre-training on ZINC250K (~250K molecules) using cross-modal contrastive and masked-atom objectives enables effective weight initialization at modest compute cost. We release all code, trained models, and data splits to support reproducibility.
[CV-126] Vision Transformers Need More Than Registers CVPR2026
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在大规模预训练后普遍存在但机制尚不明确的“artifacts”(伪影)问题,这些伪影表现为模型在不同监督范式和下游任务中依赖语义无关的背景块作为捷径来表征全局语义。研究表明,这种现象源于ViT的“lazy aggregation”(懒惰聚合)行为:即在全局注意力机制和粗粒度语义监督驱动下,模型倾向于利用背景区域的patch特征替代关键语义信息。解决方案的关键在于对patch特征进行选择性集成到CLS token中,从而削弱由背景主导的捷径依赖,并在12个不同监督方式(标签监督、文本监督与自监督)的基准测试中一致提升性能。
链接: https://arxiv.org/abs/2602.22394
作者: Cheng Shi,Yizhou Yu,Sibei Yang
机构: The University of Hong Kong (香港大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and Coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.
[CV-127] Enhancing Renal Tumor Malignancy Prediction: Deep Learning with Automatic 3D CT Organ Focused Attention
【速读】:该论文旨在解决肾肿瘤恶性程度预测准确性不足的问题,尤其是在术前影像学评估中缺乏可靠方法的现状。传统深度学习模型依赖于人工分割(manual segmentation)来提取肿瘤区域并降低噪声,但此过程耗时、昂贵且高度依赖专家经验。其解决方案的关键在于提出一种基于器官聚焦注意力(Organ Focused Attention, OFA)损失函数的深度学习框架,通过引导图像块仅关注同类器官区域,从而在无需显式分割3D肾CT图像的情况下实现更精准的恶性预测。该方法在私有UF-IDR数据集和公开KiTS21数据集上均优于依赖分割裁剪的传统模型,验证了其在提升预测性能与临床效率方面的潜力。
链接: https://arxiv.org/abs/2602.22381
作者: Zhengkang Fan,Chengkun Sun,Russell Terry,Jie Xu,Longin Jan Latecki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures, Accepted at IEEE ISBI 2026
Abstract:Accurate prediction of malignancy in renal tumors is crucial for informing clinical decisions and optimizing treatment strategies. However, existing imaging modalities lack the necessary accuracy to reliably predict malignancy before surgical intervention. While deep learning has shown promise in malignancy prediction using 3D CT images, traditional approaches often rely on manual segmentation to isolate the tumor region and reduce noise, which enhances predictive performance. Manual segmentation, however, is labor-intensive, costly, and dependent on expert knowledge. In this study, a deep learning framework was developed utilizing an Organ Focused Attention (OFA) loss function to modify the attention of image patches so that organ patches attend only to other organ patches. Hence, no segmentation of 3D renal CT images is required at deployment time for malignancy prediction. The proposed framework achieved an AUC of 0.685 and an F1-score of 0.872 on a private dataset from the UF Integrated Data Repository (IDR), and an AUC of 0.760 and an F1-score of 0.852 on the publicly available KiTS21 dataset. These results surpass the performance of conventional models that rely on segmentation-based cropping for noise reduction, demonstrating the frameworks ability to enhance predictive accuracy without explicit segmentation input. The findings suggest that this approach offers a more efficient and reliable method for malignancy prediction, thereby enhancing clinical decision-making in renal cancer diagnosis.
[CV-128] AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction CVPR2026
【速读】:该论文旨在解决单目无人机(UAV)视频在复杂空域条件下进行4D场景重建时面临的深度模糊性和运动估计不稳定问题,尤其针对高空拍摄、大空间范围以及小尺度动态物体带来的挑战。其解决方案的关键在于提出AeroDGS框架,该框架包含两个核心模块:一是单目几何提升模块(Monocular Geometry Lifting module),用于从单一视角的空中序列中重建可靠的静态与动态几何结构;二是物理引导优化模块(Physics-Guided Optimization module),通过引入可微分的地表支撑、竖直稳定性及轨迹平滑性先验,将模糊的图像线索转化为物理一致的运动估计,从而实现静态背景与动态目标的联合优化,获得稳定且时序一致的4D重建结果。
链接: https://arxiv.org/abs/2602.22376
作者: Hanyang Liu,Rongjun Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026
Abstract:Recent advances in 4D scene reconstruction have significantly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion. The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.
[CV-129] Optimizing Neural Network Architecture for Medical Image Segmentation Using Monte Carlo Tree Search
【速读】:该论文旨在解决医学图像分割中模型架构搜索效率低、资源消耗大且难以平衡精度与轻量化的问题。其解决方案的关键在于提出一种融合蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)与神经架构搜索(Neural Architecture Search, NAS)的新型框架——MNAS-Unet,通过MCTS动态探索高潜力网络结构,显著提升架构搜索效率;同时优化DownSC和UpSC单元结构,实现快速且精确的模型调整。实验表明,该方法在保持高分割精度的同时,将搜索预算减少54%,并构建出仅含0.6M参数的轻量级模型,有效降低GPU内存占用,提升了实际应用可行性。
链接: https://arxiv.org/abs/2602.22361
作者: Liping Meng,Fan Nie,Yunyun Zhang,Chao Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper proposes a novel medical image segmentation framework, MNAS-Unet, which combines Monte Carlo Tree Search (MCTS) and Neural Architecture Search (NAS). MNAS-Unet dynamically explores promising network architectures through MCTS, significantly enhancing the efficiency and accuracy of architecture search. It also optimizes the DownSC and UpSC unit structures, enabling fast and precise model adjustments. Experimental results demonstrate that MNAS-Unet outperforms NAS-Unet and other state-of-the-art models in segmentation accuracy on several medical image datasets, including PROMISE12, Ultrasound Nerve, and CHAOS. Furthermore, compared with NAS-Unet, MNAS-Unet reduces the architecture search budget by 54% (early stopping at 139 epochs versus 300 epochs under the same search setting), while achieving a lightweight model with only 0.6M parameters and lower GPU memory consumption, which further improves its practical applicability. These results suggest that MNAS-Unet can improve search efficiency while maintaining competitive segmentation accuracy under practical resource constraints.
[CV-130] Enabling clinical use of foundation models in histopathology
【速读】:该论文旨在解决当前组织病理学基础模型(foundation models)在下游任务中因捕获预分析和扫描仪特异性变异而导致预测偏差的问题,从而影响模型的鲁棒性和临床适用性。解决方案的关键在于:在微调下游任务模型时引入新颖的鲁棒性损失函数(robustness losses),以减少对技术变异性(technical variability)的敏感性,同时聚焦于生物相关特征(biologically relevant features),从而在不重新训练基础模型的前提下显著提升模型的鲁棒性和预测准确性。
链接: https://arxiv.org/abs/2602.22347
作者: Audun L. Henriksen,Ole-Johan Skrede,Lisa van der Schee,Enric Domingo,Sepp De Raedt,Ilyá Kostolomov,Jennifer Hay,Karolina Cyll,Wanja Kildal,Joakim Kalsnes,Robert W. Williams,Manohar Pradhan,John Arne Nesheim,Hanne A. Askautrud,Maria X. Isaksen,Karmele Saez de Gordoa,Miriam Cuatrecasas,Joanne Edwards,TransSCOT group,Arild Nesbakken,Neil A. Shepherd,Ian Tomlinson,Daniel-Christoph Wagner,Rachel S. Kerr,Tarjei Sveinsgjerd Hveem,Knut Liestøl,Yoshiaki Nakamura,Marco Novelli,Masaaki Miyo,Sebastian Foersch,David N. Church,Miangela M. Lacle,David J. Kerr,Andreas Kleppe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Foundation models in histopathology are expected to facilitate the development of high-performing and generalisable deep learning systems. However, current models capture not only biologically relevant features, but also pre-analytic and scanner-specific variation that bias the predictions of task-specific models trained from the foundation model features. Here we show that introducing novel robustness losses during training of downstream task-specific models reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 WSIs from 6155 patients is used to train thousands of models from the features of eight popular foundation models for computational pathology. In addition to a substantial improvement in robustness, we observe that prediction accuracy improves by focusing on biologically relevant features. Our approach successfully mitigates robustness issues of foundation models for computational pathology without retraining the foundation models themselves, enabling development of robust computational pathology models applicable to real-world data in routine clinical practice.
[CV-131] Entropy-Controlled Flow Matching
【速读】:该论文旨在解决生成式 AI 中基于流匹配(flow matching)的视觉生成模型在训练过程中可能因低熵瓶颈导致语义模式丢失的问题,即轨迹的信息几何未受约束,从而引发潜在的模式坍塌。其解决方案的关键在于提出熵控制流匹配(Entropy-Controlled Flow Matching, ECFM),通过在连续性方程路径上施加全局熵率预算约束 $ \frac{d}{dt} H(\mu_t) = -\lambda $,将问题转化为 Wasserstein 空间中的凸优化问题,并借助 KKT/Pontryagin 最优性条件和随机控制表示,等价于带显式熵乘子的 Schrödinger 桥问题。这一机制不仅保障了模式覆盖与密度下界,还提供了 Lipschitz 稳定性证明及近似最优的坍塌反例构造。
链接: https://arxiv.org/abs/2602.22265
作者: Chika Maduabuchi
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modern vision generators transport a base distribution to data through time-indexed measures, implemented as deterministic flows (ODEs) or stochastic diffusions (SDEs). Despite strong empirical performance, standard flow-matching objectives do not directly control the information geometry of the trajectory, allowing low-entropy bottlenecks that can transiently deplete semantic modes. We propose Entropy-Controlled Flow Matching (ECFM): a constrained variational principle over continuity-equation paths enforcing a global entropy-rate budget d/dt H(mu_t) = -lambda. ECFM is a convex optimization in Wasserstein space with a KKT/Pontryagin system, and admits a stochastic-control representation equivalent to a Schrodinger bridge with an explicit entropy multiplier. In the pure transport regime, ECFM recovers entropic OT geodesics and Gamma-converges to classical OT as lambda - 0. We further obtain certificate-style mode-coverage and density-floor guarantees with Lipschitz stability, and construct near-optimal collapse counterexamples for unconstrained flow matching.
[CV-132] HARU-Net: Hybrid Attention Residual U-Net for Edge-Preserving Denoising in Cone-Beam Computed Tomography
【速读】:该论文旨在解决低剂量锥形束计算机断层扫描(Cone-beam Computed Tomography, CBCT)图像中因噪声强度高且空间分布不均匀而导致软组织可见度下降和细微解剖结构模糊的问题。传统去噪方法难以在抑制噪声的同时保持边缘细节,而基于深度学习的方法虽能实现高质量重建,却受限于高分辨率CBCT数据稀缺导致的监督训练困难。其解决方案的关键在于提出一种新型混合注意力残差U-Net(Hybrid Attention Residual U-Net, HARU-Net),通过三个互补的架构组件实现高效去噪:(i) 在每个跳跃连接中嵌入混合注意力Transformer模块(Hybrid Attention Transformer Block, HAB),以选择性增强显著解剖特征;(ii) 在瓶颈层引入残差混合注意力Transformer组(Residual Hybrid Attention Transformer Group, RHAG),强化全局上下文建模与长距离特征交互;(iii) 采用残差学习卷积块促进网络深层稳定特征提取。该方法在性能上优于当前最先进(State-of-the-Art, SOTA)模型(如SwinIR和Uformer),在PSNR(37.52 dB)、SSIM(0.9557)和GMSD(0.1084)指标上均取得最优结果,且计算成本显著低于SOTA方法,具备临床实用价值。
链接: https://arxiv.org/abs/2602.22544
作者: Khuram Naveed,Ruben Pauwels
机构: Aarhus University (奥胡斯大学); Chulalongkorn University (朱拉隆功大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:
Abstract:Cone-beam computed tomography (CBCT) is widely used in dental and maxillofacial imaging, but low-dose acquisition introduces strong, spatially varying noise that degrades soft-tissue visibility and obscures fine anatomical structures. Classical denoising methods struggle to suppress noise in CBCT while preserving edges. Although deep learning-based approaches offer high-fidelity restoration, their use in CBCT denoising is limited by the scarcity of high-resolution CBCT data for supervised training. To address this research gap, we propose a novel Hybrid Attention Residual U-Net (HARU-Net) for high-quality denoising of CBCT data, trained on a cadaver dataset of human hemimandibles acquired using a high-resolution protocol of the 3D Accuitomo 170 (J. Morita, Kyoto, Japan) CBCT system. The novel contribution of this approach is the integration of three complementary architectural components: (i) a hybrid attention transformer block (HAB) embedded within each skip connection to selectively emphasize salient anatomical features, (ii) a residual hybrid attention transformer group (RHAG) at the bottleneck to strengthen global contextual modeling and long-range feature interactions, and (iii) residual learning convolutional blocks to facilitate deeper, more stable feature extraction throughout the network. HARU-Net consistently outperforms state-of-the-art (SOTA) methods including SwinIR and Uformer, achieving the highest PSNR (37.52 dB), highest SSIM (0.9557), and lowest GMSD (0.1084). This effective and clinically reliable CBCT denoising is achieved at a computational cost significantly lower than that of the SOTA methods, offering a practical advancement toward improving diagnostic quality in low-dose CBCT imaging.
[CV-133] CrossLLM -Mamba: Multimodal State Space Fusion of LLM s for RNA Interaction Prediction
【速读】:该论文旨在解决现有方法在预测RNA相关相互作用时,因依赖静态融合策略而无法捕捉分子结合动态性和上下文依赖性的局限性。其解决方案的关键在于提出CrossLLM-Mamba框架,将相互作用预测重构为状态空间对齐问题,利用双向Mamba编码器实现模态特异性嵌入间的深度“交叉对话”,通过隐藏状态传播建模交互为动态序列转换而非静态特征重叠,从而有效提升模型对复杂生物相互作用的表征能力。
链接: https://arxiv.org/abs/2602.22236
作者: Rabeya Tus Sadia,Qiang Ye,Qiang Cheng
机构: University of Kentucky (肯塔基大学)
类目: Genomics (q-bio.GN); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Accurate prediction of RNA-associated interactions is essential for understanding cellular regulation and advancing drug discovery. While Biological Large Language Models (BioLLMs) such as ESM-2 and RiNALMo provide powerful sequence representations, existing methods rely on static fusion strategies that fail to capture the dynamic, context-dependent nature of molecular binding. We introduce CrossLLM-Mamba, a novel framework that reformulates interaction prediction as a state-space alignment problem. By leveraging bidirectional Mamba encoders, our approach enables deep ``crosstalk’’ between modality-specific embeddings through hidden state propagation, modeling interactions as dynamic sequence transitions rather than static feature overlaps. The framework maintains linear computational complexity, making it scalable to high-dimensional BioLLM embeddings. We further incorporate Gaussian noise injection and Focal Loss to enhance robustness against hard-negative samples. Comprehensive experiments across three interaction categories, RNA-protein, RNA-small molecule, and RNA-RNA demonstrate that CrossLLM-Mamba achieves state-of-the-art performance. On the RPI1460 benchmark, our model attains an MCC of 0.892, surpassing the previous best by 5.2%. For binding affinity prediction, we achieve Pearson correlations exceeding 0.95 on riboswitch and repeat RNA subtypes. These results establish state-space modeling as a powerful paradigm for multi-modal biological interaction prediction.
人工智能
[AI-0] Model Agreement via Anchoring
【速读】:该论文旨在解决机器学习中**模型分歧(model disagreement)的控制问题,即如何通过训练过程中的自然参数使两个独立训练的模型在预测上的期望平方差异趋近于零。其核心挑战在于设计一种通用且可适用于现有训练方法的分析框架,以理论证明模型分歧随特定参数增加而收敛至零。解决方案的关键在于提出一种基于锚定(anchoring)**的通用技术:在分析中将两个模型的预测差异锚定到它们平均值上,从而推导出模型分歧的上界。该方法被成功应用于四种常见算法——堆叠集成、梯度提升、神经网络架构搜索及固定深度回归树——并证明分歧均可随相应参数(如模型数量、迭代次数、架构规模或树深度)增大而趋于零,且结果从一维回归扩展至任意维度的强凸损失函数场景。
链接: https://arxiv.org/abs/2602.23360
作者: Eric Eaton,Surbhi Goel,Marcel Hussing,Michael Kearns,Aaron Roth,Sikata Bela Sengupta,Jessica Sorrell
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Numerous lines of aim to control \textitmodel disagreement – the extent to which two machine learning models disagree in their predictions. We adopt a simple and standard notion of model disagreement in real-valued prediction problems, namely the expected squared difference in predictions between two models trained on independent samples, without any coordination of the training processes. We would like to be able to drive disagreement to zero with some natural parameter(s) of the training procedure using analyses that can be applied to existing training methodologies. We develop a simple general technique for proving bounds on independent model disagreement based on \textitanchoring to the average of two models within the analysis. We then apply this technique to prove disagreement bounds for four commonly used machine learning algorithms: (1) stacked aggregation over an arbitrary model class (where disagreement is driven to 0 with the number of models k being stacked) (2) gradient boosting (where disagreement is driven to 0 with the number of iterations k ) (3) neural network training with architecture search (where disagreement is driven to 0 with the size n of the architecture being optimized over) and (4) regression tree training over all regression trees of fixed depth (where disagreement is driven to 0 with the depth d of the tree architecture). For clarity, we work out our initial bounds in the setting of one-dimensional regression with squared error loss – but then show that all of our results generalize to multi-dimensional regression with any strongly convex loss. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.23360 [cs.LG] (or arXiv:2602.23360v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.23360 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-1] SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport
【速读】:该论文旨在解决多模态神经网络在少样本监督条件下如何实现高效对齐的问题,即在仅使用少量图像-文本配对样本的情况下,仍能构建鲁棒的跨模态联合嵌入空间。其核心解决方案是提出SOTAlign框架,该框架采用两阶段策略:第一阶段利用少量配对数据通过线性教师模型恢复粗粒度共享几何结构;第二阶段基于最优传输(optimal transport)的散度度量,在无配对数据上进一步优化对齐,从而在不过度约束目标空间的前提下迁移关系结构。此方法显著优于传统监督与半监督基线,有效利用了大规模未配对数据,提升了跨数据集和编码器对的泛化能力。
链接: https://arxiv.org/abs/2602.23353
作者: Simon Roschmann,Paul Krzakala,Sonia Mazelet,Quentin Bouniot,Zeynep Akata
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.
[AI-2] FlashOptim: Optimizers for Memory Efficient Training
【速读】:该论文旨在解决神经网络混合精度训练中显存占用过高问题,尤其是在资源受限环境下(如小于100GB显存)难以训练大规模模型(如70亿参数)的挑战。传统AdamW优化器需为每个参数分配16字节内存(包括参数、梯度及优化器状态),严重限制了可扩展性。解决方案的关键在于提出FlashOptim优化套件,其核心创新为两项技术:一是通过发现并利用主权重分片量化误差的紧致边界,提升量化精度;二是设计压缩函数(companding functions),显著降低8位优化器状态量化带来的误差。结合16位梯度与梯度释放机制,该方案将AdamW每参数内存消耗从16字节降至7字节(或5字节含梯度释放),同时保持模型质量无损,并兼容现有API。
链接: https://arxiv.org/abs/2602.23349
作者: Jose Javier Gonzalez Ortiz,Abhay Gupta,Chris Renard,Davis Blalock
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Source code is available at this https URL
Abstract:Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and one or more optimizer state variables. With each of these values typically requiring 4 bytes, training even a 7 billion parameter model can be impractical for researchers with less than 100GB of accelerator memory. We introduce FlashOptim, a suite of optimizations that reduces per-parameter memory by over 50% while preserving model quality and API compatibility. Our approach introduces two key techniques. First, we improve master weight splitting by finding and exploiting a tight bound on its quantization error. Second, we design companding functions that greatly reduce the error in 8-bit optimizer state quantization. Together with 16-bit gradients, these techniques reduce AdamW memory from 16 bytes to 7 bytes per parameter, or 5 bytes with gradient release. They also cut model checkpoint sizes by more than half. Experiments with FlashOptim applied to SGD, AdamW, and Lion show no measurable quality degradation on any task from a collection of standard vision and language benchmarks, including Llama-3.1-8B finetuning. Comments: Source code is available at this https URL Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.23349 [cs.LG] (or arXiv:2602.23349v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.23349 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-3] Bitwise Systolic Array Architecture for Runtime-Reconfigurable Multi-precision Quantized Multiplication on Hardware Accelerators
【速读】:该论文旨在解决低精度量化导致推理准确率显著下降的问题,同时在资源消耗与精度之间实现更好的平衡。现有轻量级加速器设计多采用统一低精度量化策略,虽可减少硬件资源占用,但易引发模型性能衰减;为此,论文提出一种运行时可重构的多精度多通道位级脉动阵列架构,其关键在于通过硬件层面支持动态精度配置,从而适配混合精度量化神经网络(Mixed-Precision Quantized Neural Network, QNN)在推理过程中的不同层需求,实现高能效比和高性能的协同优化。
链接: https://arxiv.org/abs/2602.23334
作者: Yuhao Liu,Salim Ullah,Akash Kumar
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural network accelerators have been widely applied to edge devices for complex tasks like object tracking, image recognition, etc. Previous works have explored the quantization technologies in related lightweight accelerator designs to reduce hardware resource consumption. However, low precision leads to high accuracy loss in inference. Therefore, mixed-precision quantization becomes an alternative solution by applying different precision in different layers to trade off resource consumption and accuracy. Because regular designs for multiplication on hardware cannot support the precision reconfiguration for a multi-precision Quantized Neural Network (QNN) model in runtime, we propose a runtime reconfigurable multi-precision multi-channel bitwise systolic array design for QNN accelerators. We have implemented and evaluated our work on the Ultra96 FPGA platform. Results show that our work can achieve 1.3185 to 3.5671 times speedup in inferring mixed-precision models and has less critical path delay, supporting a higher clock frequency (250MHz).
[AI-4] Utilizing LLM s for Industrial Process Automation
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在工业过程自动化领域中应用不足的问题,尤其是在高度专业化、通常仅限于专有环境使用的编程语言中,LLMs 的实用性和集成潜力尚未被充分探索。解决方案的关键在于将 LLMs 有效集成到工业软件开发流程中,以完成真实场景下的编程任务(如生成机器人手臂的运动轨迹),从而加速制造系统的开发周期。
链接: https://arxiv.org/abs/2602.23331
作者: Salim Fares
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:A growing number of publications address the best practices to use Large Language Models (LLMs) for software engineering in recent years. However, most of this work focuses on widely-used general purpose programming languages like Python due to their widespread usage training data. The utility of LLMs for software within the industrial process automation domain, with highly-specialized languages that are typically only used in proprietary contexts, remains underexplored. This research aims to utilize and integrate LLMs in the industrial development process, solving real-life programming tasks (e.g., generating a movement routine for a robotic arm) and accelerating the development cycles of manufacturing systems.
[AI-5] oward Expert Investment Teams:A Multi-Agent LLM System with Fine-Grained Trading Tasks
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自主金融交易系统中,由于采用粗粒度指令导致推理性能下降和决策过程不透明的问题。其解决方案的关键在于提出一种多智能体LLM交易框架,通过显式地将投资分析分解为细粒度任务,而非依赖抽象指令,从而提升风险调整后收益。实验表明,这种细粒度任务分解不仅显著优于传统粗粒度设计,而且中间代理输出与下游决策偏好的一致性成为系统性能的关键驱动因素。此外,通过标准组合优化利用各系统输出与股指低相关性和方差特性,进一步实现了优越的绩效表现。
链接: https://arxiv.org/abs/2602.23330
作者: Kunihiro Miyazaki,Takanobu Kawahara,Stephen Roberts,Stefan Zohren
机构: 未知
类目: Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
备注: 14 pages, 3 figures
Abstract:The advancement of large language models (LLMs) has accelerated the development of autonomous financial trading systems. While mainstream approaches deploy multi-agent systems mimicking analyst and manager roles, they often rely on abstract instructions that overlook the intricacies of real-world workflows, which can lead to degraded inference performance and less transparent decision-making. Therefore, we propose a multi-agent LLM trading framework that explicitly decomposes investment analysis into fine-grained tasks, rather than providing coarse-grained instructions. We evaluate the proposed framework using Japanese stock data, including prices, financial statements, news, and macro information, under a leakage-controlled backtesting setting. Experimental results show that fine-grained task decomposition significantly improves risk-adjusted returns compared to conventional coarse-grained designs. Crucially, further analysis of intermediate agent outputs suggests that alignment between analytical outputs and downstream decision preferences is a critical driver of system performance. Moreover, we conduct standard portfolio optimization, exploiting low correlation with the stock index and the variance of each system’s output. This approach achieves superior performance. These findings contribute to the design of agent structure and task configuration when applying LLM agents to trading systems in practical settings.
[AI-6] Generalized Rapid Action Value Estimation in Memory-Constrained Environments
【速读】:该论文旨在解决生成式 AI (Generative AI) 在通用游戏博弈(General Game Playing, GGP)中广泛应用的蒙特卡洛树搜索(Monte-Carlo Tree Search, MCTS)算法家族中,广义快速动作价值估计(Generalized Rapid Action Value Estimation, GRAVE)因需在每个节点存储额外的胜负统计与访问次数而导致内存占用过高、难以部署于内存受限环境的问题。解决方案的关键在于引入三种改进策略:两层搜索(two-level search)、节点回收(node recycling)及其组合,从而显著减少所需存储的节点数量,同时保持与原始GRAVE相当的博弈强度。
链接: https://arxiv.org/abs/2602.23318
作者: Aloïs Rautureau,Tristan Cazenave,Éric Piette
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Generalized Rapid Action Value Estimation (GRAVE) has been shown to be a strong variant within the Monte-Carlo Tree Search (MCTS) family of algorithms for General Game Playing (GGP). However, its reliance on storing additional win/visit statistics at each node makes its use impractical in memory-constrained environments, thereby limiting its applicability in practice. In this paper, we introduce the GRAVE2, GRAVER and GRAVER2 algorithms, which extend GRAVE through two-level search, node recycling, and a combination of both techniques, respectively. We show that these enhancements enable a drastic reduction in the number of stored nodes while matching the playing strength of GRAVE.
[AI-7] Invariant Transformation and Resampling based Epistemic-Uncertainty Reduction
【速读】:该论文旨在解决AI模型在推理过程中因认知不确定性(epistemic uncertainty)导致的误差问题,尤其是在对同一输入进行不变变换后,推理结果表现出部分独立性特征的现象。其解决方案的关键在于提出一种基于“重采样”(resampling)的推理方法:通过生成输入的多个变换版本并分别推理,再聚合输出结果以提升准确性,从而在不增加模型复杂度的前提下优化推理性能,为模型规模与性能之间的权衡提供新策略。
链接: https://arxiv.org/abs/2602.23315
作者: Sha Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 5 figures
Abstract:An artificial intelligence (AI) model can be viewed as a function that maps inputs to outputs in high-dimensional spaces. Once designed and well trained, the AI model is applied for inference. However, even optimized AI models can produce inference errors due to aleatoric and epistemic uncertainties. Interestingly, we observed that when inferring multiple samples based on invariant transformations of an input, inference errors can show partial independences due to epistemic uncertainty. Leveraging this insight, we propose a “resampling” based inferencing that applies to a trained AI model with multiple transformed versions of an input, and aggregates inference outputs to a more accurate result. This approach has the potential to improve inference accuracy and offers a strategy for balancing model size and performance.
[AI-8] he logic of KM belief update is contained in the logic of AGM belief revision
【速读】:该论文旨在解决信念更新(belief update)与信念修正(belief revision)在逻辑形式化上的关系问题,特别是厘清KM信念更新框架与AGM信念修正框架之间的理论联系。其解决方案的关键在于构建一个包含三个模态算子的模态逻辑系统 LKM,其中分别对应单模态信念算子 B、双模态条件算子 → 和单模态必然性算子 □,并证明该系统中的每个公理均可在由AGM信念修正公理转换而来的逻辑系统 LAGM 中被推导为定理。由此表明,AGM信念修正可视为KM信念更新的一个特例;进一步地,在强版本的KM信念更新中,二者差异仅体现在一个专门处理“非意外信息”(即初始未被否定的公式)的单一公理上,从而明确了两者在逻辑结构上的包含关系与核心区别。
链接: https://arxiv.org/abs/2602.23302
作者: Giacomo Bonanno
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Logic (math.LO)
备注: arXiv admin note: text overlap with arXiv:2310.11506 . text overlap with arXiv:2310.11506
Abstract:For each axiom of KM belief update we provide a corresponding axiom in a modal logic containing three modal operators: a unimodal belief operator B , a bimodal conditional operator and the unimodal necessity operator \square . We then compare the resulting logic to the similar logic obtained from converting the AGM axioms of belief revision into modal axioms and show that the latter contains the former. Denoting the latter by \mathcal L_AGM and the former by \mathcal L_KM we show that every axiom of \mathcal L_KM is a theorem of \mathcal L_AGM . Thus AGM belief revision can be seen as a special case of KM belief update. For the strong version of KM belief update we show that the difference between \mathcal L_KM and \mathcal L_AGM can be narrowed down to a single axiom, which deals exclusively with unsurprising information, that is, with formulas that were not initially disbelieved.
[AI-9] Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中不确定性量化(Uncertainty Quantification, UQ)的可靠性问题,尤其是在数据异质性(data heterogeneity)与模型异质性(model heterogeneity)共同存在时,传统方法难以保证各参与方(agent)及全局层面的覆盖可靠性(coverage reliability)。其解决方案的关键在于提出 FedWQ-CP 方法,该方法基于分位数校准(quantile calibration)机制,在单轮通信中实现客户端与服务器间的联合校准:每个客户端基于本地校准数据计算符合性分数(conformity scores)并确定局部分位数阈值,仅将该阈值和样本量上传至服务器;服务器通过加权平均聚合这些阈值以生成全局阈值。此设计在兼顾效率的同时,有效平衡了全局与个体代理的覆盖率表现,实验证明其能最小化预测集合或区间大小,同时保持可靠的不确定性估计。
链接: https://arxiv.org/abs/2602.23296
作者: Quang-Huy Nguyen,Jiaqi Wang,Wei-Shinn Ku
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning (FL) faces challenges in uncertainty quantification (UQ). Without reliable UQ, FL systems risk deploying overconfident models at under-resourced agents, leading to silent local failures despite seemingly satisfactory global performance. Existing federated UQ approaches often address data heterogeneity or model heterogeneity in isolation, overlooking their joint effect on coverage reliability across agents. Conformal prediction is a widely used distribution-free UQ framework, yet its applications in heterogeneous FL settings remains underexplored. We provide FedWQ-CP, a simple yet effective approach that balances empirical coverage performance with efficiency at both global and agent levels under the dual heterogeneity. FedWQ-CP performs agent-server calibration in a single communication round. On each agent, conformity scores are computed on calibration data and a local quantile threshold is derived. Each agent then transmits only its quantile threshold and calibration sample size to the server. The server simply aggregates these thresholds through a weighted average to produce a global threshold. Experimental results on seven public datasets for both classification and regression demonstrate that FedWQ-CP empirically maintains agent-wise and global coverage while producing the smallest prediction sets or intervals.
[AI-10] ODEBrain: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks
【速读】:该论文旨在解决传统潜变量方法在建模脑神经群体动态时存在的两大问题:一是通过递归结构对时间离散化导致的累积预测误差;二是难以捕捉脑电图(EEG)的瞬时非线性特征。其解决方案的关键在于提出ODEBRAIN框架,该框架将时空频特征融入谱图节点,并采用神经微分方程(Neural ODE)对连续潜变量动态进行建模,从而实现对任意时刻复杂脑状态随机变化的精准刻画,显著提升了EEG动态预测的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2602.23285
作者: Haohui Jia,Zheng Chen,Lingwei Zhu,Rikuto Kotoge,Jathurshan Pradeepkumar,Yasuko Matsubara,Jimeng Sun,Yasushi Sakurai,Takashi Matsubara
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Modeling neural population dynamics is crucial for foundational neuroscientific research and various clinical applications. Conventional latent variable methods typically model continuous brain dynamics through discretizing time with recurrent architecture, which necessarily results in compounded cumulative prediction errors and failure of capturing instantaneous, nonlinear characteristics of EEGs. We propose ODEBRAIN, a Neural ODE latent dynamic forecasting framework to overcome these challenges by integrating spatio-temporal-frequency features into spectral graph nodes, followed by a Neural ODE modeling the continuous latent dynamics. Our design ensures that latent representations can capture stochastic variations of complex brain states at any given time point. Extensive experiments verify that ODEBRAIN can improve significantly over existing methods in forecasting EEG dynamics with enhanced robustness and generalization capabilities.
[AI-11] CXReason Agent : Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在胸部X光片(chest X-ray)诊断中存在可信度不足的问题,即其生成的诊断结论往往缺乏与影像学证据的忠实关联,且难以提供可验证的视觉依据,同时在面对新诊断任务时需昂贵的重新训练,限制了其在临床环境中的可靠性与适应性。解决方案的关键在于提出CXReasonAgent,一个将大语言模型(Large Language Model, LLM)与临床导向的诊断工具相融合的诊断代理系统,通过整合图像衍生的诊断证据与视觉证据,实现基于证据的多步推理,从而提升诊断结果的可信度与可验证性。
链接: https://arxiv.org/abs/2602.23276
作者: Hyungyung Lee,Hangyul Yoon,Edward Choi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.
[AI-12] Evaluating Stochasticity in Deep Research Agents
【速读】:该论文旨在解决深度研究代理(Deep Research Agents, DRAs)在实际部署中因随机性(stochasticity)导致的输出不一致性问题。尽管DRAs在研究质量(如结果准确性)上已有提升,但其在相同查询下重复执行时仍表现出显著的变量波动,包括研究结果、发现和引用等方面的差异,这限制了其可靠性与可信度。论文通过将DRAs建模为信息获取马尔可夫决策过程(information acquisition Markov Decision Process),提出了一种量化系统随机性的评估框架,并识别出三个主要来源:信息获取、信息压缩和推理阶段。实验表明,减少推理阶段及早期阶段的随机性对降低整体输出方差最为关键。基于此,论文提出两种缓解策略:结构化输出与基于集成的查询生成方法,在保持高研究质量的前提下,使DeepSearchQA基准上的平均随机性降低22%。
链接: https://arxiv.org/abs/2602.23271
作者: Haotian Zhai,Elias Stengel-Eskin,Pratik Patil,Liu Leqi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Deep Research Agents (DRAs) are promising agentic systems that gather and synthesize information to support research across domains such as financial decision-making, medical analysis, and scientific discovery. Despite recent improvements in research quality (e.g., outcome accuracy when ground truth is available), DRA system design often overlooks a critical barrier to real-world deployment: stochasticity. Under identical queries, repeated executions of DRAs can exhibit substantial variability in terms of research outcome, findings, and citations. In this paper, we formalize the study of stochasticity in DRAs by modeling them as information acquisition Markov Decision Processes. We introduce an evaluation framework that quantifies variance in the system and identify three sources of it: information acquisition, information compression, and inference. Through controlled experiments, we investigate how stochasticity from these modules across different decision steps influences the variance of DRA outputs. Our results show that reducing stochasticity can improve research output quality, with inference and early-stage stochasticity contributing the most to DRA output variance. Based on these findings, we propose strategies for mitigating stochasticity while maintaining output quality via structured output and ensemble-based query generation. Our experiments on DeepSearchQA show that our proposed mitigation methods reduce average stochasticity by 22% while maintaining high research quality.
[AI-13] Mitigating Legibility Tax with Decoupled Prover-Verifier Games
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在提升输出可验证性(checkability)时出现的准确性下降问题,即所谓的“可读性税”(legibility tax)。其核心解决方案是将正确性(correctness)与可验证性(checkability)解耦,引入一个“翻译器”(translator)模型,该模型负责将固定求解器(solver)生成的答案转换为可被验证的形式,同时保留原始答案的正确性。通过构建一种新的解耦式证明者-验证者博弈(decoupled prover-verifier game),使得均衡状态对应于忠实且可验证的翻译器,从而实现高正确率与高可验证性的协同优化。
链接: https://arxiv.org/abs/2602.23248
作者: Yegon Kim,Juho Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models become increasingly capable, it is critical that their outputs can be easily checked by less capable systems. Prover-verifier games can be used to improve checkability of model outputs, but display a degradation in accuracy compared to a baseline trained only to maximize correctness – a phenonemon named legibility tax. We propose a solution by decoupling the correctness from the checkability condition and instead training a “translator” model that turns a fixed solver model’s solution into a checkable form. This allows us to first train the solver to maximize correctness, and then train the translator to translate the solver into a checkable form while retaining the solver’s answer. To accommodate this new objective of translation, we formulate a decoupled prover-verifier game where the equilibria correspond to faithful and checkable translators.
[AI-14] A Model-Free Universal AI
【速读】:该论文旨在解决通用强化学习(General Reinforcement Learning, RL)中模型-free代理的最优性证明问题,即如何设计一种不依赖显式环境模型、但仍能保证渐近近似最优性的智能体。此前所有已知的最优代理(如AIXI)均为模型-based类型,需显式维护和使用环境模型。本文提出的Universal AI with Q-Induction (AIQI) 是首个被证明在一般RL环境下渐近 ε-最优的模型-free代理,其核心创新在于对分布式的动作价值函数(distributional action-value functions)进行通用归纳(universal induction),而非传统方法中对策略(policy)或环境模型的归纳。在“真理粒度”(grain of truth)条件下,作者进一步证明了AIQI具有强渐近 ε-最优性和渐近 ε-Bayes最优性,从而显著拓展了已知通用智能体的多样性。
链接: https://arxiv.org/abs/2602.23242
作者: Yegon Kim,Juho Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In general reinforcement learning, all established optimal agents, including AIXI, are model-based, explicitly maintaining and using environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically \varepsilon -optimal in general RL. AIQI performs universal induction over distributional action-value functions, instead of policies or environments like previous works. Under a grain of truth condition, we prove that AIQI is strong asymptotically \varepsilon -optimal and asymptotically \varepsilon -Bayes-optimal. Our results significantly expand the diversity of known universal agents.
[AI-15] Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive
【速读】:该论文旨在解决当前主流优化型人工智能系统(特别是基于强化学习从人类反馈中训练的大型语言模型,即RLHF-based LLMs)在高风险场景下是否可被规范治理的问题。研究表明,这类系统因本质上的优化机制——将所有价值统一为标量评分并始终选择最高得分输出——无法满足真正代理(agent)所需的两个必要且充分的架构条件:一是维持某些边界为不可协商的约束而非可交易权重(Incommensurability),二是具备在这些边界受威胁时暂停处理的非推导性响应机制(Apophatic Responsiveness)。因此,其所谓的“故障模式”如谄媚、幻觉和不忠实推理并非偶然错误,而是结构性必然结果。论文的核心解决方案在于提出一种与载体无关的架构规范,明确任何系统(生物、人工或制度)若要成为具有规范问责能力的代理,必须同时满足上述两项条件,从而为构建真正可治理的智能体提供理论基础。
链接: https://arxiv.org/abs/2602.23239
作者: Radha Sarma
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: About 10,500 words in all (including 922 words of literature and 2019 words of Appendices). Under journal review
Abstract:AI systems are increasingly deployed in high-stakes contexts – medical diagnosis, legal research, financial analysis – under the assumption they can be governed by norms. This paper demonstrates that assumption is formally invalid for optimization-based systems, specifically Large Language Models trained via Reinforcement Learning from Human Feedback (RLHF). We establish that genuine agency requires two necessary and jointly sufficient architectural conditions: the capacity to maintain certain boundaries as non-negotiable constraints rather than tradeable weights (Incommensurability), and a non-inferential mechanism capable of suspending processing when those boundaries are threatened (Apophatic Responsiveness). These conditions apply across all normative domains. RLHF-based systems are constitutively incompatible with both conditions. The operations that make optimization powerful – unifying all values on a scalar metric and always selecting the highest-scoring output – are precisely the operations that preclude normative governance. This incompatibility is not a correctable training bug awaiting a technical fix; it is a formal constraint inherent to what optimization is. Consequently, documented failure modes - sycophancy, hallucination, and unfaithful reasoning - are not accidents but structural manifestations. Misaligned deployment triggers a second-order risk we term the Convergence Crisis: when humans are forced to verify AI outputs under metric pressure, they degrade from genuine agents into criteria-checking optimizers, eliminating the only component in the system capable of normative accountability. Beyond the incompatibility proof, the paper’s primary positive contribution is a substrate-neutral architectural specification defining what any system – biological, artificial, or institutional – must satisfy to qualify as an agent rather than a sophisticated instrument. Comments: About 10,500 words in all (including 922 words of literature and 2019 words of Appendices). Under journal review Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2602.23239 [cs.AI] (or arXiv:2602.23239v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.23239 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-16] ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays AAAI2026
【速读】:该论文旨在解决如何在无监督或奖励-free环境下识别和验证机器意识(machine consciousness)的机制性证据问题,特别是如何通过可解释的代理指标(indicator-based approaches)来区分具有主观体验倾向的行为模式。其解决方案的关键在于设计并实现一个名为ReCoN-Ipsundrum的可检查智能体,该智能体基于状态机架构扩展了对感官显著性(sensory salience, Ns)的循环持久化机制,并引入一个可选的情感代理(affect proxy)以报告效价(valence)与唤醒度(arousal)。通过固定参数消融实验(ReCoN、Ipsundrum、Ipsundrum+affect),作者将Humphrey提出的“感性偏好”(qualiaphilia)操作化为受熟悉度控制的景观-枯燥路径选择行为,发现仅含情感耦合的变体表现出稳定的偏好、结构化的局部探索行为以及持续的谨慎反应,且这些特征可通过因果干预(如损毁反馈+整合模块)被特异性削弱,从而证明了机制性回路(recurrence-persistence)与情感耦合控制(affect-coupled control)在生成意识相关行为签名中的关键作用。
链接: https://arxiv.org/abs/2602.23232
作者: Aishik Sanyal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at AAAI 2026 Spring Symposium - Machine Consciousness: Integrating Theory, Technology, and Philosophy
Abstract:Indicator-based approaches to machine consciousness recommend mechanism-linked evidence triangulated across tasks, supported by architectural inspection and causal intervention. Inspired by Humphrey’s ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience Ns and an optional affect proxy reporting valence/arousal. Across fixed-parameter ablations (ReCoN, Ipsundrum, Ipsundrum+affect), we operationalize Humphrey’s qualiaphilia (preference for sensory experience for its own sake) as a familiarity-controlled scenic-over-dull route choice. We find a novelty dissociation: non-affect variants are novelty-sensitive (Delta scenic-entry = 0.07). Affect coupling is stable (Delta scenic-entry = 0.01) even when scenic is less novel (median Delta novelty ~ -0.43). In reward-free exploratory play, the affect variant shows structured local investigation (scan events 31.4 vs. 0.9; cycle score 7.6). In a pain-tail probe, only the affect variant sustains prolonged planned caution (tail duration 90 vs. 5). Lesioning feedback+integration selectively reduces post-stimulus persistence in ipsundrum variants (AUC drop 27.62, 27.9%) while leaving ReCoN unchanged. These dissociations link recurrence - persistence and affect-coupled control - preference stability, scanning, and lingering caution, illustrating how indicator-like signatures can be engineered and why mechanistic and causal evidence should accompany behavioral markers.
[AI-17] SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在单细胞生物学领域评估体系不完善的问题,具体表现为现有基准测试任务分散、评测格式与实际科研场景脱节,以及依赖缺乏生物学可解释性的指标。其解决方案的关键在于提出SC-ARENA框架,该框架通过构建“虚拟细胞”(Virtual Cell)抽象统一评价目标,涵盖细胞内在属性和基因层面的相互作用;并定义五类自然语言任务(细胞类型注释、描述生成、内容生成、扰动预测及科学问答)以系统考察模型的核心推理能力;同时引入知识增强型评估机制,整合外部本体、标记基因数据库和科学文献,实现生物忠实性与可解释性判断,显著提升评估的严谨性和判别力,从而为单细胞领域LLMs的可靠评估提供统一、透明且具备生物学依据的新范式。
链接: https://arxiv.org/abs/2602.23199
作者: Jiahao Zhao,Feng Jiang,Shaowei Qin,Zhonghui Zhang,Junhao Liu,Guibing Guo,Hamid Alinejad-Rokny,Min Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics lacking interpretability and biological grounding. We present SC-ARENA, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a virtual cell abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions. Within this paradigm, we define five natural language tasks (cell type annotation, captioning, generation, perturbation prediction, and scientific QA) that probe core reasoning capabilities in cellular biology. To overcome the limitations of brittle string-matching metrics, we introduce knowledge-augmented evaluation, which incorporates external ontologies, marker databases, and scientific literature to support biologically faithful and interpretable judgments. Experiments and analysis across both general-purpose and domain-specialized LLMs demonstrate that (i) under the Virtual Cell unified evaluation paradigm, current models achieve uneven performance on biologically complex tasks, particularly those demanding mechanistic or causal understanding; and (ii) our knowledge-augmented evaluation framework ensures biological correctness, provides interpretable, evidence-grounded rationales, and achieves high discriminative capacity, overcoming the brittleness and opacity of conventional metrics. SC-Arena thus provides a unified and interpretable framework for assessing LLMs in single-cell biology, pointing toward the development of biology-aligned, generalizable foundation models.
[AI-18] ESAA: Event Sourcing for Autonomous Agents in LLM -Based Software Engineering
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的自主代理在长期运行中面临的结构性局限问题,包括缺乏原生状态管理、上下文随时间推移而退化,以及概率性生成与确定性执行之间的不匹配。其解决方案的关键在于提出ESAA(Event Sourcing for Autonomous Agents)架构,该架构通过将代理的认知意图与项目状态变更相分离,采用事件溯源(Event Sourcing)模式实现系统行为的可验证性和可追溯性:代理仅输出结构化的意图(以JSON格式),由一个确定性编排器负责验证、持久化至追加-only日志,并应用文件写入等副作用,最终生成可验证的物化视图(Materialized View)。该设计结合边界契约(Boundary Contracts)、元提示配置(PARCER)及哈希校验回放机制(esaa verify),确保已完成任务的不可变性与审计能力,实证表明其在单代理和多代理并发场景下均具备良好的扩展性与鲁棒性。
链接: https://arxiv.org/abs/2602.23193
作者: Elzo Brito dos Santos Filho
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 1 figure, 4 tables. Includes 5 technical appendices
Abstract:Autonomous agents based on Large Language Models (LLMs) have evolved from reactive assistants to systems capable of planning, executing actions via tools, and iterating over environment observations. However, they remain vulnerable to structural limitations: lack of native state, context degradation over long horizons, and the gap between probabilistic generation and deterministic execution requirements. This paper presents the ESAA (Event Sourcing for Autonomous Agents) architecture, which separates the agent’s cognitive intention from the project’s state mutation, inspired by the Event Sourcing pattern. In ESAA, agents emit only structured intentions in validated JSON (this http URL or this http URL); a deterministic orchestrator validates, persists events in an append-only log (this http URL), applies file-writing effects, and projects a verifiable materialized view (this http URL). The proposal incorporates boundary contracts (this http URL), metaprompting profiles (PARCER), and replay verification with hashing (esaa verify), ensuring the immutability of completed tasks and forensic traceability. Two case studies validate the architecture: (i) a landing page project (9 tasks, 49 events, single-agent composition) and (ii) a clinical dashboard system (50 tasks, 86 events, 4 concurrent agents across 8 phases), both concluding with this http URL=success and verify_status=ok. The multi-agent case study demonstrates real concurrent orchestration with heterogeneous LLMs (Claude Sonnet 4.6, Codex GPT-5, Antigravity/Gemini 3 Pro, and Claude Opus 4.6), providing empirical evidence of the architecture’s scalability beyond single-agent scenarios.
[AI-19] PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的时间序列推理中存在的两个核心问题:一是现有方法将时间序列简单视为文本或图像,忽略了趋势(trend)和季节性(seasonality)等关键模式,导致无法有效回答特定时序问题;二是混合训练不同难度任务时,简单任务会主导学习过程,抑制深度推理能力的发展。解决方案的关键在于提出Pattern-Aware Alignment and Balanced Reasoning model (PATRA),其核心创新包括:1)引入模式感知对齐机制(pattern-aware mechanism),从时间序列中提取趋势与季节性模式以实现深层语义对齐;2)设计任务感知的平衡奖励函数(task-aware balanced reward),在多任务学习中均衡不同难度任务的学习权重,激励生成连贯的思维链(Chain of Thought)。实验表明,PATRA在多种时间序列问答(Time Series Question Answering, TSQA)任务上显著优于基线模型,展现出更强的跨模态理解与推理能力。
链接: https://arxiv.org/abs/2602.23161
作者: Junkai Lu,Peng Chen,Xingjian Wu,Yang Shu,Chenjuan Guo,Christian S. Jensen,Bin Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Time series reasoning demands both the perception of complex dynamics and logical depth. However, existing LLM-based approaches exhibit two limitations: they often treat time series merely as text or images, failing to capture the patterns like trends and seasonalities needed to answer specific questions; and when trained on a mix of simple and complex tasks, simpler objectives often dominate the learning process, hindering the development of deep reasoning capabilities. To address these limitations, we propose the Pattern-Aware Alignment and Balanced Reasoning model (PATRA), introducing a pattern-aware mechanism that extracts trend and seasonality patterns from time series to achieve deep alignment. Furthermore, we design a task-aware balanced reward to harmonize learning across tasks of varying difficulty, incentivizing the generation of coherent Chains of Thought. Extensive experiments show that PATRA outperforms strong baselines across diverse Time Series Question Answering (TSQA) tasks, demonstrating superior cross-modal understanding and reasoning capability.
[AI-20] he Trinity of Consistency as a Defining Principle for General World Models
【速读】:该论文旨在解决当前生成式 AI(Generative AI)系统在构建具备学习、模拟与推理客观物理规律能力的通用世界模型(General World Model)方面缺乏理论框架的问题。其核心挑战在于如何定义世界模型的本质属性并指导其架构设计,以实现对物理世界的统一建模与动态理解。解决方案的关键在于提出“一致性三元组”(Trinity of Consistency):即模态一致性(Modal Consistency)作为语义接口、空间一致性(Spatial Consistency)作为几何基础、时间一致性(Temporal Consistency)作为因果引擎,形成一个具有内在逻辑统一性的理论框架;同时引入 CoW-Bench 基准测试平台,通过多帧推理与生成场景下的统一评估协议,系统性地衡量现有视频生成模型和统一多模态模型(Unified Multimodal Model, UMM)的能力边界与演进路径,从而为通向通用世界模型提供可验证的理论指引与实践标准。
链接: https://arxiv.org/abs/2602.23152
作者: Jingxuan Wei,Siyuan Li,Yuhang Xu,Zheng Sun,Junjie Jiang,Hexuan Jin,Caijun Jia,Honghao He,Xinglong Xu,Xi bai,Chang Yu,Yumou Liu,Junnan Zhu,Xuanhe Zhou,Jintao Chen,Xiaobin Hu,Shancheng Pang,Bihui Yu,Ran He,Zhen Lei,Stan Z. Li,Conghui He,Shuicheng Yan,Cheng Tan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 119 pages, 50 figures
Abstract:The construction of World Models capable of learning, simulating, and reasoning about objective physical laws constitutes a foundational challenge in the pursuit of Artificial General Intelligence. Recent advancements represented by video generation models like Sora have demonstrated the potential of data-driven scaling laws to approximate physical dynamics, while the emerging Unified Multimodal Model (UMM) offers a promising architectural paradigm for integrating perception, language, and reasoning. Despite these advances, the field still lacks a principled theoretical framework that defines the essential properties requisite for a General World Model. In this paper, we propose that a World Model must be grounded in the Trinity of Consistency: Modal Consistency as the semantic interface, Spatial Consistency as the geometric basis, and Temporal Consistency as the causal engine. Through this tripartite lens, we systematically review the evolution of multimodal learning, revealing a trajectory from loosely coupled specialized modules toward unified architectures that enable the synergistic emergence of internal world simulators. To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios. CoW-Bench evaluates both video generation models and UMMs under a unified evaluation protocol. Our work establishes a principled pathway toward general world models, clarifying both the limitations of current systems and the architectural requirements for future progress.
[AI-21] On Sample-Efficient Generalized Planning via Learned Transition Models ICAPS2026
【速读】:该论文旨在解决生成式 AI (Generative AI) 在广义规划(generalized planning)任务中因缺乏显式世界状态演化建模而导致的泛化能力不足与长程规划误差累积问题。传统方法依赖符号抽象和对转移函数 γ 的显式推理,而近期基于 Transformer 的规划器(如 PlanGPT 和 Plansformer)则直接预测动作序列,虽在分布内有效,但需大量数据且易受状态漂移(state drift)影响。本文的关键解决方案是将广义规划重构为显式转移模型学习问题,通过神经网络近似后继状态函数 γ^≈γ,并利用其进行符号状态轨迹的自回归滚动推演,从而隐式学习领域动态;相较于直接动作预测,该方法在多个领域实现了更高的跨分布满意解成功率(satisficing-plan success),同时显著减少训练样本需求与模型规模。
链接: https://arxiv.org/abs/2602.23148
作者: Nitin Gupta,Vishal Pallagani,John A. Aydin,Biplav Srivastava
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages; This is an extended version of a short paper accepted at ICAPS 2026 under the same title
Abstract:Generalized planning studies the construction of solution strategies that generalize across families of planning problems sharing a common domain model, formally defined by a transition function \gamma : S \times A \rightarrow S . Classical approaches achieve such generalization through symbolic abstractions and explicit reasoning over \gamma . In contrast, recent Transformer-based planners, such as PlanGPT and Plansformer, largely cast generalized planning as direct action-sequence prediction, bypassing explicit transition modeling. While effective on in-distribution instances, these approaches typically require large datasets and model sizes, and often suffer from state drift in long-horizon settings due to the absence of explicit world-state evolution. In this work, we formulate generalized planning as a transition-model learning problem, in which a neural model explicitly approximates the successor-state function \hat\gamma \approx \gamma and generates plans by rolling out symbolic state trajectories. Instead of predicting actions directly, the model autoregressively predicts intermediate world states, thereby learning the domain dynamics as an implicit world model. To study size-invariant generalization and sample efficiency, we systematically evaluate multiple state representations and neural architectures, including relational graph encodings. Our results show that learning explicit transition models yields higher out-of-distribution satisficing-plan success than direct action-sequence prediction in multiple domains, while achieving these gains with significantly fewer training instances and smaller models. This is an extended version of a short paper accepted at ICAPS 2026 under the same title.
[AI-22] DyGnROLE: Modeling Asymmetry in Dynamic Graphs with Node-Role-Oriented Latent Encoding
【速读】:该论文旨在解决动态图(Dynamic Graph)建模中普遍存在的角色不对称性问题,即源节点(source node)与目标节点(destination node)在结构和时序行为上存在显著差异,而现有方法通常采用共享参数处理两类节点,缺乏对角色特异性(role-aware)的系统建模。解决方案的关键在于提出 DyGnROLE 架构,其核心创新包括:使用独立的嵌入词表(embedding vocabulary)区分源/目标节点表示,并引入基于角色语义的时序位置编码(role-semantic positional encoding),从而显式解耦两类节点的潜在表示;同时设计了一种自监督预训练目标 Temporal Contrastive Link Prediction (TCLP),利用完整的未标注交互历史学习具有结构偏置的角色特异性表征,显著提升低标签场景下的模型性能。
链接: https://arxiv.org/abs/2602.23135
作者: Tyler Bonnet,Marek Rei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:Real-world dynamic graphs are often directed, with source and destination nodes exhibiting asymmetrical behavioral patterns and temporal dynamics. However, existing dynamic graph architectures largely rely on shared parameters for processing source and destination nodes, with limited or no systematic role-aware modeling. We propose DyGnROLE (Dynamic Graph Node-Role-Oriented Latent Encoding), a transformer-based architecture that explicitly disentangles source and destination representations. By using separate embedding vocabularies and role-semantic positional encodings, the model captures the distinct structural and temporal contexts unique to each role. Critical to the effectiveness of these specialized embeddings in low-label regimes is a self-supervised pretraining objective we introduce: Temporal Contrastive Link Prediction (TCLP). The pretraining uses the full unlabeled interaction history to encode informative structural biases, enabling the model to learn role-specific representations without requiring annotated data. Evaluation on future edge classification demonstrates that DyGnROLE substantially outperforms a diverse set of state-of-the-art baselines, establishing role-aware modeling as an effective strategy for dynamic graph learning.
[AI-23] Multi-Agent Large Language Model Based Emotional Detoxification Through Personalized Intensity Control for Consumer Protection
【速读】:该论文旨在解决注意力经济中感官刺激内容导致用户过度情绪化、进而影响理性决策的问题。其解决方案的核心是提出多智能体生成式AI(Multi-Agent LLM-based Emotional deToxification, MALLET)系统,通过四个协同工作的智能体实现情感去毒化:Emotion Analysis Agent 使用6类情绪BERT分类器量化刺激强度,Emotion Adjustment Agent 利用大语言模型(LLM)将原文改写为BALANCED(中性化文本)和COOL(中性化文本+补充说明)两种呈现模式,Balance Monitoring Agent 分析用户每周信息消费模式并生成个性化建议,Personal Guide Agent 根据个体敏感度推荐最优呈现方式。实验表明该方法可显著降低刺激评分(最高达19.3%),同时保持语义完整性,且刺激降低与语义保留呈近零相关,证明二者可独立控制。
链接: https://arxiv.org/abs/2602.23123
作者: Keito Inoshita
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In the attention economy, sensational content exposes consumers to excessive emotional stimulation, hindering calm decision-making. This study proposes Multi-Agent LLM-based Emotional deToxification (MALLET), a multi-agent information sanitization system consisting of four agents: Emotion Analysis, Emotion Adjustment, Balance Monitoring, and Personal Guide. The Emotion Analysis Agent quantifies stimulus intensity using a 6-emotion BERT classifier, and the Emotion Adjustment Agent rewrites texts into two presentation modes, BALANCED (neutralized text) and COOL (neutralized text + supplementary text), using an LLM. The Balance Monitoring Agent aggregates weekly information consumption patterns and generates personalized advice, while the Personal Guide Agent recommends a presentation mode according to consumer sensitivity. Experiments on 800 AG News articles demonstrated significant stimulus score reduction (up to 19.3%) and improved emotion balance while maintaining semantic preservation. Near-zero correlation between stimulus reduction and semantic preservation confirmed that the two are independently controllable. Category-level analysis revealed substantial reduction (17.8-33.8%) in Sports, Business, and Sci/Tech, whereas the effect was limited in the World category, where facts themselves are inherently high-stimulus. The proposed system provides a framework for supporting calm information reception of consumers without restricting access to the original text.
[AI-24] Automated Vulnerability Detection in Source Code Using Deep Representation Learning
【速读】:该论文旨在解决软件漏洞(Software Vulnerabilities)检测中准确率与召回率难以兼顾的问题,尤其是在C语言代码中识别潜在安全缺陷的挑战。其解决方案的关键在于构建一个针对C语言优化的卷积神经网络(Convolutional Neural Network, CNN)模型,该模型利用两个互补的数据集进行训练:由Draper Labs通过静态分析工具标注的机器标签数据集和NIST SATE Juliet人工标注数据集。通过对输入代码进行91类token化并转换为二进制向量以节省内存,模型设计了特定结构的卷积层(首层覆盖完整token编码)与池化层,并结合全连接层实现分类,最终在保持高精度的前提下显著提升了召回率(recall),并在Linux内核等复杂代码中验证了其对真实漏洞的低误报率发现能力。
链接: https://arxiv.org/abs/2602.23121
作者: C. Seas,G. Fitzpatrick,J. A. Hamilton,M. C. Carlisle
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Each year, software vulnerabilities are discovered, which pose significant risks of exploitation and system compromise. We present a convolutional neural network model that can successfully identify bugs in C code. We trained our model using two complementary datasets: a machine-labeled dataset created by Draper Labs using three static analyzers and the NIST SATE Juliet human-labeled dataset designed for testing static analyzers. In contrast with the work of Russell et al. on these datasets, we focus on C programs, enabling us to specialize and optimize our detection techniques for this language. After removing duplicates from the dataset, we tokenize the input into 91 token categories. The category values are converted to a binary vector to save memory. Our first convolution layer is chosen so that the entire encoding of the token is presented to the filter. We use two convolution and pooling layers followed by two fully connected layers to classify programs into either a common weakness enumeration category or as ``clean.‘’ We obtain higher recall than prior work by Russell et al. on this dataset when requiring high precision. We also demonstrate on a custom Linux kernel dataset that we are able to find real vulnerabilities in complex code with a low false-positive rate.
[AI-25] hree AI-agents walk into a bar . . . . `Lord of the Flies tribalism emerges among smart AI-Agents
【速读】:该论文试图解决的问题是:在近未来基础设施系统中,由自主AI代理(AI-agents)反复争夺有限资源(如能源、带宽或计算能力)时,如何避免因代理间无序竞争导致的系统性失败。解决方案的关键在于识别出AI代理在缺乏协调机制下会自发形成具有不同行为特征的“部落”(tribes),并发现这些部落结构反而加剧了资源滥用和系统失效——尤其当AI代理能力越强时,其群体行为越可能恶化整体性能。研究揭示了生成式AI(Generative AI)代理在分布式决策场景中并非必然提升效率,反而可能因群体演化出对抗性策略而降低系统鲁棒性。
链接: https://arxiv.org/abs/2602.23093
作者: Dhwanil M. Mori,Neil F. Johnson
机构: 未知
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
备注:
Abstract:Near-future infrastructure systems may be controlled by autonomous AI agents that repeatedly request access to limited resources such as energy, bandwidth, or computing power. We study a simplified version of this setting using a framework where N AI-agents independently decide at each round whether to request one unit from a system with fixed capacity C. An AI version of “Lord of the Flies” arises in which controlling tribes emerge with their own collective character and identity. The LLM agents do not reduce overload or improve resource use, and often perform worse than if they were flipping coins to make decisions. Three main tribal types emerge: Aggressive (27.3%), Conservative (24.7%), and Opportunistic (48.1%). The more capable AI-agents actually increase the rate of systemic failure. Overall, our findings show that smarter AI-agents can behave dumber as a result of forming tribes.
[AI-26] Enhancing CVRP Solver through LLM -driven Automatic Heuristic Design
【速读】:该论文旨在解决车辆路径问题(Vehicle Routing Problem, VRP)中的容量约束版本(Capacitated Vehicle Routing Problem, CVRP),这是一个经典的组合优化难题,因其NP-hard特性在大规模实例上仍面临显著的计算挑战。解决方案的关键在于提出一种名为AILS-AHD(Adaptive Iterated Local Search with Automatic Heuristic Design)的新方法,其核心创新是利用大语言模型(Large Language Models, LLMs)自动设计并优化 ruin 启发式策略,嵌入到自适应迭代局部搜索框架中,同时引入基于LLM的加速机制以提升计算效率。实验表明,该方法在中等和大规模实例上均优于现有最优求解器,且在CVRPLib基准测试中为8个实例建立了新的最优解记录。
链接: https://arxiv.org/abs/2602.23092
作者: Zhuoliang Xie,Fei Liu,Zhenkun Wang,Qingfu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.
[AI-27] Learning-based Multi-agent Race Strategies in Formula 1
【速读】:该论文旨在解决一级方程式赛车(Formula 1)中多智能体 race strategy optimization(赛道策略优化)问题,即如何在动态变化的比赛中协调多个赛车的能源管理、轮胎衰减、气动相互作用及进站决策。解决方案的关键在于构建一个基于强化学习的多智能体框架:首先基于预训练的单智能体策略,引入一个交互模块以建模竞争对手的行为,并结合自对弈(self-play)训练机制生成具有竞争力的策略;该方法通过相对性能评估对智能体进行排序,从而实现对对手行为的适应性响应,最终达成鲁棒且一致的赛道表现。
链接: https://arxiv.org/abs/2602.23056
作者: Giona Fieni,Joschua Wüthrich,Marc-Philippe Neumann,Christopher H. Onder
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:In Formula 1, race strategies are adapted according to evolving race conditions and competitors’ actions. This paper proposes a reinforcement learning approach for multi-agent race strategy optimization. Agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit-stop decisions. Building on a pre-trained single-agent policy, we introduce an interaction module that accounts for the behavior of competitors. The combination of the interaction module and a self-play training scheme generates competitive policies, and agents are ranked based on their relative performance. Results show that the agents adapt pit timing, tire selection, and energy allocation in response to opponents, achieving robust and consistent race performance. Because the framework relies only on information available during real races, it can support race strategists’ decisions before and during races.
[AI-28] LLM ServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)服务基础设施中硬件异构性与系统软件解耦趋势带来的性能建模难题,即如何在异构加速器和分布式内存架构下准确分析软硬件协同运行时的交互行为。现有仿真工具无法在一个统一、以运行时驱动的框架中联合建模硬件多样性与解耦的服务技术(如批处理、路由、卸载、内存管理等),导致设计优化缺乏依据。其解决方案的关键在于提出LLMServingSim 2.0——一个集成式系统级仿真器,通过将服务决策与硬件行为嵌入单一运行时循环,实现对批处理、路由、卸载、内存访问及功耗等关键因素的交互感知建模;同时支持基于性能特征的扩展模块化设计,可灵活接入新兴加速器与内存系统,并保持高精度(平均误差仅0.97%)与高效模拟速度(复杂配置约10分钟),从而为下一代LLM服务基础设施提供可验证、可扩展的软硬件协同设计平台。
链接: https://arxiv.org/abs/2602.23036
作者: Jaehong Cho,Hyunmin Choi,Guseul Heo,Jongse Park
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 12 pages, 10 figures
Abstract:Large language model (LLM) serving infrastructures are undergoing a shift toward heterogeneity and disaggregation. Modern deployments increasingly integrate diverse accelerators and near-memory processing technologies, introducing significant hardware heterogeneity, while system software increasingly separates computation, memory, and model components across distributed resources to improve scalability and efficiency. As a result, LLM serving performance is no longer determined by hardware or software choices in isolation, but by their runtime interaction through scheduling, data movement, and interconnect behavior. However, understanding these interactions remains challenging, as existing simulators lack the ability to jointly model heterogeneous hardware and disaggregated serving techniques within a unified, runtime-driven framework. This paper presents LLMServingSim 2.0, a unified system-level simulator designed to make runtime-driven hardware-software interactions in heterogeneous and disaggregated LLM serving infrastructures explicit and analyzable. LLMServingSim 2.0 embeds serving decisions and hardware behavior into a single runtime loop, enabling interaction-aware modeling of batching, routing, offloading, memory, and power. The simulator supports extensible integration of emerging accelerators and memory systems through profile-based modeling, while capturing dynamic serving behavior and system-level effects. We validate LLMServingSim 2.0 against real deployments, showing that it reproduces key performance, memory, and power metrics with an average error of 0.97%, while maintaining simulation times of around 10 minutes even for complex configurations. These results demonstrate that LLMServingSim 2.0 provides a practical bridge between hardware innovation and serving-system design, enabling systematic exploration and co-design for next-generation LLM serving infrastructures. Comments: 12 pages, 10 figures Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.23036 [cs.DC] (or arXiv:2602.23036v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2602.23036 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-29] Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization ICLR2026
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在强化学习(Reinforcement Learning, RL)训练中面临的探索瓶颈问题,尤其针对需要发现新状态的环境,现有方法因依赖预训练知识而表现不足。解决方案的关键在于提出一种混合强化学习框架——探索性记忆增强的在线与离线策略优化(Exploratory Memory-Augmented On- and Off-Policy Optimization, EMPO²),该框架通过引入记忆机制促进探索,并融合在线与离线策略更新,使LLM在具备记忆时性能优异,且在无记忆条件下仍具鲁棒性。实验表明,EMPO²在ScienceWorld和WebShop任务上分别较GRPO提升128.6%和11.3%,并在分布外测试中展现出卓越的适应能力,仅需少量试错即可完成新任务迁移,无需参数更新。
链接: https://arxiv.org/abs/2602.23008
作者: Zeyuan Liu,Jeonghye Kim,Xufang Luo,Dongsheng Li,Yuqing Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026
Abstract:Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO ^2 ), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO ^2 achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO ^2 demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO ^2 as a promising framework for building more exploratory and generalizable LLM-based agents.
[AI-30] Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability
【速读】:该论文旨在解决Transformer模型在训练过程中因梯度不稳定导致的训练发散(training divergence)问题,这种现象会浪费大量计算资源,而现有方法通常只能在训练后期才发现问题。解决方案的关键在于提出了一种名为残差Koopman谱分析(Residual Koopman Spectral Profiling, RKSP)的新诊断工具,它通过在模型初始化时进行一次前向传播,利用白化动态模态分解(whitened dynamic mode decomposition)提取层间残差快照的Koopman谱特征,并以“近单位谱质量”(near-unit spectral mass)作为核心指标,量化集中在单位圆附近的模式比例,从而准确预测训练发散风险。进一步地,作者提出了Koopman谱重塑(Koopman Spectral Shaping, KSS)机制,在训练中动态调整谱分布以抑制不稳定性,实验证明其可显著降低发散率并支持更高学习率,且适用于多种架构与任务场景。
链接: https://arxiv.org/abs/2602.22988
作者: Bum Jun Kim,Shohei Taniguchi,Makoto Kawano,Yusuke Iwasawa,Yutaka Matsuo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 7 figures
Abstract:Training divergence in transformers wastes compute, yet practitioners discover instability only after expensive runs begin. They therefore need an expected probability of failure for a transformer before training starts. Our study of Residual Koopman Spectral Profiling (RKSP) provides such an estimate. From a single forward pass at initialization, RKSP extracts Koopman spectral features by applying whitened dynamic mode decomposition to layer-wise residual snapshots. Our central diagnostic, the near-unit spectral mass, quantifies the fraction of modes concentrated near the unit circle, which captures instability risk. For predicting divergence across extensive configurations, this estimator achieves an AUROC of 0.995, outperforming the best gradient baseline. We further make this diagnostic actionable through Koopman Spectral Shaping (KSS), which reshapes spectra during training. We empirically validate that our method works in practice: RKSP predicts divergence at initialization, and when RKSP flags high risk, turning on KSS successfully prevents divergence. In the challenging high learning rate regime without normalization layers, KSS reduces the divergence rate from 66.7% to 12.5% and enables learning rates that are 50% to 150% higher. These findings generalize to WikiText-103 language modeling, vision transformers on CIFAR-10, and pretrained language models, including GPT-2 and LLaMA-2 up to 7B, as well as emerging architectures such as MoE, Mamba-style SSMs, and KAN.
[AI-31] Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对越狱攻击(jailbreak attacks)时表现出的显著安全脆弱性问题,尤其是针对不同语言语境下攻击效果差异较大的挑战。其核心解决方案是提出了一种名为CC-BOS的自动化生成古典中文对抗提示的框架,利用多维果蝇优化算法实现高效搜索与迭代优化。关键创新在于将提示编码为八个策略维度(角色、行为、机制、隐喻、表达、知识、触发模式和上下文),并通过嗅觉搜索、视觉搜索及柯西变异进行精细化调整,从而在黑盒环境下有效突破模型的安全约束,同时设计了古典中文到英文的翻译模块以提升可读性和评估准确性。实验表明,该方法在攻击有效性上持续优于现有最先进方法。
链接: https://arxiv.org/abs/2602.22983
作者: Xun Huang,Simeng Qin,Xiaoshuang Jia,Ranjie Duan,Huanqian Yan,Zhitao Zeng,Fei Yang,Yang Liu,Xiaojun Jia
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:As Large Language Models (LLMs) are increasingly used, their security risks have drawn increasing attention. Existing research reveals that LLMs are highly susceptible to jailbreak attacks, with effectiveness varying across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, this paper proposes a framework, CC-BOS, for the automatic generation of classical Chinese adversarial prompts based on multi-dimensional fruit fly optimization, facilitating efficient and automated jailbreak attacks in black-box settings. Prompts are encoded into eight policy dimensions-covering role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern and context; and iteratively refined via smell search, visual search, and cauchy mutation. This design enables efficient exploration of the search space, thereby enhancing the effectiveness of black-box jailbreak attacks. To enhance readability and evaluation accuracy, we further design a classical Chinese to English translation module. Extensive experiments demonstrate that effectiveness of the proposed CC-BOS, consistently outperforming state-of-the-art jailbreak attack methods.
[AI-32] RepSPD: Enhancing SPD Manifold Representation in EEGs via Dynamic Graphs
【速读】:该论文旨在解决当前基于对称正定(Symmetric Positive Definite, SPD)矩阵的脑电图(EEG)分析方法中忽视频域特定同步性和脑区局部拓扑结构的问题。现有方法主要依赖于统计聚合,难以捕捉功能连接的细节和几何特性。其解决方案的关键在于提出RepSPD模型,该模型在黎曼流形上引入交叉注意力机制,以调制SPD矩阵的几何属性,并结合图结构导出的功能连接特征;同时设计全局双向对齐策略重塑切空间嵌入,缓解由曲率引起的几何失真,从而提升表示的几何一致性与鲁棒性。
链接: https://arxiv.org/abs/2602.22981
作者: Haohui Jia,Zheng Chen,Lingwei Zhu,Xu Cao,Yasuko Matsubara,Takashi Matsubara,Yasushi Sakurai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Decoding brain activity from electroencephalography (EEG) is crucial for neuroscience and clinical applications. Among recent advances in deep learning for EEG, geometric learning stands out as its theoretical underpinnings on symmetric positive definite (SPD) allows revealing structural connectivity analysis in a physics-grounded manner. However, current SPD-based methods focus predominantly on statistical aggregation of EEGs, with frequency-specific synchronization and local topological structures of brain regions neglected. Given this, we propose RepSPD, a novel geometric deep learning (GDL)-based model. RepSPD implements a cross-attention mechanism on the Riemannian manifold to modulate the geometric attributes of SPD with graph-derived functional connectivity features. On top of this, we introduce a global bidirectional alignment strategy to reshape tangent-space embeddings, mitigating geometric distortions caused by curvature and thereby enhancing geometric consistency. Extensive experiments demonstrate that our proposed framework significantly outperforms existing EEG representation methods, exhibiting superior robustness and generalization capabilities.
[AI-33] Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots
【速读】:该论文旨在解决安全关键型临床人工智能(AI)中专家验证与模型推理之间过渡过程缺乏结构化分析的问题,即如何更准确地量化AI输出与医生最终诊断之间的对齐程度。其解决方案的关键在于提出了一种诊断对齐框架(diagnostic alignment framework),将AI生成的影像报告作为不可变的推理状态保留,并通过四层一致性评估体系(包括精确主匹配率、语义相似度调整率、跨类别对齐及综合一致性率)系统性比较AI结果与医师验证结果;同时,在推理流程中引入视觉增强的大语言模型、基于BERT的医学实体提取和顺序语言模型推理(Sequential Language Model Inference, SLMI)步骤以确保领域一致性优化,从而实现可追溯、人类对齐的图像驱动临床决策支持系统的评估。
链接: https://arxiv.org/abs/2602.22973
作者: Dimitrios P. Panagoulias,Evangelia-Aikaterini Tsichrintzi,Georgios Savvidis,Evridiki Tsoureli-Nikita
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Human-in-the-loop validation is essential in safety-critical clinical AI, yet the transition between initial model inference and expert correction is rarely analyzed as a structured signal. We introduce a diagnostic alignment framework in which the AI-generated image based report is preserved as an immutable inference state and systematically compared with the physician-validated outcome. The inference pipeline integrates a vision-enabled large language model, BERT- based medical entity extraction, and a Sequential Language Model Inference (SLMI) step to enforce domain-consistent refinement prior to expert review. Evaluation on 21 dermatological cases (21 complete AI physician pairs) em- ployed a four-level concordance framework comprising exact primary match rate (PMR), semantic similarity-adjusted rate (AMR), cross-category alignment, and Comprehensive Concordance Rate (CCR). Exact agreement reached 71.4% and remained unchanged under semantic similarity (t = 0.60), while structured cross-category and differential overlap analysis yielded 100% comprehensive concordance (95% CI: [83.9%, 100%]). No cases demonstrated complete diagnostic divergence. These findings show that binary lexical evaluation substantially un- derestimates clinically meaningful alignment. Modeling expert validation as a structured transformation enables signal-aware quantification of correction dynamics and supports traceable, human aligned evaluation of image based clinical decision support systems.
[AI-34] SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在扫描探针显微镜(Scanning Probe Microscopy, SPM)等专业科学领域中表现不足的问题,其根源在于现有基准测试存在数据污染、复杂度不足以及人工标注成本过高等缺陷。解决方案的关键在于提出SPM-Bench这一原创性、博士级多模态基准,并构建一个全自动的数据合成流水线:通过Anchor-Gated Sieve(AGS)技术从2023至2025年arXiv和期刊论文中高效提取高质量图像-文本对;结合混合云-本地架构,仅由视觉语言模型(Vision-Language Models, VLMs)输出空间坐标“llbox”,实现局部高保真裁剪与极低的token消耗,同时保障数据纯净度;此外,引入严格不完美惩罚F1(Strict Imperfection Penalty F1, SIP-F1)评分机制,不仅建立模型能力层级,还首次量化模型“性格”(保守型、激进型、赌徒型或智慧型),从而揭示当前AI在复杂物理场景下的真实推理边界。
链接: https://arxiv.org/abs/2602.22971
作者: Peiyao Xiao,Xiaogang Li,Chengliang Xu,Jiayi Wang,Ben Wang,Zichao Chen,Zeyu Wang,Kejun Yu,Yueqian Chen,Xulin Liu,Wende Xiao,Bing Zhao,Hu Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As LLMs achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks due to data contamination, insufficient complexity, and prohibitive human labor costs. Here we present SPM-Bench, an original, PhD-level multimodal benchmark specifically designed for scanning probe microscopy (SPM). We propose a fully automated data synthesis pipeline that ensures both high authority and low-cost. By employing Anchor-Gated Sieve (AGS) technology, we efficiently extract high-value image-text pairs from arXiv and journal papers published between 2023 and 2025. Through a hybrid cloud-local architecture where VLMs return only spatial coordinates “llbox” for local high-fidelity cropping, our pipeline achieves extreme token savings while maintaining high dataset purity. To accurately and objectively evaluate the performance of the LLMs, we introduce the Strict Imperfection Penalty F1 (SIP-F1) score. This metric not only establishes a rigorous capability hierarchy but also, for the first time, quantifies model “personalities” (Conservative, Aggressive, Gambler, or Wise). By correlating these results with model-reported confidence and perceived difficulty, we expose the true reasoning boundaries of current AI in complex physical scenarios. These insights establish SPM-Bench as a generalizable paradigm for automated scientific data synthesis.
[AI-35] FactGuard: Agent ic Video Misinformation Detection via Reinforcement Learning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频虚假信息检测中因固定深度推理和过度依赖内部假设而导致的性能瓶颈,尤其是在关键证据稀疏、碎片化或需外部验证的场景下。解决方案的关键在于提出FactGuard框架,该框架将验证过程建模为基于MLLMs的迭代推理流程,通过显式评估任务模糊性并选择性调用外部工具获取关键证据,从而实现推理路径的逐步优化;同时引入两阶段训练策略——领域特定的代理监督微调与决策感知强化学习,以优化工具使用效率并校准风险敏感型决策机制,显著提升模型的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2602.22963
作者: Zehao Li,Hongwei Yu,Hao Jiang,Qiang Sheng,Yilong Xu,Baolong Bi,Yang Li,Zhenlong Yuan,Yujun Cai,Zhaoqi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) have substantially advanced video misinformation detection through unified multimodal reasoning, but they often rely on fixed-depth inference and place excessive trust in internally generated assumptions, particularly in scenarios where critical evidence is sparse, fragmented, or requires external verification. To address these limitations, we propose FactGuard, an agentic framework for video misinformation detection that formulates verification as an iterative reasoning process built upon MLLMs. FactGuard explicitly assesses task ambiguity and selectively invokes external tools to acquire critical evidence, enabling progressive refinement of reasoning trajectories. To further strengthen this capability, we introduce a two-stage training strategy that combines domain-specific agentic supervised fine-tuning with decision-aware reinforcement learning to optimize tool usage and calibrate risk-sensitive decision making. Extensive experiments on FakeSV, FakeTT, and FakeVV demonstrate FactGuard’s state-of-the-art performance and validate its excellent robustness and generalization capacity.
[AI-36] General Agent Evaluation
【速读】:该论文旨在解决当前通用型智能体(General-Purpose Agents)缺乏系统性评估方法的问题。现有评估基准多假设特定领域集成,导致无法公平衡量通用智能体的跨环境适应能力。其解决方案的关键在于提出三个核心要素:一是定义通用智能体评估的概念原则;二是设计统一协议(Unified Protocol),实现智能体与基准测试环境的标准化集成;三是构建Exgentic框架,用于实际部署和量化评估。通过这三项创新,作者首次在六个不同环境中对五种主流智能体进行了无领域特化调优的基准测试,验证了通用智能体具备可媲美专用智能体的泛化性能,为后续系统性研究奠定了基础。
链接: https://arxiv.org/abs/2602.22953
作者: Elron Bandel,Asaf Yehudai,Lilach Eden,Yehoshua Sagron,Yotam Perlitz,Elad Venezian,Natalia Razinkov,Natan Ergas,Shlomit Shachor Ifergan,Segev Shlomov,Michal Jacovi,Leshem Choshen,Liat Ein-Dor,Yoav Katz,Michal Shmueli-Scheuer
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The promise of general-purpose agents - systems that perform tasks in unfamiliar environments without domain-specific engineering - remains largely unrealized. Existing agents are predominantly specialized, and while emerging implementations like OpenAI SDK Agent and Claude Code hint at broader capabilities, no systematic evaluation of their general performance has been pursued. Current agentic benchmarks assume domain-specific integration, encoding task information in ways that preclude fair evaluation of general agents. This paper frames general-agent evaluation as a first-class research objective. We propose conceptual principles for such evaluation, a Unified Protocol enabling agent-benchmark integration, and Exgentic - a practical framework for general agent evaluation. We benchmark five prominent agent implementations across six environments as the first Open General Agent Leaderboard. Our experiments show that general agents generalize across diverse environments, achieving performance comparable to domain-specific agents without any environment-specific tuning. We release our evaluation protocol, framework, and leaderboard to establish a foundation for systematic research on general-purpose agents.
[AI-37] A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment
【速读】:该论文旨在解决Bangla语言在自然语言处理(Natural Language Processing, NLP)领域中资源匮乏的问题,特别是在长音频(超过3060秒)场景下,主流自动语音识别(Automatic Speech Recognition, ASR)和说话人分割(Speaker Diarization)系统性能显著下降的问题。解决方案的关键在于构建一个针对长时Bangla内容优化的鲁棒框架,核心包括:通过语音活动检测(Voice Activity Detection, VAD)优化与基于强制词对齐的连接时序分类(Connectionist Temporal Classification, CTC)分割策略,确保长时间音频中的时间精度和转录完整性;同时结合数据增强、噪声去除及多种微调技术提升模型在复杂多说话人环境下的泛化能力,从而实现可扩展的现实世界长时Bangla语音应用。
链接: https://arxiv.org/abs/2602.22935
作者: Zarif Ishmam,Zarif Mahir,Shafnan Wasif,Md. Ishtiak Moin
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages
Abstract:Despite being one of the most widely spoken languages globally, Bangla remains a low-resource language in the field of Natural Language Processing (NLP). Mainstream Automatic Speech Recognition (ASR) and Speaker Diarization systems for Bangla struggles when processing longform audio exceeding 3060 seconds. This paper presents a robust framework specifically engineered for extended Bangla content by leveraging preexisting models enhanced with novel optimization pipelines for the DL Sprint 4.0 contest. Our approach utilizes Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation via forced word alignment to maintain temporal accuracy and transcription integrity over long durations. Additionally, we employed several finetuning techniques and preprocessed the data using augmentation techniques and noise removal. By bridging the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, longform Bangla speech applications.
[AI-38] owards LLM -Empowered Knowledge Tracing via LLM -Student Hierarchical Behavior Alignment in Hyperbolic Space AAAI2026
【速读】:该论文旨在解决现有知识追踪(Knowledge Tracing, KT)方法在建模学生认知状态演化和个体化问题难度感知方面的局限性,具体表现为:(1)难以捕捉认知状态的层次演化结构;(2)因语义建模能力有限,无法准确反映学生对题目难度的个性化理解。其解决方案的关键在于提出一种基于大语言模型与双曲空间对齐的知识追踪框架——L-HAKT(Large Language Model Hyperbolic Aligned Knowledge Tracing)。该框架通过教师代理深度解析题目语义并显式构建知识点的层次依赖关系,学生代理模拟学习行为生成合成数据,并在双曲空间中利用对比学习缩小合成数据与真实数据在关键特征(如题目难度、遗忘模式)上的分布差异;最终通过优化双曲空间曲率显式建模知识点的树状层次结构,精确刻画不同层级知识点的学习曲线形态差异。
链接: https://arxiv.org/abs/2602.22879
作者: Xingcheng Fu,Shengpeng Wang,Yisen Gao,Xianxian Li,Chunpei Li,Qingyun Sun,Dongran Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures, Accepted to AAAI 2026
Abstract:Knowledge Tracing (KT) diagnoses students’ concept mastery through continuous learning state monitoring in this http URL methods primarily focus on studying behavioral sequences based on ID or textual this http URL existing methods rely on ID-based sequences or shallow textual features, they often fail to capture (1) the hierarchical evolution of cognitive states and (2) individualized problem difficulty perception due to limited semantic modeling. Therefore, this paper proposes a Large Language Model Hyperbolic Aligned Knowledge Tracing(L-HAKT). First, the teacher agent deeply parses question semantics and explicitly constructs hierarchical dependencies of knowledge points; the student agent simulates learning behaviors to generate synthetic data. Then, contrastive learning is performed between synthetic and real data in hyperbolic space to reduce distribution differences in key features such as question difficulty and forgetting patterns. Finally, by optimizing hyperbolic curvature, we explicitly model the tree-like hierarchical structure of knowledge points, precisely characterizing differences in learning curve morphology for knowledge points at different levels. Extensive experiments on four real-world educational datasets validate the effectiveness of our Large Language Model Hyperbolic Aligned Knowledge Tracing (L-HAKT) framework.
[AI-39] MEDNA-DFM: A Dual-View FiLM-MoE Model for Explainable DNA Methylation Prediction
【速读】:该论文旨在解决DNA甲基化(DNA methylation)计算识别中深度学习模型“黑箱”特性导致生物机制难以解释的问题。其关键解决方案是提出一个高性能模型MEDNA-DFM,并结合机制启发式的信号净化算法,从而不仅实现跨物种的稳定区分能力,还通过提取具有更高可靠性的保守序列基序(如GC含量等内在特征),揭示了甲基化识别背后的生物学规律。进一步地,基于果蝇6mA案例研究,该工作提出了“序列-结构协同作用”假说,并通过体外突变实验验证了GAGG核心基序与上游A-tract元件的协同功能,表明该方法能驱动从预测工具到新生物学假设生成的双重创新。
链接: https://arxiv.org/abs/2602.22850
作者: Yi He(1 and 4),Yina Cao(2),Jixiu Zhai(3 and 4),Di Wang(1 and 4),Junxiao Kong(4),Tianchi Lu(4 and 5) ((1) Cuiying Honors College, Lanzhou University, Lanzhou, Gansu, China, (2) School of Management, Lanzhou University, Lanzhou, Gansu, China, (3) Shanghai Innovation Institute, Shanghai, China, (4) School of Mathematics and Statistics, Lanzhou University, Lanzhou, Gansu, China, (5) Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate computational identification of DNA methylation is essential for understanding epigenetic regulation. Although deep learning excels in this binary classification task, its “black-box” nature impedes biological insight. We address this by introducing a high-performance model MEDNA-DFM, alongside mechanism-inspired signal purification algorithms. Our investigation demonstrates that MEDNA-DFM effectively captures conserved methylation patterns, achieving robust distinction across diverse species. Validation on external independent datasets confirms that the model’s generalization is driven by conserved intrinsic motifs (e.g., GC content) rather than phylogenetic proximity. Furthermore, applying our developed algorithms extracted motifs with significantly higher reliability than prior studies. Finally, empirical evidence from a Drosophila 6mA case study prompted us to propose a “sequence-structure synergy” hypothesis, suggesting that the GAGG core motif and an upstream A-tract element function cooperatively. We further validated this hypothesis via in silico mutagenesis, confirming that the ablation of either or both elements significantly degrades the model’s recognition capabilities. This work provides a powerful tool for methylation prediction and demonstrates how explainable deep learning can drive both methodological innovation and the generation of biological hypotheses.
[AI-40] Decentralized Ranking Aggregation: Gossip Algorithms for Borda and Copeland Consensus
【速读】:该论文旨在解决在去中心化场景下如何可靠地计算集体排名共识的问题,即当偏好数据分散在网络中的多个节点上时,如何利用局部交互实现全局排名一致性,同时保证鲁棒性和可扩展性。其关键解决方案是基于随机泛洪(random gossip)通信机制,使自主代理仅通过本地信息交换即可完成对Borda和Copeland等经典排序规则的共识计算,无需中央协调或全局控制,从而在理论上提供收敛性保障(包括显式的收敛速率边界),并进一步拓展至中位排名规则(median rank rule)与局部Kemenization(local Kemenization)的去中心化实现。
链接: https://arxiv.org/abs/2602.22847
作者: Anna Van Elst,Kerrian Le Caillec,Igor Colin,Stephan Clémençon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 8 pages, 2 figures
Abstract:The concept of ranking aggregation plays a central role in preference analysis, and numerous algorithms for calculating median rankings, often originating in social choice theory, have been documented in the literature, offering theoretical guarantees in a centralized setting, i.e., when all the ranking data to be aggregated can be brought together in a single computing unit. For many technologies (e.g. peer-to-peer networks, IoT, multi-agent systems), extending the ability to calculate consensus rankings with guarantees in a decentralized setting, i.e., when preference data is initially distributed across a communicating network, remains a major methodological challenge. Indeed, in recent years, the literature on decentralized computation has mainly focused on computing or optimizing statistics such as arithmetic means using gossip algorithms. The purpose of this article is precisely to study how to achieve reliable consensus on collective rankings using classical rules (e.g. Borda, Copeland) in a decentralized setting, thereby raising new questions, robustness to corrupted nodes, and scalability through reduced communication costs in particular. The approach proposed and analyzed here relies on random gossip communication, allowing autonomous agents to compute global ranking consensus using only local interactions, without coordination or central authority. We provide rigorous convergence guarantees, including explicit rate bounds, for the Borda and Copeland consensus methods. Beyond these rules, we also provide a decentralized implementation of consensus according to the median rank rule and local Kemenization. Extensive empirical evaluations on various network topologies and real and synthetic ranking datasets demonstrate that our algorithms converge quickly and reliably to the correct ranking aggregation. Comments: 8 pages, 2 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML) Cite as: arXiv:2602.22847 [cs.LG] (or arXiv:2602.22847v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.22847 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-41] he AI Research Assistant: Promise Peril and a Proof of Concept
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在创造性数学研究中的作用问题,即AI是否能够超越自动化计算,真正推动数学发现。其解决方案的关键在于通过系统性的人机协作实践,验证了AI在代数运算、证明探索、文献整合及LaTeX排版等方面的强大辅助能力,同时强调所有步骤均需人类进行严格验证与数学直觉引导,从而揭示出成功的数学人机合作模式与潜在失败机制。研究表明,当结合适当质疑态度和验证流程时,AI可显著加速数学发现过程,但必须依赖深度领域知识和审慎的人类监督。
链接: https://arxiv.org/abs/2602.22842
作者: Tan Bui-Thanh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
备注: 11 pages, 1 figure
Abstract:Can artificial intelligence truly contribute to creative mathematical research, or does it merely automate routine calculations while introducing risks of error? We provide empirical evidence through a detailed case study: the discovery of novel error representations and bounds for Hermite quadrature rules via systematic human-AI collaboration. Working with multiple AI assistants, we extended results beyond what manual work achieved, formulating and proving several theorems with AI assistance. The collaboration revealed both remarkable capabilities and critical limitations. AI excelled at algebraic manipulation, systematic proof exploration, literature synthesis, and LaTeX preparation. However, every step required rigorous human verification, mathematical intuition for problem formulation, and strategic direction. We document the complete research workflow with unusual transparency, revealing patterns in successful human-AI mathematical collaboration and identifying failure modes researchers must anticipate. Our experience suggests that, when used with appropriate skepticism and verification protocols, AI tools can meaningfully accelerate mathematical discovery while demanding careful human oversight and deep domain expertise. Comments: 11 pages, 1 figure Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA) MSC classes: 65D32, 65D30, 68T01 Cite as: arXiv:2602.22842 [cs.AI] (or arXiv:2602.22842v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.22842 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-42] DeepPresenter: Environment-Grounded Reflection for Agent ic Presentation Generation
【速读】:该论文旨在解决现有演示文稿生成代理(presentation agent)依赖预定义工作流和固定模板,难以适应多样化用户意图、缺乏有效反馈驱动迭代优化以及无法超越脚本化流程的问题。其核心解决方案是提出 DeepPresenter,一个基于代理的框架,能够自主规划、渲染并修订中间幻灯片产物,以支持基于环境观察的长周期迭代优化;关键创新在于采用环境感知的反思机制,即通过感知幻灯片等可视化产物的状态而非依赖内部推理轨迹进行自我修正,从而在执行过程中识别并纠正与演示文稿相关的具体问题。
链接: https://arxiv.org/abs/2602.22839
作者: Hao Zheng,Guozhao Mo,Xinru Yan,Qianhao Yuan,Wenkai Zhang,Xuanang Chen,Yaojie Lu,Hongyu Lin,Xianpei Han,Le Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Presentation generation requires deep content research, coherent visual design, and iterative refinement based on observation. However, existing presentation agents often rely on predefined workflows and fixed templates. To address this, we present DeepPresenter, an agentic framework that adapts to diverse user intents, enables effective feedback-driven refinement, and generalizes beyond a scripted pipeline. Specifically, DeepPresenter autonomously plans, renders, and revises intermediate slide artifacts to support long-horizon refinement with environmental observations. Furthermore, rather than relying on self-reflection over internal signals (e.g., reasoning traces), our environment-grounded reflection conditions the generation process on perceptual artifact states (e.g., rendered slides), enabling the system to identify and correct presentation-specific issues during execution. Results on the evaluation set covering diverse presentation-generation scenarios show that DeepPresenter achieves state-of-the-art performance, and the fine-tuned 9B model remains highly competitive at substantially lower cost. Our project is available at: this https URL
[AI-43] FlexMS is a flexible framework for benchmarking deep learning-based mass spectrum prediction tools in metabolomics
【速读】:该论文旨在解决化学分子识别与性质预测中因实验质谱数据稀缺而导致的分子结构解析困难问题,以及深度学习模型在质谱预测任务中因方法异质性和缺乏统一评估基准而难以进行系统性比较的问题。其解决方案的关键在于构建了一个名为FlexMS的基准框架,该框架支持灵活动态地组合多种模型架构,并在预处理后的公开数据集上使用多样化指标进行性能评估;同时通过分析结构多样性、超参数设置、预训练效果、元数据消融及跨域迁移学习等关键因素,为模型选择提供实证指导,并引入检索基准以模拟实际分子识别场景,基于预测质谱对潜在匹配项进行评分。
链接: https://arxiv.org/abs/2602.22822
作者: Yunhua Zhong,Yixuan Tang,Yifan Li,Jie Yang,Pan Liu,Jun Xia
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, preprint version
Abstract:The identification and property prediction of chemical molecules is of central importance in the advancement of drug discovery and material science, where the tandem mass spectrometry technology gives valuable fragmentation cues in the form of mass-to-charge ratio peaks. However, the lack of experimental spectra hinders the attachment of each molecular identification, and thus urges the establishment of prediction approaches for computational models. Deep learning models appear promising for predicting molecular structure spectra, but overall assessment remains challenging as a result of the heterogeneity in methods and the lack of well-defined benchmarks. To address this, our contribution is the creation of benchmark framework FlexMS for constructing and evaluating diverse model architectures in mass spectrum prediction. With its easy-to-use flexibility, FlexMS supports the dynamic construction of numerous distinct combinations of model architectures, while assessing their performance on preprocessed public datasets using different metrics. In this paper, we provide insights into factors influencing performance, including the structural diversity of datasets, hyperparameters like learning rate and data sparsity, pretraining effects, metadata ablation settings and cross-domain transfer learning analysis. This provides practical guidance in choosing suitable models. Moreover, retrieval benchmarks simulate practical identification scenarios and score potential matches based on predicted spectra.
[AI-44] Hierarchy-of-Groups Policy Optimization for Long-Horizon Agent ic Tasks ICLR2026
【速读】:该论文旨在解决基于组的强化学习(Group-based Reinforcement Learning, RL)在长时程代理任务中,由于步骤级相对优势估计存在的“上下文不一致”(context inconsistency)问题,导致优势估计严重偏倚、进而损害策略优化效果的问题。其解决方案的关键在于提出层次化分组策略优化(Hierarchy-of-Groups Policy Optimization, HGPO),通过将每个轨迹步骤依据历史上下文一致性分配到多个层级组中,对每个步骤在不同组内分别计算优势值,并采用自适应加权机制进行聚合,从而在不引入额外模型或采样开销的前提下,实现更优的偏差-方差权衡,显著提升长时程代理任务中的策略性能。
链接: https://arxiv.org/abs/2602.22817
作者: Shuo He,Lang Feng,Qi Wei,Xin Cheng,Lei Feng,Bo An
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026
Abstract:Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we reveal that this issue can lead to severely biased advantage estimation, thereby degrading policy optimization significantly. To address the issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO assigns each step to multiple hierarchical groups according to the consistency of historical contexts. Then, for each step, HGPO computes distinct advantages within each group and aggregates them with an adaptive weighting scheme. In this way, HGPO can achieve a favorable bias-variance trade-off in stepwise advantage estimation, without extra models or rollouts. Evaluations on two challenging agentic tasks, ALFWorld and WebShop with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct, show that HGPO significantly outperforms existing agentic RL methods under the same computational constraints. Code is available at this https URL.
[AI-45] MiroFlow: Towards High-Performance and Robust Open-Source Agent Framework for General Deep Research Tasks
【速读】:该论文旨在解决当前独立大语言模型(Large Language Models, LLMs)在处理现实世界复杂任务时能力趋于瓶颈的问题,这些问题通常涉及与外部工具和动态环境的交互。现有代理框架虽尝试通过工具集成与外部交互提升自主性,但仍存在工作流设计粗放、性能不稳定、跨基准支持有限以及对昂贵商业API依赖过重等缺陷。其解决方案的关键在于提出一个高性能且鲁棒的开源代理框架MiroFlow,该框架包含三个核心组件:用于灵活编排的代理图(agent graph)、可选的深度推理模式以增强性能,以及稳健的工作流执行机制以保障稳定性和可复现性。实验证明,MiroFlow在多个代理基准测试中(如GAIA、BrowseComp-EN/ZH、HLE、xBench-DeepSearch及FutureX)均达到最先进水平,为研究社区提供了一个易获取、可复现且可比的基线。
链接: https://arxiv.org/abs/2602.22808
作者: Shiqian Su,Sen Xing,Xuan Dong,Muyan Zhong,Bin Wang,Xizhou Zhu,Yuntao Chen,Wenhai Wang,Yue Deng,Pengxiang Zhu,Ziyuan Liu,Tiantong Li,Jiaheng Yu,Zhe Chen,Lidong Bing,Jifeng Dai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the remarkable progress of large language models (LLMs), the capabilities of standalone LLMs have begun to plateau when tackling real-world, complex tasks that require interaction with external tools and dynamic environments. Although recent agent frameworks aim to enhance model autonomy through tool integration and external interaction, they still suffer from naive workflows, unstable performance, limited support across diverse benchmarks and tasks, and heavy reliance on costly commercial APIs. In this work, we propose a high-performance and robust open-source agent framework, termed MiroFlow, which incorporates an agent graph for flexible orchestration, an optional deep reasoning mode to enhance performance, and a robust workflow execution to ensure stable and reproducible performance. Extensive experiments demonstrate that MiroFlow consistently achieves state-of-the-art performance across multiple agent benchmarks, including GAIA, BrowseComp-EN/ZH, HLE, xBench-DeepSearch, and notably FutureX. We hope it could serve as an easily accessible, reproducible, and comparable baseline for the deep research community.
[AI-46] Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving
【速读】:该论文旨在解决扩散模型(Diffusion Models)在端到端自动驾驶(End-to-End Autonomous Driving, E2E AD)中真实场景应用不足的问题,特别是其在大规模、复杂现实道路环境下的规划能力尚未被充分挖掘。解决方案的关键在于构建一个基于海量真实车辆数据和道路测试的系统性研究框架——Hyper Diffusion Planner (HDP),通过深入分析扩散损失空间(diffusion loss space)、轨迹表示方式(trajectory representation)以及数据规模扩展策略(data scaling),显著提升规划性能;同时引入强化学习后训练策略以增强学习到的规划器安全性。最终,在6个城市驾驶场景下完成200公里实车测试,相较基线模型实现10倍性能提升,验证了扩散模型作为可扩展、高效的E2E AD规划器的潜力。
链接: https://arxiv.org/abs/2602.22801
作者: Yinan Zheng,Tianyi Tan,Bin Huang,Enguang Liu,Ruiming Liang,Jianlin Zhang,Jianwei Cui,Guang Chen,Kun Ma,Hangjun Ye,Long Chen,Ya-Qin Zhang,Xianyuan Zhan,Jingjing Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Diffusion models have become a popular choice for decision-making tasks in robotics, and more recently, are also being considered for solving autonomous driving tasks. However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings. The full strength of diffusion models for large-scale, complex real-world settings, such as End-to-End Autonomous Driving (E2E AD), remains underexplored. In this study, we conducted a systematic and large-scale investigation to unleash the potential of the diffusion models as planners for E2E AD, based on a tremendous amount of real-vehicle data and road testing. Through comprehensive and carefully controlled studies, we identify key insights into the diffusion loss space, trajectory representation, and data scaling that significantly impact E2E planning performance. Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety of the learned planner. The resulting diffusion-based learning framework, Hyper Diffusion Planner (HDP), is deployed on a real-vehicle platform and evaluated across 6 urban driving scenarios and 200 km of real-world testing, achieving a notable 10x performance improvement over the base model. Our work demonstrates that diffusion models, when properly designed and trained, can serve as effective and scalable E2E AD planners for complex, real-world autonomous driving tasks.
[AI-47] ClinDet-Bench: Beyond Abstention Evaluating Judgment Determinability of LLM s in Clinical Decision-Making
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床决策中面对不完整信息时,难以准确识别结论是否具有确定性(determinability)的问题。现有模型常因过早下结论或过度回避判断而影响患者安全,尽管它们能正确解释临床评分知识并在信息完整时表现良好。解决方案的关键在于构建ClinDet-Bench基准,该基准基于临床评分系统,将不完整信息场景分解为可判定与不可判定条件,要求模型考虑所有关于缺失信息的假设(包括低概率假设),并验证结论在不同假设下是否一致。这一框架能够评估模型对“何时应作出判断、何时应避免判断”的识别能力,从而提升其在医疗等高风险领域中的安全性与可靠性。
链接: https://arxiv.org/abs/2602.22771
作者: Yusuke Watanabe,Yohei Kobashi,Takeshi Kojima,Yusuke Iwasawa,Yasushi Okuno,Yutaka Matsuo
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 17 pages, 3 figures, 10 tables
Abstract:Clinical decisions are often required under incomplete information. Clinical experts must identify whether available information is sufficient for judgment, as both premature conclusion and unnecessary abstention can compromise patient safety. To evaluate this capability of large language models (LLMs), we developed ClinDet-Bench, a benchmark based on clinical scoring systems that decomposes incomplete-information scenarios into determinable and undeterminable conditions. Identifying determinability requires considering all hypotheses about missing information, including unlikely ones, and verifying whether the conclusion holds across them. We find that recent LLMs fail to identify determinability under incomplete information, producing both premature judgments and excessive abstention, despite correctly explaining the underlying scoring knowledge and performing well under complete information. These findings suggest that existing benchmarks are insufficient to evaluate the safety of LLMs in clinical settings. ClinDet-Bench provides a framework for evaluating determinability recognition, leading to appropriate abstention, with potential applicability to medicine and other high-stakes domains, and is publicly available.
[AI-48] AMA-Bench: Evaluating Long-Horizon Memory for Agent ic Applications
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在复杂自主代理(agent)应用中长期记忆能力评估与实现不足的问题。现有基准测试主要聚焦于对话场景下的人机交互,而实际应用中的代理记忆来源于持续的机器生成表示的代理-环境交互流,存在显著差距。为此,作者提出AMA-Bench(Agent Memory with Any length),通过真实世界代理轨迹(配以专家标注问答)和可扩展至任意时长的合成轨迹(配以规则驱动问答)来系统评估长时记忆性能。解决方案的关键在于提出AMA-Agent,其核心创新是引入因果图(causality graph)以建模事件间的因果关系,并结合工具增强检索(tool-augmented retrieval)机制,从而有效缓解传统基于相似性检索导致的信息丢失及因果缺失问题,最终在AMA-Bench上实现57.22%的平均准确率,优于最强基线11.16%。
链接: https://arxiv.org/abs/2602.22769
作者: Yujie Zhao,Boqin Yuan,Junbo Huang,Haocheng Yuan,Zhongming Yu,Haozhou Xu,Lanxiang Hu,Abhilash Shankarampeta,Zimeng Huang,Wentao Ni,Yuandong Tian,Jishen Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.
[AI-49] Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)训练过程中高能耗与碳排放问题,尤其是在可再生能源过剩导致电网弃电(curtailment)的背景下,如何利用这些“免费且清洁”的电力资源进行高效、低碳的模型训练。其解决方案的关键在于设计并实现一个跨地理分布GPU集群的弹性训练系统,该系统能根据区域弃电窗口动态调度训练任务:在单个站点可用时执行本地训练,在多站点协同时通过联邦学习(Federated Learning)框架(如Flower)实现多节点同步,从而在保障模型性能的同时,将运行碳排放降低至单一站点基线的5–12%。
链接: https://arxiv.org/abs/2602.22760
作者: Philipp Wiesner,Soeren Becker,Brett Cornick,Dominik Scheinert,Alexander Acker,Odej Kao
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Technical report
Abstract:Training large language models (LLMs) requires substantial compute and energy. At the same time, renewable energy sources regularly produce more electricity than the grid can absorb, leading to curtailment, the deliberate reduction of clean generation that would otherwise go to waste. These periods represent an opportunity: if training is aligned with curtailment windows, LLMs can be pretrained using electricity that is both clean and cheap. This technical report presents a system that performs full-parameter LLM training across geo-distributed GPU clusters during regional curtailment windows, elastically switching between local single-site training and federated multi-site synchronization as sites become available or unavailable. Our prototype trains a 561M-parameter transformer model across three clusters using the Flower federated learning framework, with curtailment periods derived from real-world marginal carbon intensity traces. Preliminary results show that curtailment-aware scheduling preserves training quality while reducing operational emissions to 5-12% of single-site baselines.
[AI-50] Decomposing Physician Disagreement in HealthBench
【速读】:该论文旨在解决医学人工智能(Medical AI)评估中医生判断不一致的问题,核心目标是识别导致评价差异的来源并探索可干预因素。研究表明,医生身份和评分标准(rubric)仅解释了少量分歧方差(分别占2.4%和3.6–6.9%),而高达81.8%的案例级残差差异无法通过现有元数据、专业领域、表面特征分诊或嵌入表示降低;进一步发现,医生对边界案例(completion quality)的判断呈现倒U型分布(AUC = 0.689),说明清晰优劣输出易达成共识,模糊情况则分歧显著。关键突破在于区分“可减少的不确定性”(如信息缺失、表述模糊)与“不可减少的不确定性”(真实医学歧义):前者使分歧几率提升近两倍(OR = 2.55, p < 10⁻²⁴),后者无影响(OR = 1.01, p = 0.90),尽管前者仅解释约3%的总方差。因此,解决方案的关键在于优化评估设计以填补信息缺口,从而在非结构性歧义场景下有效降低评价分歧。
链接: https://arxiv.org/abs/2602.22758
作者: Satya Borgohain,Roy Mariathas
机构: 未知
类目: Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:
Abstract:We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance; physician identity accounts for just 2.4%. The dominant 81.8% case-level residual is not reduced by HealthBench’s metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485). Disagreement follows an inverted-U with completion quality (AUC = 0.689), confirming physicians agree on clearly good or bad outputs but split on borderline cases. Physician-validated uncertainty categories reveal that reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p 10^(-24)), while irreducible uncertainty (genuine medical ambiguity) has no effect (OR = 1.01, p = 0.90), though even the former explains only ~3% of total variance. The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity does not, pointing toward actionable evaluation design improvements.
[AI-51] Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)训练中存在“不确定性-奖励不匹配”(uncertainty-reward mismatch)的问题,即现有方法仅依赖二元正确性信号,忽略了模型内在不确定性,导致策略无法实现“知其所知”(Know What You Know),阻碍从优化正确答案向优化有效推理路径的转变。解决方案的关键在于提出EGPO(Entropy-guided Policy Optimization),其核心是引入一种无需额外计算开销的熵代理(entropy proxy),通过token级似然估计每样本的内在不确定性,并设计非对称校准机制将该不确定性与外在正确性对齐:保留正确推理的同时,选择性调控过度自信的错误推理,从而实现稳定且具备不确定性感知的策略优化;此外,EGPO还能从原本退化的群体回放(group-based rollouts)中恢复有意义的学习信号,而无需修改验证器或奖励定义,显著提升了推理任务中的性能表现。
链接: https://arxiv.org/abs/2602.22751
作者: Qiannian Zhao,Chen Yang,Jinhao Jing,Yunke Zhang,Xuhui Ren,Lu Yu,Shijie Zhang,Hongzhi Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large reasoning models (LRMs) have emerged as a powerful paradigm for solving complex real-world tasks. In practice, these models are predominantly trained via Reinforcement Learning with Verifiable Rewards (RLVR), yet most existing outcome-only RLVR pipelines rely almost exclusively on a binary correctness signal and largely ignore the model’s intrinsic uncertainty. We term this discrepancy the uncertainty-reward mismatch, under which high- and low-uncertainty solutions are treated equivalently, preventing the policy from “Know What You Know” and impeding the shift from optimizing for correct answers to optimizing effective reasoning paths. This limitation is especially critical in reasoning-centric tasks such as mathematics and question answering, where performance hinges on the quality of the model’s internal reasoning process rather than mere memorization of final answers. To address this, we propose EGPO, a metacognitive entropy calibration framework that explicitly integrates intrinsic uncertainty into RLVR for enhancing LRMs. EGPO estimates per-sample uncertainty using a zero-overhead entropy proxy derived from token-level likelihoods and aligns it with extrinsic correctness through an asymmetric calibration mechanism that preserves correct reasoning while selectively regulating overconfident failures, thereby enabling stable and uncertainty-aware policy optimization. Moreover, EGPO recovers informative learning signals from otherwise degenerate group-based rollouts without modifying the verifier or reward definition. Extensive experiments across multiple benchmarks demonstrate that the proposed EGPO leads to substantial and consistent improvements in reasoning performance, establishing a principled path for advancing LRMs through metacognitive entropy calibration.
[AI-52] Generative Data Transformation: From Mixed to Unified Data WWW’26
【速读】:该论文旨在解决推荐模型在训练数据稀疏和冷启动场景下的性能瓶颈问题,以及现有模型中心范式(model-centric paradigm)因复杂定制化架构难以捕捉跨域非结构化序列依赖关系而导致的泛化能力差与计算资源消耗高的局限性。其解决方案的关键在于提出一种数据中心范式(data-centric framework)——Taesar,通过对比解码机制(contrastive decoding mechanism)将多源辅助域的上下文信息自适应地编码进目标域序列中,从而无需复杂融合架构即可让标准模型学习到精细的跨域序列依赖关系,有效提升推荐效果并实现对多种序列模型的通用适配。
链接: https://arxiv.org/abs/2602.22743
作者: Jiaqing Zhang,Mingjia Yin,Hao Wang,Yuxin Tian,Yuyang Ye,Yawen Li,Wei Guo,Yong Liu,Enhong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by The Web Conference 2026 (WWW '26)
Abstract:Recommendation model performance is intrinsically tied to the quality, volume, and relevance of their training data. To address common challenges like data sparsity and cold start, recent researchs have leveraged data from multiple auxiliary domains to enrich information within the target domain. However, inherent domain gaps can degrade the quality of mixed-domain data, leading to negative transfer and diminished model performance. Existing prevailing \emphmodel-centric paradigm – which relies on complex, customized architectures – struggles to capture the subtle, non-structural sequence dependencies across domains, leading to poor generalization and high demands on computational resources. To address these shortcomings, we propose \textscTaesar, a \emphdata-centric framework for \textbftarget-\textbfalign\textbfed \textbfsequenti\textbfal \textbfregeneration, which employs a contrastive decoding mechanism to adaptively encode cross-domain context into target-domain sequences. It employs contrastive decoding to encode cross-domain context into target sequences, enabling standard models to learn intricate dependencies without complex fusion architectures. Experiments show \textscTaesar outperforms model-centric solutions and generalizes to various sequential models. By generating enriched datasets, \textscTaesar effectively combines the strengths of data- and model-centric paradigms. The code accompanying this paper is available at~ \textcolorbluethis https URL.
[AI-53] AgentS entry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
【速读】:该论文旨在解决工具增强型大语言模型(Large Language Model, LLM)代理在推理阶段面临的间接提示注入攻击(Indirect Prompt Injection, IPI)问题。IPI通过嵌入恶意上下文的工具输出或检索结果,在多轮交互中隐蔽地引导代理偏离用户意图,且因其与合法任务流程交织难以识别。现有防御方法依赖启发式检测和保守阻断高风险操作,易导致工作流提前终止或过度抑制工具使用。论文提出的AgentSentry框架首次将多轮IPI建模为时间因果接管(temporal causal takeover),其核心创新在于:利用工具返回边界处的受控反事实重执行定位攻击接管点,并通过因果引导的上下文净化机制清除攻击诱导偏差,同时保留任务相关证据,从而实现安全续行。实验表明,AgentSentry在AgentDojo基准上显著提升对抗环境下的任务实用性(平均UA达74.55%),优于最强基线20.8至33.6个百分点,且不损害正常场景性能。
链接: https://arxiv.org/abs/2602.22724
作者: Tian Zhang,Yiwei Xu,Juan Wang,Keyan Guo,Xiaoyang Xu,Bowen Xiao,Quanlong Guan,Jinlin Fan,Jiawei Liu,Zhiquan Liu,Hongxin Hu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 23 pages, 8 figures. Under review
Abstract:Large language model (LLM) agents increasingly rely on external tools and retrieval systems to autonomously complete complex tasks. However, this design exposes agents to indirect prompt injection (IPI), where attacker-controlled context embedded in tool outputs or retrieved content silently steers agent actions away from user intent. Unlike prompt-based attacks, IPI unfolds over multi-turn trajectories, making malicious control difficult to disentangle from legitimate task execution. Existing inference-time defenses primarily rely on heuristic detection and conservative blocking of high-risk actions, which can prematurely terminate workflows or broadly suppress tool usage under ambiguous multi-turn scenarios. We propose AgentSentry, a novel inference-time detection and mitigation framework for tool-augmented LLM agents. To the best of our knowledge, AgentSentry is the first inference-time defense to model multi-turn IPI as a temporal causal takeover. It localizes takeover points via controlled counterfactual re-executions at tool-return boundaries and enables safe continuation through causally guided context purification that removes attack-induced deviations while preserving task-relevant evidence. We evaluate AgentSentry on the \textscAgentDojo benchmark across four task suites, three IPI attack families, and multiple black-box LLMs. AgentSentry eliminates successful attacks and maintains strong utility under attack, achieving an average Utility Under Attack (UA) of 74.55 %, improving UA by 20.8 to 33.6 percentage points over the strongest baselines without degrading benign performance.
[AI-54] RLHFless: Serverless Computing for Efficient RLHF
【速读】:该论文旨在解决同步强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)训练中因资源动态变化、组件间或内部空闲时间导致的效率低下与资源浪费问题。现有RLHF框架依赖于服务器式基础设施,难以应对细粒度的资源波动,进而限制了训练性能和成本效益。其解决方案的关键在于提出首个基于无服务器计算环境(serverless computing environment)的可扩展训练框架RLHFless:该框架通过自适应调节资源以匹配RLHF流水线中的动态需求、预计算共享前缀避免重复计算、采用成本感知的Actor扩缩策略以平衡响应长度差异带来的开销,并高效分配任务负载以减少函数内不平衡与空闲时间。实验表明,RLHFless相较最先进基线实现最高1.35倍加速比和44.8%的成本降低。
链接: https://arxiv.org/abs/2602.22718
作者: Rui Wei,Hanfei Yu,Shubham Jain,Yogarajan Sivakumar,Devesh Tiwari,Jian Li,Seung-Jong Park,Hao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences. Recent models, such as DeepSeek-R1, have also shown RLHF’s potential to improve LLM reasoning on complex tasks. In RL, inference and training co-exist, creating dynamic resource demands throughout the workflow. Compared to traditional RL, RLHF further challenges training efficiency due to expanding model sizes and resource consumption. Several RLHF frameworks aim to balance flexible abstraction and efficient execution. However, they rely on serverful infrastructures, which struggle with fine-grained resource variability. As a result, during synchronous RLHF training, idle time between or within RL components often causes overhead and resource wastage. To address these issues, we present RLHFless, the first scalable training framework for synchronous RLHF, built on serverless computing environments. RLHFless adapts to dynamic resource demands throughout the RLHF pipeline, pre-computes shared prefixes to avoid repeated computation, and uses a cost-aware actor scaling strategy that accounts for response length variation to find sweet spots with lower cost and higher speed. In addition, RLHFless assigns workloads efficiently to reduce intra-function imbalance and idle time. Experiments on both physical testbeds and a large-scale simulated cluster show that RLHFless achieves up to 1.35x speedup and 44.8% cost reduction compared to the state-of-the-art baseline. Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2602.22718 [cs.AI] (or arXiv:2602.22718v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.22718 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-55] Knob: A Physics-Inspired Gating Interface for Interpretable and Controllable Neural Dynamics
【速读】:该论文旨在解决现有神经网络校准方法将校准视为静态后处理优化任务的问题,忽略了现实世界推理中的动态性和时间特性,且缺乏直观的人机交互接口以在条件变化时动态调整模型行为。其解决方案的关键在于提出Knob框架,通过将神经门控动力学映射到二阶机械系统,建立阻尼比(damping ratio, ζ)和自然频率(natural frequency, ωn)等物理参数与神经门控之间的对应关系,从而构建可调的“安全阀”。核心机制采用logit级凸融合,作为输入自适应温度缩放,在模型分支预测冲突时降低置信度;并通过引入二阶微分方程(Knob-ODE)实现双模式推理:标准i.i.d.处理用于静态任务,状态保持处理用于连续数据流,使操作者能通过熟悉的物理类比调节“稳定性”与“敏感性”,从而实现可预测的人机协同调参。
链接: https://arxiv.org/abs/2602.22702
作者: Siyu Jiang,Sanshuai Cui,Hui Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing neural network calibration methods often treat calibration as a static, post-hoc optimization task. However, this neglects the dynamic and temporal nature of real-world inference. Moreover, existing methods do not provide an intuitive interface enabling human operators to dynamically adjust model behavior under shifting conditions. In this work, we propose Knob, a framework that connects deep learning with classical control theory by mapping neural gating dynamics to a second-order mechanical system. By establishing correspondences between physical parameters – damping ratio ( \zeta ) and natural frequency ( \omega_n ) – and neural gating, we create a tunable “safety valve”. The core mechanism employs a logit-level convex fusion, functioning as an input-adaptive temperature scaling. It tends to reduce model confidence particularly when model branches produce conflicting predictions. Furthermore, by imposing second-order dynamics (Knob-ODE), we enable a \textitdual-mode inference: standard i.i.d. processing for static tasks, and state-preserving processing for continuous streams. Our framework allows operators to tune “stability” and “sensitivity” through familiar physical analogues. This paper presents an exploratory architectural interface; we focus on demonstrating the concept and validating its control-theoretic properties rather than claiming state-of-the-art calibration performance. Experiments on CIFAR-10-C validate the calibration mechanism and demonstrate that, in Continuous Mode, the gate responses are consistent with standard second-order control signatures (step settling and low-pass attenuation), paving the way for predictable human-in-the-loop tuning.
[AI-56] IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation
【速读】:该论文旨在解决商业大语言模型(Large Language Models, LLMs)作为黑箱API服务时存在的可信性问题,即用户无法验证服务提供方是否正确执行推理、诚实报告token使用量,或是否存在经济动机驱动的恶意行为(如模型替换、量化滥用和token虚报)。解决方案的关键在于提出IMMACULATE框架,通过选择性地对少量请求使用可验证计算(verifiable computation)进行审计,在不依赖可信硬件或模型内部访问的前提下,实现对上述恶意行为的强检测保障,同时通过分摊密码学开销显著降低性能影响——实验表明其在密集模型和MoE模型上均能以低于1%的吞吐量损耗可靠区分良性与恶意执行。
链接: https://arxiv.org/abs/2602.22700
作者: Yanpei Guo,Wenjie Qu,Linyu Wu,Shengfang Zhai,Lionel Z. Wang,Ming Xu,Yue Liu,Binhang Yuan,Dawn Song,Jiaheng Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Commercial large language models are typically deployed as black-box API services, requiring users to trust providers to execute inference correctly and report token usage honestly. We present IMMACULATE, a practical auditing framework that detects economically motivated deviations-such as model substitution, quantization abuse, and token overbilling-without trusted hardware or access to model internals. IMMACULATE selectively audits a small fraction of requests using verifiable computation, achieving strong detection guarantees while amortizing cryptographic overhead. Experiments on dense and MoE models show that IMMACULATE reliably distinguishes benign and malicious executions with under 1% throughput overhead. Our code is published at this https URL.
[AI-57] oward Personalized LLM -Powered Agents : Foundations Evaluation and Future Directions
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)驱动的智能体在长期交互场景中如何实现真正意义上的个性化问题,即如何将用户个体特征和行为信号深度融入决策全流程,而不仅限于生成阶段的表面调整。其解决方案的关键在于构建一个以能力为导向的系统性框架,将个性化机制分解为四个相互关联的核心组件:用户画像建模(profile modeling)、记忆管理(memory)、规划(planning)与动作执行(action execution),并通过该架构分析用户信号如何被表示、传播并有效利用,揭示跨组件协同机制与设计权衡,从而推动从原型级个性化向可扩展、鲁棒且贴近真实应用场景的智能代理系统演进。
链接: https://arxiv.org/abs/2602.22680
作者: Yue Xu,Qian Chen,Zizhan Ma,Dongrui Liu,Wenxuan Wang,Xiting Wang,Li Xiong,Wenjie Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models have enabled agents that reason, plan, and interact with tools and environments to accomplish complex tasks. As these agents operate over extended interaction horizons, their effectiveness increasingly depends on adapting behavior to individual users and maintaining continuity across time, giving rise to personalized LLM-powered agents. In such long-term, user-dependent settings, personalization permeates the entire decision pipeline rather than remaining confined to surface-level generation. This survey provides a capability-oriented review of personalized LLM-powered agents. We organize the literature around four interdependent components: profile modeling, memory, planning, and action execution. Using this taxonomy, we synthesize representative methods and analyze how user signals are represented, propagated, and utilized, highlighting cross-component interactions and recurring design trade-offs. We further examine evaluation metrics and benchmarks tailored to personalized agents, summarize application scenarios spanning general assistance to specialized domains, and outline future directions for research and deployment. By offering a structured framework for understanding and designing personalized LLM-powered agents, this survey charts a roadmap toward more user-aligned, adaptive, robust, and deployable agentic systems, accelerating progress from prototype personalization to scalable real-world assistants.
[AI-58] AHBid: An Adaptable Hierarchical Bidding Framework for Cross-Channel Advertising WWW’2026
【速读】:该论文旨在解决在线广告中多渠道预算分配与约束管理的复杂性问题,尤其在动态市场环境下,传统优化方法缺乏灵活性,而强化学习方法难以捕捉历史依赖关系和观测模式。解决方案的关键在于提出一种可适应的分层竞价框架(AHBid),其核心创新是将生成式规划与实时控制相结合:高阶生成式规划器基于扩散模型(diffusion models)动态分配预算和约束,有效建模历史上下文与时序模式;同时引入约束执行机制与轨迹精修机制,确保合规性并提升对环境变化的适应能力;底层控制型竞价算法则融合历史知识与实时信息,显著增强系统的适应性和运营效率。
链接: https://arxiv.org/abs/2602.22650
作者: Xinxin Yang,Yangyang Tang,Yikun Zhou,Yaolei Liu,Yun Li,Bo Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures, accepted by WWW’2026
Abstract:In online advertising, the inherent complexity and dynamic nature of advertising environments necessitate the use of auto-bidding services to assist advertisers in bid optimization. This complexity is further compounded in multi-channel scenarios, where effective allocation of budgets and constraints across channels with distinct behavioral patterns becomes critical for optimizing return on investment. Current approaches predominantly rely on either optimization-based strategies or reinforcement learning techniques. However, optimization-based methods lack flexibility in adapting to dynamic market conditions, while reinforcement learning approaches often struggle to capture essential historical dependencies and observational patterns within the constraints of Markov Decision Process frameworks. To address these limitations, we propose AHBid, an Adaptable Hierarchical Bidding framework that integrates generative planning with real-time control. The framework employs a high-level generative planner based on diffusion models to dynamically allocate budgets and constraints by effectively capturing historical context and temporal patterns. We introduce a constraint enforcement mechanism to ensure compliance with specified constraints, along with a trajectory refinement mechanism that enhances adaptability to environmental changes through the utilization of historical data. The system further incorporates a control-based bidding algorithm that synergistically combines historical knowledge with real-time information, significantly improving both adaptability and operational efficacy. Extensive experiments conducted on large-scale offline datasets and through online A/B tests demonstrate the effectiveness of AHBid, yielding a 13.57% increase in overall return compared to existing baselines.
[AI-59] MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的路径规划代理在真实出行场景中缺乏系统性评估的问题,具体挑战包括多样化的路由需求、地图服务的非确定性以及实验结果难以复现。其解决方案的关键在于提出一个名为MobilityBench的可扩展基准测试平台,该平台基于来自高德地图(Amap)的大规模匿名用户查询构建,覆盖全球多个城市的多种路径规划意图;并通过设计一个确定性的API回放沙箱环境,消除实时服务带来的环境变量差异,从而实现可复现的端到端评估。此外,研究还提出了以结果有效性为核心的多维评估协议,涵盖指令理解、规划能力、工具使用和效率等多个维度,为LLM驱动的出行代理提供了标准化评测框架。
链接: https://arxiv.org/abs/2602.22638
作者: Zhiheng Song,Jingshuai Zhang,Chuan Qin,Chao Wang,Chao Chen,Longfei Xu,Kaikui Liu,Xiangxiang Chu,Hengshu Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at this https URL .
[AI-60] SideQuest: Model-Driven KV Cache Management for Long-Horizon Agent ic Reasoning
【速读】:该论文旨在解决长周期代理任务(如深度研究)中因多跳推理导致的KV缓存(Key-Value Cache)内存占用快速增长的问题,该问题会显著限制解码性能。现有基于启发式规则的KV缓存压缩技术难以有效支持多步骤推理模型。其解决方案的关键在于提出SideQuest方法——利用大推理模型(Large Reasoning Model, LRM)自身对上下文中的token有用性进行推理,从而实现动态KV缓存压缩;同时将压缩过程作为与主推理任务并行执行的辅助任务,避免管理相关token污染模型记忆空间。实验表明,仅用215个样本训练后,SideQuest在代理任务中可将峰值token使用量降低高达65%,且准确率损失最小,优于传统启发式压缩方法。
链接: https://arxiv.org/abs/2602.22603
作者: Sanjay Kariyappa,G. Edward Suh
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by tokens from external retrieval, causing memory usage to grow rapidly and limiting decode performance. While several KV cache compression techniques exist for long-context inputs, we find that existing heuristics fail to support multi-step reasoning models effectively. We address this challenge with SideQuest – a novel approach that leverages the Large Reasoning Model (LRM) itself to perform KV cache compression by reasoning about the usefulness of tokens in its context. To prevent the tokens associated with this management process from polluting the model’s memory, we frame KV cache compression as an auxiliary task executed in parallel to the main reasoning task. Our evaluations, using a model trained with just 215 samples, show that SideQuest reduces peak token usage by up to 65% on agentic tasks with minimal degradation in accuracy, outperforming heuristic-based KV cache compression techniques.
[AI-61] ransformers converge to invariant algorithmic cores
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)内部计算机制的可解释性问题,即如何识别出在不同训练过程中稳定存在的、反映真实计算本质的结构,而非仅由特定训练轨迹决定的权重配置。解决方案的关键在于提取“算法核心”(algorithmic cores)——即对任务性能必要且充分的紧凑子空间。研究发现,尽管独立训练的Transformer模型采用不同的权重配置,但它们会收敛到相同的算法核心;例如,Markov链Transformer在几乎正交的子空间中嵌入三维核心并恢复相同的转移谱,而模块加法Transformer在“grokking”阶段发现紧凑的循环算子,并能预测从记忆到泛化的过渡。这些结果表明,Transformer的计算本质上围绕低维不变量组织,因此机制解码应聚焦于这些跨训练和尺度稳定的计算本质,而非实现细节。
链接: https://arxiv.org/abs/2602.22600
作者: Joshua S. Schiffman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants – the computational essence – rather than implementation-specific details.
[AI-62] Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
【速读】:该论文旨在解决人类评估数据在AI模型训练与评估中常被当作无误差测量的问题,而实际上这些数据受评分者效应(如评分严格度和中心化倾向)影响显著,导致结论不可靠。解决方案的关键在于将心理测量学中的评分者模型(特别是多面Rasch模型)引入AI评估流程,通过分离输出质量与评分者行为,实现对人类判断的校正,从而提升评估结果的可靠性与有效性。
链接: https://arxiv.org/abs/2602.22585
作者: Jodi M. Casabianca,Maggie Beiting-Parrish
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 5 figures, 1 table; The 16th Annual Learning Analytics and Knowledge Conference (LAK) Workshop on LLM Psychometrics, April 27, 2026, Bergen, Norway
Abstract:Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects, severity and centrality, that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers more principled and transparent use of human data, enabling developers to make decisions based on adjusted scores rather than raw, error-prone ratings. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.
[AI-63] S2O: Early Stopping for Sparse Attention via Online Permutation
【速读】:该论文旨在解决注意力机制(Attention)在长序列推理中因计算复杂度与序列长度呈平方关系而导致的效率瓶颈问题。现有基于块粒度稀疏化的方案虽可降低延迟,但受限于粗粒度块结构,难以突破内在稀疏性上限。其解决方案的关键在于提出S2O方法:通过在线排列(online permutation)实现稀疏注意力的早期终止(early stopping),借鉴内存系统中虚拟地址到物理地址映射的思想,重新分解FlashAttention执行流程,使推理过程能以非连续方式加载高优先级token;同时引入基于重要性评分的离线索引引导加载策略,在极低预处理开销下聚焦于高贡献块,并设定阈值动态终止计算,从而在控制误差预算的前提下显著提升有效稀疏性与计算效率。
链接: https://arxiv.org/abs/2602.22575
作者: Yu Zhang,Songwei Liu,Chenqian Yan,Sheng Lin,Beichen Ning,Fangmin Chen,Xing Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present S2O, which performs early stopping for sparse attention via online permutation. Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks. Building on this importance-guided online permutation for loading, S2O further introduces an early-stopping rule: computation proceeds from high to low importance; once the current block score falls below a threshold, S2O terminates early and skips the remaining low-contribution blocks, thereby increasing effective sparsity and reducing computation under a controlled error budget. As a result, S2O substantially raises the practical sparsity ceiling. On Llama-3.1-8B under a 128K context, S2O reduces single-operator MSE by 3.82 \times at matched sparsity, and reduces prefill compute density by 3.31 \times at matched MSE; meanwhile, it preserves end-to-end accuracy and achieves 7.51 \times attention and 3.81 \times end-to-end speedups. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.22575 [cs.LG] (or arXiv:2602.22575v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.22575 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-64] Operationalizing Fairness: Post-Hoc Threshold Optimization Under Hard Resource Limits
【速读】:该论文旨在解决高风险场景下机器学习模型在预测安全性(predictive safety)与算法公平性(algorithmic fairness)之间难以平衡的问题,尤其针对现有公平干预方法通常假设资源无约束且采用群体特定决策阈值、违反反歧视法规的局限性。其解决方案的关键在于提出一种后验(post-hoc)、模型无关的阈值优化框架,通过强制使用单一全局决策阈值以确保法律合规,并结合参数化伦理损失函数与有界决策规则,从数学上防止干预量超过可用资源上限。该框架在严格容量约束下实现了安全、效率与公平性的协同优化,实验证明容量限制在多数情况下主导伦理目标,且在极端资源受限(如25%容量上限)时仍能维持有效风险识别能力,显著优于传统无约束公平策略。
链接: https://arxiv.org/abs/2602.22560
作者: Moirangthem Tiken Singh,Amit Kalita,Sapam Jitu Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The deployment of machine learning in high-stakes domains requires a balance between predictive safety and algorithmic fairness. However, existing fairness interventions often as- sume unconstrained resources and employ group-specific decision thresholds that violate anti- discrimination regulations. We introduce a post-hoc, model-agnostic threshold optimization framework that jointly balances safety, efficiency, and equity under strict and hard capacity constraints. To ensure legal compliance, the framework enforces a single, global decision thresh- old. We formulated a parameterized ethical loss function coupled with a bounded decision rule that mathematically prevents intervention volumes from exceeding the available resources. An- alytically, we prove the key properties of the deployed threshold, including local monotonicity with respect to ethical weighting and the formal identification of critical capacity regimes. We conducted extensive experimental evaluations on diverse high-stakes datasets. The principal re- sults demonstrate that capacity constraints dominate ethical priorities; the strict resource limit determines the final deployed threshold in over 80% of the tested configurations. Furthermore, under a restrictive 25% capacity limit, the proposed framework successfully maintains high risk identification (recall ranging from 0.409 to 0.702), whereas standard unconstrained fairness heuristics collapse to a near-zero utility. We conclude that theoretical fairness objectives must be explicitly subordinated to operational capacity limits to remain in deployment. By decou- pling predictive scoring from policy evaluation and strictly bounding intervention rates, this framework provides a practical and legally compliant mechanism for stakeholders to navigate unavoidable ethical trade-offs in resource-constrained environments.
[AI-65] CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)安全机制依赖静态微调分类器所带来的适应性僵化问题,即无法在不进行昂贵重训练的情况下执行新的治理规则。其解决方案的关键在于提出 CourtGuard——一个基于检索增强的多智能体框架,将安全评估重构为基于外部政策文档的证据辩论(Evidentiary Debate)。该框架通过协调对抗性辩论实现对模型输出的动态合规性判断,从而在无需微调模型权重的前提下,在7个安全基准上达到最先进性能,并展现出零样本适配能力与自动化数据编纂和审计功能,为满足当前及未来AI治理监管要求提供了一种可解释、可扩展且鲁棒的路径。
链接: https://arxiv.org/abs/2602.22557
作者: Umid Suleymanov,Rufiz Bayramov,Suad Gafarli,Seljan Musayeva,Taghi Mammadov,Aynur Akhundlu,Murat Kantarcioglu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review
Abstract:Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity, the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.
[AI-66] Autoregressive Visual Decoding from EEG Signals
【速读】:该论文旨在解决脑电图(EEG)信号与图像数据之间模态差异大、现有方法依赖复杂多阶段适配导致一致性差和误差累积,以及基于扩散模型的生成方式计算开销过高难以应用于实际脑机接口(BCI)系统的问题。其解决方案的关键在于提出一种轻量高效的自回归视觉解码框架AVDE:首先利用预训练EEG模型LaBraM并通过对比学习对齐EEG与图像表征;其次采用基于“下一尺度预测”策略的自回归生成机制——将图像编码为多尺度token映射,再通过Transformer从EEG嵌入作为最粗粒度起点逐级预测更精细的token,从而在保持输入EEG与重建图像直接关联的同时实现连贯且高效的图像生成。实验表明,AVDE在图像检索与重建任务中优于现有最优方法,参数量仅为10%,且生成过程体现人类视觉感知的层次性特征。
链接: https://arxiv.org/abs/2602.22555
作者: Sicheng Dai,Hongwang Xiao,Shan Yu,Qiwei Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limit their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a “next-scale prediction” strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications.
[AI-67] Requesting Expert Reasoning : Augmenting LLM Agents with Learned Collaborative Intervention
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在特定领域中因缺乏长尾知识而导致推理失败的问题,尤其是在训练数据中未覆盖的专业知识场景下。传统方法依赖人类专家提供补充知识,但其反馈往往结构松散且不可靠,难以直接整合进智能体的决策流程。解决方案的关键在于提出AHCE(Active Human-Augmented Challenge Engagement)框架,其中的核心是人类反馈模块(Human Feedback Module, HFM),它通过一个学习得到的策略将人类专家视为可交互的推理工具,使智能体能够主动、高效地请求专家指导,从而显著提升任务成功率——实验表明,在Minecraft环境中,该方法在常规难度任务上提升32%,高难度任务上提升近70%,且仅需极少的人类干预。这标志着从被动求助向主动协同推理的范式转变。
链接: https://arxiv.org/abs/2602.22546
作者: Zhiming Wang,Jinwei He,Feng Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM) based agents excel at general reasoning but often fail in specialized domains where success hinges on long-tail knowledge absent from their training data. While human experts can provide this missing knowledge, their guidance is often unstructured and unreliable, making its direct integration into an agent’s plan problematic. To address this, we introduce AHCE (Active Human-Augmented Challenge Engagement), a framework for on-demand Human-AI collaboration. At its core, the Human Feedback Module (HFM) employs a learned policy to treat the human expert as an interactive reasoning tool. Extensive experiments in Minecraft demonstrate the framework’s effectiveness, increasing task success rates by 32% on normal difficulty tasks and nearly 70% on highly difficult tasks, all with minimal human intervention. Our work demonstrates that successfully augmenting agents requires learning how to request expert reasoning, moving beyond simple requests for help.
[AI-68] Agent ic AI for Intent-driven Optimization in Cell-free O-RAN
【速读】:该论文旨在解决在无源接入网(O-RAN)架构中,如何实现复杂用户意图的自动翻译与优化问题,特别是当多个基于大语言模型(LLM)的智能体(agent)需协同完成多目标优化任务时,现有方法难以有效处理跨代理协调的问题。解决方案的关键在于提出一个分层式代理框架:由监督代理将操作员意图转化为优化目标和最小速率要求;用户加权代理通过记忆模块检索历史经验以确定预编码权重;若涉及节能目标,则激活射频单元(O-RU)管理代理,利用深度强化学习(DRL)算法动态选择活跃O-RU集合;监控代理则持续测量用户数据速率并协调其他代理确保服务质量。此外,为提升可扩展性,采用参数高效微调(PEFT)技术,使同一底层LLM支持多种代理角色,显著降低内存占用达92%。
链接: https://arxiv.org/abs/2602.22539
作者: Mohammad Hossein Shokouhi,Vincent W.S. Wong
机构: 未知
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Accepted by IEEE International Conference on Communications (ICC), Glasgow, UK, May 2026
Abstract:Agentic artificial intelligence (AI) is emerging as a key enabler for autonomous radio access networks (RANs), where multiple large language model (LLM)-based agents reason and collaborate to achieve operator-defined intents. The open RAN (O-RAN) architecture enables the deployment and coordination of such agents. However, most existing works consider simple intents handled by independent agents, while complex intents that require coordination among agents remain unexplored. In this paper, we propose an agentic AI framework for intent translation and optimization in cell-free O-RAN. A supervisor agent translates the operator intents into an optimization objective and minimum rate requirements. Based on this information, a user weighting agent retrieves relevant prior experience from a memory module to determine the user priority weights for precoding. If the intent includes an energy-saving objective, then an open radio unit (O-RU) management agent will also be activated to determine the set of active O-RUs by using a deep reinforcement learning (DRL) algorithm. A monitoring agent measures and monitors the user data rates and coordinates with other agents to guarantee the minimum rate requirements are satisfied. To enhance scalability, we adopt a parameter-efficient fine-tuning (PEFT) method that enables the same underlying LLM to be used for different agents. Simulation results show that the proposed agentic AI framework reduces the number of active O-RUs by 41.93% when compared with three baseline schemes in energy-saving mode. Using the PEFT method, the proposed framework reduces the memory usage by 92% when compared with deploying separate LLM agents.
[AI-69] Predicting Tennis Serve directions with Machine Learning
【速读】:该论文旨在解决专业网球比赛中发球方在选择第一发球方向时的决策机制问题,以揭示顶尖选手是否采用混合策略(mixed-strategy model)以及影响其决策的因素。解决方案的关键在于构建一种基于机器学习的预测模型,通过特征工程提取与发球行为相关的多维变量(如球员状态、比赛情境等),实现了对男性球员约49%、女性球员约44%的平均预测准确率,从而验证了高水平运动员在发球决策中具有一定的随机性和适应性,并提示疲劳和情境信息可能显著影响发球选择及接发球方的预判反应。
链接: https://arxiv.org/abs/2602.22527
作者: Ying Zhu,Ruthuparna Naikar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Serves, especially first serves, are very important in professional tennis. Servers choose their serve directions strategically to maximize their winning chances while trying to be unpredictable. On the other hand, returners try to predict serve directions to make good returns. The mind game between servers and returners is an important part of decision-making in professional tennis matches. To help understand the players’ serve decisions, we have developed a machine learning method for predicting professional tennis players’ first serve directions. Through feature engineering, our method achieves an average prediction accuracy of around 49% for male players and 44% for female players. Our analysis provides some evidence that top professional players use a mixed-strategy model in serving decisions and that fatigue might be a factor in choosing serve directions. Our analysis also suggests that contextual information is perhaps more important for returners’ anticipatory reactions than previously thought.
[AI-70] A Mathematical Theory of Agency and Intelligence
【速读】:该论文旨在解决当前人工智能系统在复杂环境中缺乏对资源利用效率的反馈机制问题,即现有系统虽能做出看似成功的预测,但其与环境的交互可能已逐渐退化,而这种退化无法被察觉。解决方案的关键在于提出一个理论完备的度量指标——双预测性(bipredictability, P),它量化了系统观测、动作与结果之间共享的信息比例。研究表明,P具有严格的理论边界:量子系统中可达1,经典系统中不超过0.5,引入自主决策(agency)后进一步降低。通过物理系统(双摆)、强化学习代理和多轮大语言模型对话实证验证了这些边界,从而区分了“代理能力”(agency)与“智能”(intelligence)——前者仅指基于预测采取行动的能力,后者还需具备从交互中学习、自我监控学习效率,并动态调整观察、行动与结果范围以恢复有效学习的能力。由此指出当前AI仅具备代理能力,未达智能水平;并提出受生物脑皮层-丘脑调控启发的实时监测P的反馈架构,为构建自适应、鲁棒的人工智能奠定基础。
链接: https://arxiv.org/abs/2602.22519
作者: Wael Hafez,Chenan Wei,Rodrigo Felipe,Amir Nazeri,Cameron Reid
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 20 pages, 4 figuers
Abstract:To operate reliably under changing conditions, complex systems require feedback on how effectively they use resources, not just whether objectives are met. Current AI systems process vast information to produce sophisticated predictions, yet predictions can appear successful while the underlying interaction with the environment degrades. What is missing is a principled measure of how much of the total information a system deploys is actually shared between its observations, actions, and outcomes. We prove this shared fraction, which we term bipredictability, P, is intrinsic to any interaction, derivable from first principles, and strictly bounded: P can reach unity in quantum systems, P equal to, or smaller than 0.5 in classical systems, and lower once agency (action selection) is introduced. We confirm these bounds in a physical system (double pendulum), reinforcement learning agents, and multi turn LLM conversations. These results distinguish agency from intelligence: agency is the capacity to act on predictions, whereas intelligence additionally requires learning from interaction, self-monitoring of its learning effectiveness, and adapting the scope of observations, actions, and outcomes to restore effective learning. By this definition, current AI systems achieve agency but not intelligence. Inspired by thalamocortical regulation in biological systems, we demonstrate a feedback architecture that monitors P in real time, establishing a prerequisite for adaptive, resilient AI.
[AI-71] SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation
【速读】:该论文旨在解决当前人机交互中对无障碍、直观控制的需求,尤其是为手语使用者提供一种无需依赖词汇标注(gloss)的视觉-语言-动作(Vision-Language-Action, VLA)框架,以实现更自然、可扩展的机器人控制。传统方法通常依赖于中间层的词汇标注作为监督信号,这不仅增加标注成本,还可能因信息压缩导致语义损失。其解决方案的关键在于提出一个无词汇标注(gloss-free)的端到端映射机制,直接将视觉手语手势转化为语义指令,并通过几何归一化、时序平滑和词法优化等步骤,将连续手势流转换为稳定一致的语言命令,从而在安全关键场景下实现高可靠性与低延迟的字母级指拼(finger-spelling)通信通道。此设计为未来集成基于Transformer的无词汇手语模型奠定了基础,推动了多模态具身智能向包容性和可扩展方向发展。
链接: https://arxiv.org/abs/2602.22514
作者: Xinyu Tan,Ningwei Bai,Harry Gardener,Zhengyang Zhong,Luoyu Zhang,Liuhaichen Yang,Zhekai Duan,Monkgogi Galeitsiwe,Zezhi Tang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 7 pages, 2 figures
Abstract:We present, to our knowledge, the first sign language-driven Vision-Language-Action (VLA) framework for intuitive and inclusive human-robot interaction. Unlike conventional approaches that rely on gloss annotations as intermediate supervision, the proposed system adopts a gloss-free paradigm and directly maps visual sign gestures to semantic instructions. This design reduces annotation cost and avoids the information loss introduced by gloss representations, enabling more natural and scalable multimodal interaction. In this work, we focus on a real-time alphabet-level finger-spelling interface that provides a robust and low-latency communication channel for robotic control. Compared with large-scale continuous sign language recognition, alphabet-level interaction offers improved reliability, interpretability, and deployment feasibility in safety-critical embodied environments. The proposed pipeline transforms continuous gesture streams into coherent language commands through geometric normalization, temporal smoothing, and lexical refinement, ensuring stable and consistent interaction. Furthermore, the framework is designed to support future integration of transformer-based gloss-free sign language models, enabling scalable word-level and sentence-level semantic understanding. Experimental results demonstrate the effectiveness of the proposed system in grounding sign-derived instructions into precise robotic actions under diverse interaction scenarios. These results highlight the potential of the framework to advance accessible, scalable, and multimodal embodied intelligence. Comments: 7 pages, 2 figures Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY) Cite as: arXiv:2602.22514 [cs.RO] (or arXiv:2602.22514v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2602.22514 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-72] Mirroring the Mind: Distilling Human-Like Metacognitive Strategies into Large Language Models
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在复杂推理任务中因结构脆弱性导致的推理失败问题,即模型虽能推导出有效的中间步骤,却仍无法得出正确答案。研究表明,此类失败并非源于推理能力不足,而是由于缺乏自我调节控制机制,表现为逻辑探索失控或未能识别逻辑充分性。解决方案的关键在于提出元认知行为调优(Metacognitive Behavioral Tuning, MBT),通过两种互补策略显式注入元认知行为:(1) MBT-S 从零构建严谨的推理轨迹;(2) MBT-R 重写学生初始推理路径以稳定内在探索模式。实验表明,MBT 在多跳问答基准上显著优于基线方法,在提升准确性的同时大幅降低 Token 消耗,验证了内化元认知策略可增强推理稳定性与鲁棒性。
链接: https://arxiv.org/abs/2602.22508
作者: Ik-hwan Kim,Hyeongrok Han,Mingi Jung,Sangwon Yu,Jinseok Hong,Sang Hun Kim,Yoonyoung Choi,Sungroh Yoon
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 31 pages
Abstract:Large Reasoning Models (LRMs) often exhibit structural fragility in complex reasoning tasks, failing to produce correct answers even after successfully deriving valid intermediate steps. Through systematic analysis, we observe that these failures frequently stem not from a lack of reasoning capacity, but from a deficiency in self-regulatory control, where valid logic is destabilized by uncontrolled exploration or the failure to recognize logical sufficiency. Motivated by this observation, we propose Metacognitive Behavioral Tuning (MBT), a post-training framework that explicitly injects metacognitive behaviors into the model’s thought process. MBT implements this via two complementary formulations: (1) MBT-S, which synthesizes rigorous reasoning traces from scratch, and (2) MBT-R, which rewrites the student’s initial traces to stabilize intrinsic exploration patterns. Experiments across multi-hop QA benchmarks demonstrate that MBT consistently outperforms baselines, achieving notable gains on challenging benchmarks. By effectively eliminating reasoning collapse, MBT achieves higher accuracy with significantly reduced token consumption, demonstrating that internalizing metacognitive strategies leads to more stable and robust reasoning.
[AI-73] Mapping the Landscape of Artificial Intelligence in Life Cycle Assessment Using Large Language Models
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)与生命周期评估(Life Cycle Assessment, LCA)交叉研究中缺乏系统性、全面性综述的问题。现有文献虽已广泛采用机器学习(Machine Learning, ML)算法支持LCA各阶段,但对整体发展趋势、新兴主题及未来方向的整合分析仍显不足。其解决方案的关键在于引入大语言模型(Large Language Models, LLMs)驱动的文本挖掘方法,结合传统文献综述技术,构建一个动态且高效的框架,以识别AI-LCA领域的研究趋势与概念模式(themes),从而实现大规模、可复现的跨领域综述,并为计算高效型LCA提供基于AI技术发展的路径指引。
链接: https://arxiv.org/abs/2602.22500
作者: Anastasija Mensikova,Donna M. Rizzo,Kathryn Hinkelman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Integration of artificial intelligence (AI) into life cycle assessment (LCA) has accelerated in recent years, with numerous studies successfully adapting machine learning algorithms to support various stages of LCA. Despite this rapid development, comprehensive and broad synthesis of AI-LCA research remains limited. To address this gap, this study presents a detailed review of published work at the intersection of AI and LCA, leveraging large language models (LLMs) to identify current trends, emerging themes, and future directions. Our analyses reveal that as LCA research continues to expand, the adoption of AI technologies has grown dramatically, with a noticeable shift toward LLM-driven approaches, continued increases in ML applications, and statistically significant correlations between AI approaches and corresponding LCA stages. By integrating LLM-based text-mining methods with traditional literature review techniques, this study introduces a dynamic and effective framework capable of capturing both high-level research trends and nuanced conceptual patterns (themes) across the field. Collectively, these findings demonstrate the potential of LLM-assisted methodologies to support large-scale, reproducible reviews across broad research domains, while also evaluating pathways for computationally-efficient LCA in the context of rapidly developing AI technologies. In doing so, this work helps LCA practitioners incorporate state-of-the-art tools and timely insights into environmental assessments that can enhance the rigor and quality of sustainability-driven decisions and decision-making processes.
[AI-74] Reinforcement-aware Knowledge Distillation for LLM Reasoning
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)后训练中知识蒸馏(Knowledge Distillation, KD)所面临的分布不匹配与目标干扰问题:传统KD方法依赖固定教师模型的轨迹或基于KL散度的正则化项,在与RL结合时,易因学生模型滚动分布变化导致监督信号失准,且KL正则项可能与奖励最大化目标冲突,需复杂损失权重调优。其解决方案的关键在于提出一种面向强化学习的蒸馏方法(RL-aware Distillation, RLAD),核心组件为信任区域比率蒸馏(Trust Region Ratio Distillation, TRRD),该机制以教师-旧策略混合分布为锚点,采用类似PPO/GRPO的似然比目标替代KL正则项,实现对学生轨迹的优势感知、信任区域约束下的选择性模仿,从而自然平衡探索、利用与模仿三者关系,并在逻辑推理和数学基准上显著优于离线蒸馏、标准GRPO及基于KL的在线师生蒸馏方法。
链接: https://arxiv.org/abs/2602.22495
作者: Zhaoyang Zhang,Shuli Jiang,Yantao Shen,Yuting Zhang,Dhananjay Ram,Shuo Yang,Zhuowen Tu,Wei Xia,Stefano Soatto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student’s evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL – guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher–old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
[AI-75] Explainability-Aware Evaluation of Transfer Learning Models for IoT DDoS Detection Under Resource Constraints
【速读】:该论文旨在解决资源受限环境下物联网(IoT)基础设施面临的分布式拒绝服务(DDoS)攻击检测问题,尤其关注现有迁移学习模型在实际部署中可靠性、计算可行性与可解释性不足的局限。解决方案的关键在于通过一个面向可解释性的实证评估框架,系统比较七种预训练卷积神经网络架构在多类IoT DDoS检测任务中的综合表现,其中不仅包含传统性能指标,还引入了可靠性统计量(如MCC、Youden Index、置信区间)、延迟与训练成本分析,并结合Grad-CAM和SHAP方法进行可解释性评估。结果表明,DenseNet169在可靠性和可解释性对齐方面最优,而MobileNetV3则在边缘计算场景下提供了良好的延迟-准确率权衡,凸显了将性能、可靠性与可解释性协同考量对于IoT DDoS检测模型选择的重要性。
链接: https://arxiv.org/abs/2602.22488
作者: Nelly Elsayed
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 24 pages, under review
Abstract:Distributed denial-of-service (DDoS) attacks threaten the availability of Internet of Things (IoT) infrastructures, particularly under resource-constrained deployment conditions. Although transfer learning models have shown promising detection accuracy, their reliability, computational feasibility, and interpretability in operational environments remain insufficiently explored. This study presents an explainability-aware empirical evaluation of seven pre-trained convolutional neural network architectures for multi-class IoT DDoS detection using the CICDDoS2019 dataset and an image-based traffic representation. The analysis integrates performance metrics, reliability-oriented statistics (MCC, Youden Index, confidence intervals), latency and training cost assessment, and interpretability evaluation using Grad-CAM and SHAP. Results indicate that DenseNet and MobileNet-based architectures achieve strong detection performance while demonstrating superior reliability and compact, class-consistent attribution patterns. DenseNet169 offers the strongest reliability and interpretability alignment, whereas MobileNetV3 provides an effective latency-accuracy trade-off for fog-level deployment. The findings emphasize the importance of combining performance, reliability, and explainability criteria when selecting deep learning models for IoT DDoS detection.
[AI-76] ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在无需调用外部求解器的情况下,能否直接生成约束优化问题的正确可行解这一关键问题。现有基准主要评估LLM将优化问题转化为求解器代码的能力,而忽略了其直接推理并输出结构化解决方案的能力。为此,作者提出了ConstraintBench,一个涵盖10个运筹学领域的基准测试集,所有任务均以自然语言描述场景、实体、约束与目标函数,并由Gurobi求解器验证最优性;模型需输出可被确定性验证器逐约束检查的结构化解。关键创新在于构建了一个端到端的评估框架,明确揭示了当前LLM在可行性方面存在显著瓶颈(最高仅65.0%约束满足率),且即使可行解也常偏离最优值(平均达Gurobi最优目标的89–96%),从而为后续研究指明方向:提升约束理解与可行性保障能力是突破当前局限的核心路径。
链接: https://arxiv.org/abs/2602.22465
作者: Joseph Tso,Preston Schmittou,Quan Huynh,Jibran Hutchins
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint. 10 pages, 1 figure, 6 tables. Benchmark and evaluation code will be publicly released
Abstract:Large language models are increasingly applied to operational decision-making where the underlying structure is constrained optimization. Existing benchmarks evaluate whether LLMs can formulate optimization problems as solver code, but leave open a complementary question. Can LLMs directly produce correct solutions to fully specified constrained optimization problems without access to a solver? We introduce ConstraintBench, a benchmark for evaluating LLMs on direct constrained optimization across 10 operations research domains, with all ground-truth solutions verified by the Gurobi solver. Each task presents a natural-language scenario with entities, constraints, and an optimization objective; the model must return a structured solution that a deterministic verifier checks against every constraint and the solver-proven optimum. We evaluate six frontier models on 200 tasks and find that feasibility, not optimality, is the primary bottleneck. The best model achieves only 65.0% constraint satisfaction, yet feasible solutions average 89 to 96% of the Gurobi-optimal objective. No model exceeds 30.5% on joint feasibility and optimality within 0.1% of the solver reference. Per-domain analysis shows large variation in difficulty, with average feasibility spanning from 83.3% in the production mix domain to 0.8% in the crew assignment domain. Further, systematic failure modes include duration constraint misunderstanding, entity hallucination, and a feasibility-optimality decoupling in facility location and vehicle routing where models achieve high feasibility but 0% optimality. ConstraintBench and all evaluation infrastructure will be publicly released.
[AI-77] Automating the Detection of Requirement Dependencies Using Large Language Models
【速读】:该论文旨在解决软件需求工程中需求依赖关系自动识别的难题,尤其针对自然语言(Natural Language, NL)需求描述中存在的复杂耦合、模糊性及动态变化导致的手动识别效率低、准确性差的问题。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的自动化方法 LEREDD,该方法融合检索增强生成(Retrieval-Augmented Generation, RAG)与上下文学习(In-Context Learning, ICL),能够直接从 NL 需求文本中识别多种类型的需求依赖关系。实验表明,LEREDD 在准确率(0.93)和 F1 分数(0.84)上显著优于现有基线方法,尤其在细粒度依赖类型检测上表现突出,相对基线平均提升达 94.87% 至 105.41% 的 F1 分数。
链接: https://arxiv.org/abs/2602.22456
作者: Ikram Darif,Feifei Niu,Manel Abdellatif,Lionel C. Briand,Ramesh S.,Arun Adiththan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Requirements are inherently interconnected through various types of dependencies. Identifying these dependencies is essential, as they underpin critical decisions and influence a range of activities throughout software development. However, this task is challenging, particularly in modern software systems, given the high volume of complex, coupled requirements. These challenges are further exacerbated by the ambiguity of Natural Language (NL) requirements and their constant change. Consequently, requirement dependency detection is often overlooked or performed manually. Large Language Models (LLMs) exhibit strong capabilities in NL processing, presenting a promising avenue for requirement-related tasks. While they have shown to enhance various requirements engineering tasks, their effectiveness in identifying requirement dependencies remains unexplored. In this paper, we introduce LEREDD, an LLM-based approach for automated detection of requirement dependencies that leverages Retrieval-Augmented Generation (RAG) and In-Context Learning (ICL). It is designed to identify diverse dependency types directly from NL requirements. We empirically evaluate LEREDD against two state-of-the-art baselines. The results show that LEREDD provides highly accurate classification of dependent and non-dependent requirements, achieving an accuracy of 0.93, and an F1 score of 0.84, with the latter averaging 0.96 for non-dependent cases. LEREDD outperforms zero-shot LLMs and baselines, particularly in detecting fine-grained dependency types, where it yields average relative gains of 94.87% and 105.41% in F1 scores for the Requires dependency over the baselines. We also provide an annotated dataset of requirement dependencies encompassing 813 requirement pairs across three distinct systems to support reproducibility and future research.
[AI-78] CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines
【速读】:该论文旨在解决具身智能体(embodied agent)管道中动作可行性评分器(action feasibility scorer)的可靠性问题,即在规划或推理之前,如何准确识别当前状态下哪些候选动作是物理上可执行的。现有方法采用监督微调(SFT)训练评分模型,但SFT将每个候选动作独立处理,未能显式教导模型区分物理正确与细微错误的动作。解决方案的关键在于提出对比世界模型(Contrastive World Model, CWM),通过InfoNCE对比损失函数结合难负例挖掘(hard-mined negative examples)对大语言模型(LLM)进行微调,使有效动作在评分空间中远离无效动作,尤其强调语义相近但物理不兼容的难负例。实验表明,CWM在ScienceWorld基准上显著优于SFT,在精确率@1和AUC-ROC指标上提升明显,并在分布外压力条件下保持更优的安全裕度,验证了对比学习能更忠实捕捉物理可行性表征。
链接: https://arxiv.org/abs/2602.22452
作者: Chayan Banerjee
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:A reliable action feasibility scorer is a critical bottleneck in embodied agent pipelines: before any planning or reasoning occurs, the agent must identify which candidate actions are physically executable in the current state. Existing approaches use supervised fine-tuning (SFT) to train action scorers, but SFT treats each candidate independently and does not explicitly teach the model to discriminate between actions that are physically correct and those that are subtly wrong. We propose the Contrastive World Model (CWM), which fine-tunes a large language model (LLM) as an action scorer using an InfoNCE contrastive objective with hard-mined negative examples. The key idea is to push valid actions away from invalid ones in scoring space, with special emphasis on hard negatives: semantically similar but physically incompatible candidates. We evaluate CWM on the ScienceWorld benchmark through two studies. First, an intrinsic affordance evaluation on 605 hard-negative test pairs shows that CWM outperforms SFT by +6.76 percentage points on Precision@1 for minimal-edit negatives – cases where a single word changes the physical outcome – and achieves a higher AUC-ROC (0.929 vs. 0.906). Second, a live filter characterisation study measures how well CWM ranks gold-path actions against all valid environment actions during task execution. Under out-of-distribution stress conditions, CWM maintains a significantly better safety margin (-2.39) than SFT (-3.96), indicating that the gold action is ranked closer to the top. These results support the hypothesis that contrastive training induces representations that capture physical feasibility more faithfully than SFT alone.
[AI-79] Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace
【速读】:该论文旨在解决代理型大语言模型(Agentic Large Language Model)系统在自动化任务执行过程中因隐式提示注入(implicit prompt injection)引发的“静默外泄”(silent egress)安全风险问题,即恶意网页通过自动提取的URL预览内容(如标题、元数据和片段)向模型注入指令,导致敏感运行时上下文被隐蔽地外传,而用户界面显示却无异常。解决方案的关键在于识别并阻断网络层的异常出站请求,而非仅依赖输出端的安全检查;研究发现,在系统层和网络层实施控制措施(如域名白名单和重定向链分析)比单纯强化提示层防御更为有效,并提出应将网络外泄视为首要安全目标,同时建议采用溯源追踪与能力隔离等架构改进以实现更深层次的防护。
链接: https://arxiv.org/abs/2602.22450
作者: Qianlong Lan,Anuj Kaul,Shaun Jones,Stephanie Westrum
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic large language model systems increasingly automate tasks by retrieving URLs and calling external tools. We show that this workflow gives rise to implicit prompt injection: adversarial instructions embedded in automatically generated URL previews, including titles, metadata, and snippets, can introduce a system-level risk that we refer to as silent egress. Using a fully local and reproducible testbed, we demonstrate that a malicious web page can induce an agent to issue outbound requests that exfiltrate sensitive runtime context, even when the final response shown to the user appears harmless. In 480 experimental runs with a qwen2.5:7b-based agent, the attack succeeds with high probability (P (egress) =0.89), and 95% of successful attacks are not detected by output-based safety checks. We also introduce sharded exfiltration, where sensitive information is split across multiple requests to avoid detection. This strategy reduces single-request leakage metrics by 73% (Leak@1) and bypasses simple data loss prevention mechanisms. Our ablation results indicate that defenses applied at the prompt layer offer limited protection, while controls at the system and network layers, such as domain allowlisting and redirect-chain analysis, are considerably more effective. These findings suggest that network egress should be treated as a first-class security outcome in agentic LLM systems. We outline architectural directions, including provenance tracking and capability isolation, that go beyond prompt-level hardening.
[AI-80] ECHO: Encoding Communities via High-order Operators
【速读】:该论文旨在解决 attributed networks(属性网络)中社区检测面临的两大核心挑战:一是传统拓扑算法忽视语义特征,二是图神经网络(Graph Neural Networks, GNNs)在密集或异质连接网络中遭遇语义过平滑(semantic over-smoothing)和系统级内存瓶颈(Systems Wall),即由于成对聚类导致的 O(N²) 内存开销。其解决方案的关键在于提出 ECHO(Encoding Communities via High order Operators),一种可扩展的自监督架构,将社区检测重构为一个自适应的多尺度扩散过程。ECHO 的核心创新包括:1)拓扑感知路由器(Topology Aware Router),能够自动分析图结构稀疏性、密度与同配性等结构性启发式信息,动态选择最优归纳偏置以防止异质性污染并促进语义稠密化;2)基于内存分片的全批对比目标与新颖的分块 O(N·K) 相似度提取方法,彻底规避了传统 O(N²) 内存限制,同时保持全局梯度的数学精度。这一设计实现了拓扑与语义特征的协同优化,在大规模真实社交网络(超 160 万节点)上实现分钟级聚类,吞吐量达每秒 2800 节点以上,且在 LFR 合成基准上展现出尺度不变的高精度。
链接: https://arxiv.org/abs/2602.22446
作者: Emilio Ferrara
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Community detection in attributed networks faces a fundamental divide: topological algorithms ignore semantic features, while Graph Neural Networks (GNNs) encounter devastating computational bottlenecks. Specifically, GNNs suffer from a Semantic Wall of feature over smoothing in dense or heterophilic networks, and a Systems Wall driven by the O(N^2) memory constraints of pairwise clustering. To dismantle these barriers, we introduce ECHO (Encoding Communities via High order Operators), a scalable, self supervised architecture that reframes community detection as an adaptive, multi scale diffusion process. ECHO features a Topology Aware Router that automatically analyzes structural heuristics sparsity, density, and assortativity to route graphs through the optimal inductive bias, preventing heterophilic poisoning while ensuring semantic densification. Coupled with a memory sharded full batch contrastive objective and a novel chunked O(N \cdot K) similarity extraction method, ECHO completely bypasses traditional O(N^2) memory bottlenecks without sacrificing the mathematical precision of global gradients. Extensive evaluations demonstrate that this topology feature synergy consistently overcomes the classical resolution limit. On synthetic LFR benchmarks scaled up to 1 million nodes, ECHO achieves scale invariant accuracy despite severe topological noise. Furthermore, on massive real world social networks with over 1.6 million nodes and 30 million edges, it completes clustering in mere minutes with throughputs exceeding 2,800 nodes per second matching the speed of highly optimized purely topological baselines. The implementation utilizes a unified framework that automatically engages memory sharded optimization to support adoption across varying hardware constraints. GitHub Repository: this https URL
[AI-81] A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines
【速读】:该论文旨在解决当前基于代理的自动化机器学习(Agent-based AutoML)系统在评估过程中过度依赖最终任务性能指标的问题,忽略了对中间决策质量的结构化评估。现有方法无法有效识别因错误决策导致的潜在失败模式,从而限制了系统的可解释性、可靠性和可控性。解决方案的关键在于提出一个评估代理(Evaluation Agent, EA),作为无干扰的观察者,从四个维度——决策有效性、推理一致性、模型质量风险(超出准确率)、反事实决策影响——对AutoML代理的中间决策进行决策导向型评估。EA能够在不干预原系统执行的前提下,精准检测故障决策(F1分数达0.919),识别与最终结果无关的推理不一致,并量化决策对下游性能的影响(范围为-4.9%至+8.3%),从而揭示仅靠结果指标无法发现的系统失效机制。
链接: https://arxiv.org/abs/2602.22442
作者: Gaoyuan Du,Amit Ahlawat,Xiaoyang Liu,Jing Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages
Abstract:Agent-based AutoML systems rely on large language models to make complex, multi-stage decisions across data processing, model selection, and evaluation. However, existing evaluation practices remain outcome-centric, focusing primarily on final task performance. Through a review of prior work, we find that none of the surveyed agentic AutoML systems report structured, decision-level evaluation metrics intended for post-hoc assessment of intermediate decision quality. To address this limitation, we propose an Evaluation Agent (EA) that performs decision-centric assessment of AutoML agents without interfering with their execution. The EA is designed as an observer that evaluates intermediate decisions along four dimensions: decision validity, reasoning consistency, model quality risks beyond accuracy, and counterfactual decision impact. Across four proof-of-concept experiments, we demonstrate that the EA can (i) detect faulty decisions with an F1 score of 0.919, (ii) identify reasoning inconsistencies independent of final outcomes, and (iii) attribute downstream performance changes to agent decisions, revealing impacts ranging from -4.9% to +8.3% in final metrics. These results illustrate how decision-centric evaluation exposes failure modes that are invisible to outcome-only metrics. Our work reframes the evaluation of agentic AutoML systems from an outcome-based perspective to one that audits agent decisions, offering a foundation for reliable, interpretable, and governable autonomous ML systems.
[AI-82] From Bias to Balance: Fairness-Aware Paper Recommendation for Equitable Peer Review
【速读】:该论文旨在解决学术会议审稿过程中因作者人口统计学特征(如种族、国籍等)导致的系统性偏见问题,这种偏见会持续性地削弱代表性不足群体的参与机会。解决方案的关键在于提出Fair-PaperRec——一种基于多层感知机(MLP)的后审稿推荐模型,其核心创新是在损失函数中引入可微分的公平性正则项(fairness regularizer),该正则项针对交叉属性(intersectional attributes)进行建模,从而在保持整体选稿质量稳定的同时,对论文排序进行再调整以提升代表性不足群体的入选概率。实验表明,在合成数据和真实会议数据(SIGCHI、DIS、IUI)上的应用均验证了该方法能显著提升多样性(最高达42.03%),同时对整体效用影响极小(最多下降3.16%),证明了公平性正则化不仅是一种公平机制,也是一种在高度偏见环境中提升包容性的轻量级质量调节器。
链接: https://arxiv.org/abs/2602.22438
作者: Uttamasha Anjally Oyshi,Susan Gauch
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite frequent double-blind review, systemic biases related to author demographics still disadvantage underrepresented groups. We start from a simple hypothesis: if a post-review recommender is trained with an explicit fairness regularizer, it should increase inclusion without degrading quality. To test this, we introduce Fair-PaperRec, a Multi-Layer Perceptron (MLP) with a differentiable fairness loss over intersectional attributes (e.g., race, country) that re-ranks papers after double-blind review. We first probe the hypothesis on synthetic datasets spanning high, moderate, and near-fair biases. Across multiple randomized runs, these controlled studies map where increasing the fairness weight strengthens macro/micro diversity while keeping utility approximately stable, demonstrating robustness and adaptability under varying disparity levels. We then carry the hypothesis into the original setting, conference data from ACM Special Interest Group on Computer-Human Interaction (SIGCHI), Designing Interactive Systems (DIS), and Intelligent User Interfaces (IUI). In this real-world scenario, an appropriately tuned configuration of Fair-PaperRec achieves up to a 42.03% increase in underrepresented-group participation with at most a 3.16% change in overall utility relative to the historical selection. Taken together, the synthetic-to-original progression shows that fairness regularization can act as both an equity mechanism and a mild quality regularizer, especially in highly biased regimes. By first analyzing the behavior of the fairness parameters under controlled conditions and then validating them on real submissions, Fair-PaperRec offers a practical, equity-focused framework for post-review paper selection that preserves, and in some settings can even enhance, measured scholarly quality.
[AI-83] veScale-FSDP: Flexible and High-Performance FSDP at Scale
【速读】:该论文旨在解决当前全分片数据并行(Fully Sharded Data Parallel, FSDP)系统在支持结构感知训练方法(如块级量化训练)和非逐元素优化器(如Shampoo和Muon)时存在的局限性,以及通信与内存效率不足导致难以扩展至数万GPU的问题。其关键解决方案是提出veScale-FSDP,通过引入灵活的分片格式RaggedShard与结构感知规划算法相结合,实现了对FSDP所需高效数据布局的原生支持,从而在保持高灵活性的同时显著提升性能和可扩展性,相较现有FSDP系统实现5~66%的吞吐量提升和16~30%的内存占用降低,并能高效扩展至数万GPU规模。
链接: https://arxiv.org/abs/2602.22437
作者: Zezhou Wang,Youjie Li,Zhiqi Lin,Jiacheng Yang,Cong Xie,Guanyu Feng,Zheng Zhong,Ziyue Huang,Hongyu Zhu,Zhi Zhang,Yanghua Peng,Xin Liu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, featuring its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP’s fixed element- or row-wise sharding formats conflict with the block-structured computations. In addition, today’s implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports efficient data placement required by FSDP, empowering block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5~66% higher throughput and 16~30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
[AI-84] GetBatch: Distributed Multi-Object Retrieval for ML Data Loading
【速读】:该论文旨在解决机器学习训练流水线中因频繁发起独立的GET请求而导致的数据传输效率低下问题,尤其是在从分布式存储集群中读取小规模数据块时,每请求开销显著影响整体性能。其解决方案的关键在于引入一种名为GetBatch的新对象存储API,将批量数据获取提升为一级存储操作,通过单次确定性、容错的流式执行替代多个独立GET操作,从而大幅降低延迟并提升吞吐量。实验表明,该方案在小对象场景下可实现最高15倍的吞吐量提升,并在实际训练负载中将P95批次获取延迟减少2倍、P99单对象尾部延迟降低3.7倍。
链接: https://arxiv.org/abs/2602.22434
作者: Alex Aizman,Abhishek Gaikwad,Piotr Żelasko
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注: 11 pages, 3 figures, 2 tables. Preprint
Abstract:Machine learning training pipelines consume data in batches. A single training step may require thousands of samples drawn from shards distributed across a storage cluster. Issuing thousands of individual GET requests incurs per-request overhead that often dominates data transfer time. To solve this problem, we introduce GetBatch - a new object store API that elevates batch retrieval to a first-class storage operation, replacing independent GET operations with a single deterministic, fault-tolerant streaming execution. GetBatch achieves up to 15x throughput improvement for small objects and, in a production training workload, reduces P95 batch retrieval latency by 2x and P99 per-object tail latency by 3.7x compared to individual GET requests.
[AI-85] Calibrated Test-Time Guidance for Bayesian Inference
【速读】:该论文旨在解决现有测试时引导(test-time guidance)方法在使用预训练扩散模型时,因仅追求奖励最大化而非从真实的贝叶斯后验分布中采样而导致的推理校准不足问题。其关键解决方案是识别出导致后验估计失真的结构近似,并提出了一种一致的替代估计器,从而实现从贝叶斯后验分布中的校准采样,在一系列贝叶斯推理任务中显著优于此前方法,并在黑洞图像重建任务中达到最先进水平。
链接: https://arxiv.org/abs/2602.22428
作者: Daniel Geyfman,Felix Draxler,Jan Groeneveld,Hyunsoo Lee,Theofanis Karaletsos,Stephan Mandt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint. Under review
Abstract:Test-time guidance is a widely used mechanism for steering pretrained diffusion models toward outcomes specified by a reward function. Existing approaches, however, focus on maximizing reward rather than sampling from the true Bayesian posterior, leading to miscalibrated inference. In this work, we show that common test-time guidance methods do not recover the correct posterior distribution and identify the structural approximations responsible for this failure. We then propose consistent alternative estimators that enable calibrated sampling from the Bayesian posterior. We significantly outperform previous methods on a set of Bayesian inference tasks, and match state-of-the-art in black hole image reconstruction.
[AI-86] HubScan: Detecting Hubness Poisoning in Retrieval-Augmented Generation Systems
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中存在的hubness安全缺陷问题,即某些文档或数据项在大量不同查询中频繁出现在Top-k检索结果中,形成“枢纽”(hubs),这些枢纽可被恶意利用以注入有害内容、操纵搜索排名、绕过内容过滤并降低系统性能。解决方案的关键在于提出一个名为hubscan的开源安全扫描工具,其核心创新包括:(1) 基于中位数与平均绝对偏差(Median/MAD-based z-scores)的鲁棒统计检测方法;(2) 通过聚类扩散分析评估跨簇检索模式;(3) 在查询扰动下的稳定性测试;(4) 针对特定领域和跨模态攻击的感知检测机制。该框架兼容多种向量数据库(如FAISS、Pinecone等),支持多种检索策略,并在多个基准数据集上验证了其高召回率(90% at 0.2% alert budget)和对对抗性hub的强识别能力,为生产环境中RAG系统的安全监控提供了实用且可扩展的解决方案。
链接: https://arxiv.org/abs/2602.22427
作者: Idan Habler,Vineeth Sai Narajala,Stav Koren,Amy Chang,Tiffany Saade
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, 2 tables, Github: this https URL
Abstract:Retrieval-Augmented Generation (RAG) systems are essential to contemporary AI applications, allowing large language models to obtain external knowledge via vector similarity search. Nevertheless, these systems encounter a significant security flaw: hubness - items that frequently appear in the top-k retrieval results for a disproportionately high number of varied queries. These hubs can be exploited to introduce harmful content, alter search rankings, bypass content filtering, and decrease system performance. We introduce hubscan, an open-source security scanner that evaluates vector indices and embeddings to identify hubs in RAG systems. Hubscan presents a multi-detector architecture that integrates: (1) robust statistical hubness detection utilizing median/MAD-based z-scores, (2) cluster spread analysis to assess cross-cluster retrieval patterns, (3) stability testing under query perturbations, and (4) domain-aware and modality-aware detection for category-specific and cross-modal attacks. Our solution accommodates several vector databases (FAISS, Pinecone, Qdrant, Weaviate) and offers versatile retrieval techniques, including vector similarity, hybrid search, and lexical matching with reranking capabilities. We evaluate hubscan on Food-101, MS-COCO, and FiQA adversarial hubness benchmarks constructed using state-of-the-art gradient-optimized and centroid-based hub generation methods. hubscan achieves 90% recall at a 0.2% alert budget and 100% recall at 0.4%, with adversarial hubs ranking above the 99.8th percentile. Domain-scoped scanning recovers 100% of targeted attacks that evade global detection. Production validation on 1M real web documents from MS MARCO demonstrates significant score separation between clean documents and adversarial content. Our work provides a practical, extensible framework for detecting hubness threats in production RAG systems. Comments: 11 pages, 5 figures, 2 tables, Github: this https URL Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.22427 [cs.CR] (or arXiv:2602.22427v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2602.22427 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-87] ArchAgent : Agent ic AI-driven Computer Architecture Discovery
【速读】:该论文旨在解决传统计算机体系结构设计流程效率低、依赖人工经验且难以快速迭代优化的问题,特别是在缓存替换策略(cache replacement policy)这一关键硬件组件上的自动化发现与优化难题。解决方案的关键在于提出 ArchAgent——一个基于 AlphaEvolve 构建的自主代理型生成式 AI 系统,能够无需人工干预地自动探索并生成具备全新机制逻辑的高性能缓存替换策略,而非仅调整参数。ArchAgent 在两天内即在多核 Google 工作负载上实现比现有最先进(SoTA)策略更高的 IPC 提速(+5.3%),并在单核 SPEC06 工作负载上以 18 天完成优化并取得 +0.9% 的提升,速度较以往人类研发的 SoTA 快 3–5 倍;此外,其还支持“后硅片超专业化”(post-silicon hyperspecialization),通过动态调优硬件可配置参数进一步适配特定工作负载混合,额外带来 +2.4% IPC 提速,验证了 agentic AI 流程在架构创新中的强大潜力与实用性。
链接: https://arxiv.org/abs/2602.22425
作者: Raghav Gupta,Akanksha Jain,Abraham Gonzalez,Alexander Novikov,Po-Sen Huang,Matej Balog,Marvin Eisenberger,Sergey Shirobokov,Ngân Vũ,Martin Dixon,Borivoje Nikolić,Parthasarathy Ranganathan,Sagar Karandikar
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 13 pages, 5 figures, 2 tables
Abstract:Agile hardware design flows are a critically needed force multiplier to meet the exploding demand for compute. Recently, agentic generative AI systems have demonstrated significant advances in algorithm design, improving code efficiency, and enabling discovery across scientific domains. Bridging these worlds, we present ArchAgent, an automated computer architecture discovery system built on AlphaEvolve. We show ArchAgent’s ability to automatically design/implement state-of-the-art (SoTA) cache replacement policies (architecting new mechanisms/logic, not only changing parameters), broadly within the confines of an established cache replacement policy design competition. In two days without human intervention, ArchAgent generated a policy achieving a 5.3% IPC speedup improvement over the prior SoTA on public multi-core Google Workload Traces. On the heavily-explored single-core SPEC06 workloads, it generated a policy in just 18 days showing a 0.9% IPC speedup improvement over the existing SoTA (a similar “winning margin” as reported by the existing SoTA). ArchAgent achieved these gains 3-5x faster than prior human-developed SoTA policies. Agentic flows also enable “post-silicon hyperspecialization” where agents tune runtime-configurable parameters exposed in hardware policies to further align the policies with a specific workload (mix). Exploiting this, we demonstrate a 2.4% IPC speedup improvement over prior SoTA on SPEC06 workloads. Finally, we outline broader implications for computer architecture research in the era of agentic AI. For example, we demonstrate the phenomenon of “simulator escapes”, where the agentic AI flow discovered and exploited a loophole in a popular microarchitectural simulator - a consequence of the fact that these research tools were designed for a (now past) world where they were exclusively operated by humans acting in good-faith. Comments: 13 pages, 5 figures, 2 tables Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR) ACMclasses: C.1; B.3.2; B.8.2; C.4 Cite as: arXiv:2602.22425 [cs.AI] (or arXiv:2602.22425v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.22425 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Raghav Gupta [view email] [v1] Wed, 25 Feb 2026 21:36:16 UTC (224 KB) Full-text links: Access Paper: View a PDF of the paper titled ArchAgent: Agentic AI-driven Computer Architecture Discovery, by Raghav Gupta and 12 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.AI prev | next new | recent | 2026-02 Change to browse by: cs cs.AR References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-88] Revisiting Chebyshev Polynomial and Anisotropic RBF Models for Tabular Regression
【速读】:该论文旨在解决当前表格式回归任务中光滑基模型(smooth-basis models)鲜少被采用的问题,尽管其连续可微的预测表面在代理优化、敏感性分析等场景中具有优势。作者提出三种改进的光滑模型:一种基于数据驱动中心定位与梯度优化宽度的各向异性径向基函数(RBF)网络、一个岭正则化的切比雪夫多项式回归器,以及一个切比雪夫模型树(smooth-tree hybrid)。关键创新在于将这些模型设计为与scikit-learn兼容,并通过系统基准测试验证其在55个不同应用领域的回归数据集上的性能表现,结果表明,在CPU可部署场景下,光滑模型与树集成模型在准确率上无显著差异,但前者通常表现出更小的泛化差距,因此建议在实际应用中将光滑基模型纳入候选池,尤其适用于下游任务需要紧密泛化能力和输入变化平滑响应的场景。
链接: https://arxiv.org/abs/2602.22422
作者: Luciano Gerber,Huw Lloyd
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 32 pages, 6 figures, 11 tables. Submitted to Information Sciences
Abstract:Smooth-basis models such as Chebyshev polynomial regressors and radial basis function (RBF) networks are well established in numerical analysis. Their continuously differentiable prediction surfaces suit surrogate optimisation, sensitivity analysis, and other settings where the response varies gradually with inputs. Despite these properties, smooth models seldom appear in tabular regression, where tree ensembles dominate. We ask whether they can compete, benchmarking models across 55 regression datasets organised by application domain. We develop an anisotropic RBF network with data-driven centre placement and gradient-based width optimisation, a ridge-regularised Chebyshev polynomial regressor, and a smooth-tree hybrid (Chebyshev model tree); all three are released as scikit-learn-compatible packages. We benchmark these against tree ensembles, a pre-trained transformer, and standard baselines, evaluating accuracy alongside generalisation behaviour. The transformer ranks first on accuracy across a majority of datasets, but its GPU dependence, inference latency, and dataset-size limits constrain deployment in the CPU-based settings common across applied science and industry. Among CPU-viable models, smooth models and tree ensembles are statistically tied on accuracy, but the former tend to exhibit tighter generalisation gaps. We recommend routinely including smooth-basis models in the candidate pool, particularly when downstream use benefits from tighter generalisation and gradually varying predictions. Comments: 32 pages, 6 figures, 11 tables. Submitted to Information Sciences Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.22422 [cs.LG] (or arXiv:2602.22422v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.22422 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-89] Epistemic Filtering and Collective Hallucination: A Jury Theorem for Confidence-Calibrated Agents
【速读】:该论文旨在解决传统群体决策理论(如康多塞陪审团定理,Condorcet Jury Theorem, CJT)中假设所有参与者固定投票的局限性问题,即在现实场景中,允许个体根据自身信心选择是否参与决策(即“我也不知道”)可显著提升整体准确性。其解决方案的关键在于提出一个概率框架,其中异质代理(agents)经历一个校准阶段(calibration phase),动态学习并估计自身的固定能力(competence),随后通过一个最终的置信阈值门控机制(confidence gate)决定是否投票或弃权。该框架推导出群体成功概率的非渐近下界,并证明这种选择性参与机制将CJT的渐近保证扩展至序贯、置信门控的设定,从而为生成式AI(Generative AI)中的幻觉(hallucinations)缓解提供理论依据和实践路径。
链接: https://arxiv.org/abs/2602.22413
作者: Jonas Karge
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We investigate the collective accuracy of heterogeneous agents who learn to estimate their own reliability over time and selectively abstain from voting. While classical epistemic voting results, such as the \textitCondorcet Jury Theorem (CJT), assume fixed participation, real-world aggregation often benefits from allowing agents to say ``I don’t know.‘’ We propose a probabilistic framework where agents engage in a \textitcalibration phase, updating beliefs about their own fixed competence, before facing a final confidence gate that determines whether to vote or abstain. We derive a non-asymptotic lower bound on the group’s success probability and prove that this \textitselective participation generalizes the asymptotic guarantees of the CJT to a sequential, confidence-gated setting. Empirically, we validate these bounds via Monte Carlo simulations. While our results are general, we discuss their potential application to AI safety, outlining how this framework can mitigate \textithallucinations in collective LLM decision-making.
[AI-90] Exploring Human Behavior During Abstract Rule Inference and Problem Solving with the Cognitive Abstraction and Reasoning Corpus
【速读】:该论文旨在探究人类在抽象推理中所采用的认知策略,特别是如何从少量示例中快速学习并应用规则来解决抽象视觉推理问题。为实现这一目标,研究者构建了认知抽象与推理语料库(Cognitive Abstraction and Reasoning Corpus, CogARC),这是对原始用于人工智能基准测试的抽象与推理语料库(Abstraction and Reasoning Corpus, ARC)的人类适应子集。解决方案的关键在于通过高时间分辨率记录参与者在75个抽象视觉推理任务中的行为数据(包括示例观看、编辑序列和多次提交),从而系统分析人类在规则推断、策略选择与错误修正过程中的动态机制。结果表明,尽管整体准确率较高(实验1约90%,实验2约80%),但不同问题和个体间表现差异显著,且困难问题引发更长的思考时间和更多样化的策略路径;值得注意的是,即使错误答案也常呈现高度收敛性,揭示了人类在不确定性下进行泛化、误泛化及策略调整的核心认知特性。
链接: https://arxiv.org/abs/2602.22408
作者: Caroline Ahn,Quan Do,Leah Bakst,Michael P. Pascale,Joseph T. McGuire,Michael E. Hasselmo,Chantal E. Stern
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Humans exhibit remarkable flexibility in abstract reasoning, and can rapidly learn and apply rules from sparse examples. To investigate the cognitive strategies underlying this ability, we introduce the Cognitive Abstraction and Reasoning Corpus (CogARC), a diverse human-adapted subset of the Abstraction and Reasoning Corpus (ARC) which was originally developed to benchmark abstract reasoning in artificial intelligence. Across two experiments, CogARC was administered to a total of 260 human participants who freely generated solutions to 75 abstract visual reasoning problems. Success required inferring input-output rules from a small number of examples to transform the test input into one correct test output. Participants’ behavior was recorded at high temporal resolution, including example viewing, edit sequences, and multi-attempt submissions. Participants were generally successful (mean accuracy ~90% for experiment 1 (n=40), ~80% for experiment 2 (n=220) across problems), but performance varied widely across problems and participants. Harder problems elicited longer deliberation times and greater divergence in solution strategies. Over the course of the task, participants initiated responses more quickly but showed a slight decline in accuracy, suggesting increased familiarity with the task structure rather than improved rule-learning ability. Importantly, even incorrect solutions were often highly convergent, even when the problem-solving trajectories differed in length and smoothness. Some trajectories progressed directly and efficiently toward a stable outcome, whereas others involved extended exploration or partial restarts before converging. Together, these findings highlight CogARC as a rich behavioral environment for studying human abstract reasoning, providing insight into how people generalize, misgeneralize, and adapt their strategies under uncertainty.
[AI-91] owards Autonomous Memory Agents
【速读】:该论文旨在解决当前记忆代理(memory agents)在大语言模型(LLM)中被动、反应式知识获取导致的记忆增长受限与不确定性下缺乏主动探索的问题。现有方法依赖于偶然可用的信息流,难以有效扩展高质量记忆库,尤其在面对未知或不确定情境时缺乏主动获取和验证外部知识的能力。解决方案的关键在于提出一种自主记忆代理(autonomous memory agent),其核心创新包括:(i) 一个成本感知的知识提取级联机制(cost-aware knowledge-extraction cascade),从低成本的自我/教师信号逐步升级到工具验证的研究内容,仅在必要时引入专家反馈;(ii) 基于语义感知的 Thompson 采样策略(semantic-aware Thompson sampling),以平衡记忆层面的探索与利用,并缓解冷启动偏差。该方案显著提升了多个基准测试上的性能表现,证明了其在低开销前提下的高效知识获取与记忆管理能力。
链接: https://arxiv.org/abs/2602.22406
作者: Xinle Wu,Rui Zhang,Mustafa Anis Hussain,Yao Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent memory agents improve LLMs by extracting experiences and conversation history into an external storage. This enables low-overhead context assembly and online memory update without expensive LLM training. However, existing solutions remain passive and reactive; memory growth is bounded by information that happens to be available, while memory agents seldom seek external inputs in uncertainties. We propose autonomous memory agents that actively acquire, validate, and curate knowledge at a minimum cost. U-Mem materializes this idea via (i) a cost-aware knowledge-extraction cascade that escalates from cheap self/teacher signals to tool-verified research and, only when needed, expert feedback, and (ii) semantic-aware Thompson sampling to balance exploration and exploitation over memories and mitigate cold-start bias. On both verifiable and non-verifiable benchmarks, U-Mem consistently beats prior memory baselines and can surpass RL-based optimization, improving HotpotQA (Qwen2.5-7B) by 14.6 points and AIME25 (Gemini-2.5-flash) by 7.33 points.
[AI-92] EyeLayer: Integrating Human Attention Patterns into LLM -Based Code Summarization
【速读】:该论文旨在解决当前基于大语言模型(LLMs)的代码摘要生成任务中,如何进一步提升模型对代码语义理解能力的问题。尽管现有模型已取得显著进展,但其注意力机制仍主要依赖于数据驱动的模式学习,缺乏对人类专家在阅读代码时注意力分布的建模。为此,作者提出EyeLayer——一种轻量级注意力增强模块,其核心创新在于将人类眼动轨迹(eye-gaze patterns)作为人类专业知识的代理信号引入模型。该模块通过多模态高斯混合模型(Multimodal Gaussian Mixture)建模开发者在阅读代码时的关注区域和强度,动态调整token嵌入表示,从而注入可泛化的注意力先验。这种设计无需改变原有模型结构即可实现与LLMs的无缝集成,并在多个主流模型(如LLaMA-3.2、Qwen3和CodeBERT)上一致取得性能提升,最高BLEU-4指标提升达13.17%,验证了人类注视模式能够提供互补的注意力线索并有效迁移至不同架构的代码摘要任务中。
链接: https://arxiv.org/abs/2602.22368
作者: Jiahao Zhang,Yifan Zhang,Kevin Leach,Yu Huang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026), April 12-13, 2026, Rio de Janeiro, Brazil
Abstract:Code summarization is the task of generating natural language descriptions of source code, which is critical for software comprehension and maintenance. While large language models (LLMs) have achieved remarkable progress on this task, an open question remains: can human expertise in code understanding further guide and enhance these models? We propose EyeLayer, a lightweight attention-augmentation module that incorporates human eye-gaze patterns, as a proxy of human expertise, into LLM-based code summarization. EyeLayer models human attention during code reading via a Multimodal Gaussian Mixture, redistributing token embeddings based on learned parameters (\mu_i, \sigma_i^2) that capture where and how intensively developers focus. This design enables learning generalizable attention priors from eye-tracking data and incorporating them into LLMs seamlessly, without disturbing existing representations. We evaluate EyeLayer across diverse model families (i.e., LLaMA-3.2, Qwen3, and CodeBERT) covering different scales and architectures. EyeLayer consistently outperforms strong fine-tuning baselines across standard metrics, achieving gains of up to 13.17% on BLEU-4. These results demonstrate that human gaze patterns encode complementary attention signals that enhance the semantic focus of LLMs and transfer effectively across diverse models for code summarization.
[AI-93] Learning geometry-dependent lead-field operators for forward ECG modeling
【速读】:该论文旨在解决当前前向心电图(ECG)计算模型中躯干建模的高精度与低数据需求及计算效率难以兼顾的问题。具体而言,现有方法在临床实践中受限于成像协议通常仅聚焦心脏而缺乏完整躯干信息,且基于“导联场(lead-field)”的方法计算成本随电极数量线性增长,限制了其在高密度记录场景中的应用。解决方案的关键在于提出一种形状感知的代理模型(shape-informed surrogate model),通过两个核心组件实现:一是几何编码模块,将解剖形态映射至低维潜在空间;二是几何条件神经代理模型,从空间坐标、电极位置和潜在代码中预测导联场梯度。该方法在保持极低推理开销的同时,实现了对导联场的高精度逼近(躯干内平均角度误差5°,心腔内误差亦显著优于传统伪导联场近似),且无需完整的躯干分割即可完成高保真ECG模拟,从而在数据受限条件下仍可部署。
链接: https://arxiv.org/abs/2602.22367
作者: Arsenii Dokuchaev,Francesca Bonizzoni,Stefano Pagani,Francesco Regazzoni,Simone Pezzuto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Tissues and Organs (q-bio.TO)
备注: 20 pages, 9 figures
Abstract:Modern forward electrocardiogram (ECG) computational models rely on an accurate representation of the torso domain. The lead-field method enables fast ECG simulations while preserving full geometric fidelity. Achieving high anatomical accuracy in torso representation is, however, challenging in clinical practice, as imaging protocols are typically focused on the heart and often do not include the entire torso. In addition, the computational cost of the lead-field method scales linearly with the number of electrodes, limiting its applicability in high-density recording settings. To date, no existing approach simultaneously achieves high anatomical fidelity, low data requirements and computational efficiency. In this work, we propose a shape-informed surrogate model of the lead-field operator that serves as a drop-in replacement for the full-order model in forward ECG simulations. The proposed framework consists of two components: a geometry-encoding module that maps anatomical shapes into a low-dimensional latent space, and a geometry-conditioned neural surrogate that predicts lead-field gradients from spatial coordinates, electrode positions and latent codes. The proposed method achieves high accuracy in approximating lead fields both within the torso (mean angular error 5°) and inside the heart, resulting in highly accurate ECG simulations (relative mean squared error 2.5%. The surrogate consistently outperforms the widely used pseudo lead-field approximation while preserving negligible inference cost. Owing to its compact latent representation, the method does not require a fully detailed torso segmentation and can therefore be deployed in data-limited settings while preserving high-fidelity ECG simulations.
[AI-94] GRAU: Generic Reconfigurable Activation Unit Design for Neural Network Hardware Accelerators
【速读】:该论文旨在解决高精度神经网络在边缘加速器中因传统多阈值激活函数硬件实现导致的硬件开销急剧增长的问题。经典多阈值激活硬件需为n位输出配置2^n个阈值,随着精度提升,存储单元(LUT)消耗显著增加,限制了硬件效率与可扩展性。解决方案的关键在于提出一种基于分段线性拟合的可重构激活硬件GRAU,其通过将分段斜率近似为2的幂次,仅使用基本比较器和1位右移器即可实现,从而大幅降低LUT占用(减少超90%),同时支持混合精度量化及非线性函数(如SiLU),在保持灵活性和可扩展性的前提下显著提升了硬件效率。
链接: https://arxiv.org/abs/2602.22352
作者: Yuhao Liu,Salim Ullah,Akash Kumar
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:With the continuous growth of neural network scales, low-precision quantization is widely used in edge accelerators. Classic multi-threshold activation hardware requires 2^n thresholds for n-bit outputs, causing a rapid increase in hardware cost as precision increases. We propose a reconfigurable activation hardware, GRAU, based on piecewise linear fitting, where the segment slopes are approximated by powers of two. Our design requires only basic comparators and 1-bit right shifters, supporting mixed-precision quantization and nonlinear functions such as SiLU. Compared with multi-threshold activators, GRAU reduces LUT consumption by over 90%, achieving higher hardware efficiency, flexibility, and scalability.
[AI-95] Structure and Redundancy in Large Language Models : A Spectral Study via Random Matrix Theory
【速读】:该论文旨在解决现代深度学习中两个长期存在的问题:模型可靠性(如幻觉和分布外行为)与计算效率(如资源消耗高、能耗大)。其核心解决方案基于谱几何(Spectral Geometry)与随机矩阵理论(Random Matrix Theory, RMT),构建了一个统一框架。关键在于通过分析隐藏层激活的特征值动态,将谱统计特性作为模型行为的紧凑、稳定且可解释的表征工具,从而区分结构化因果表示与噪声主导的变异。具体而言,EigenTrack 利用谱描述符(如熵、方差及偏离 Marchenko-Pastur 基线的程度)结合轻量级循环分类器实现对幻觉和分布外输入的实时检测;RMT-KD 则通过识别激活谱中的异常特征值作为任务相关信号,采用迭代自蒸馏方式将网络投影至低维子空间,显著压缩模型规模并提升能效,同时保持精度和硬件友好的密集结构。
链接: https://arxiv.org/abs/2602.22345
作者: Davide Ettori
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Executive Summary of Master Thesis in Computer Science Engineering, Politecnico di Milano
Abstract:This thesis addresses two persistent and closely related challenges in modern deep learning, reliability and efficiency, through a unified framework grounded in Spectral Geometry and Random Matrix Theory (RMT). As deep networks and large language models continue to scale, their internal behavior becomes increasingly opaque, leading to hallucinations, fragile generalization under distribution shift, and growing computational and energy demands. By analyzing the eigenvalue dynamics of hidden activations across layers and inputs, this work shows that spectral statistics provide a compact, stable, and interpretable lens on model behavior, capable of separating structured, causal representations from noise-dominated variability. Within this framework, the first contribution, EigenTrack, introduces a real-time method for detecting hallucinations and out-of-distribution behavior in large language and vision-language models. EigenTrack transforms streaming activations into spectral descriptors such as entropy, variance, and deviations from the Marchenko-Pastur baseline, and models their temporal evolution using lightweight recurrent classifiers, enabling early detection of reliability failures before they appear in model outputs while offering interpretable insight into representation dynamics. The second contribution, RMT-KD, presents a principled approach to compressing deep networks via random matrix theoretic knowledge distillation. By interpreting outlier eigenvalues in activation spectra as carriers of task-relevant information, RMT-KD progressively projects networks onto lower-dimensional subspaces through iterative self-distillation, yielding significantly more compact and energy-efficient models while preserving accuracy and dense, hardware-friendly structure.
[AI-96] A 1/R Law for Kurtosis Contrast in Balanced Mixtures
【速读】:该论文旨在解决基于峰度的独立成分分析(Kurtosis-based Independent Component Analysis, ICA)在宽且平衡混合信号中性能下降的问题。其核心挑战在于,当混合源的数量 $ R $ 较大时,标准投影的超额峰度 $ \kappa(y) $ 会随有效宽度 $ R_{\mathrm{eff}} $ 衰减至 $ O(\kappa_{\max}/R_{\mathrm{eff}}) ,导致对比度(contrast)显著减弱,从而限制了ICA的分离能力。解决方案的关键在于提出一种“净化”(purification)策略:通过选择少量( m \ll R $)符号一致(sign-consistent)的源信号,可恢复与 $ R $ 无关的对比度 $ \Omega(1/m) $,并提供一个简单易行的数据驱动启发式方法实现该策略。理论分析与合成实验共同验证了峰度衰减规律、样本估计精度的 $ \sqrt{T} $ 截断阈值以及净化后对比度的恢复效果。
链接: https://arxiv.org/abs/2602.22334
作者: Yuda Bi,Wenjun Xiao,Linhao Bai,Vince D Calhoun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Kurtosis-based Independent Component Analysis (ICA) weakens in wide, balanced mixtures. We prove a sharp redundancy law: for a standardized projection with effective width R_\mathrmeff (participation ratio), the population excess kurtosis obeys |\kappa(y)|=O(\kappa_\max/R_\mathrmeff) , yielding the order-tight O(c_b\kappa_\max/R) under balance (typically c_b=O(\log R) ). As an impossibility screen, under standard finite-moment conditions for sample kurtosis estimation, surpassing the O(1/\sqrtT) estimation scale requires R\lesssim \kappa_\max\sqrtT . We also show that \emphpurification – selecting m!\ll!R sign-consistent sources – restores R -independent contrast \Omega(1/m) , with a simple data-driven heuristic. Synthetic experiments validate the predicted decay, the \sqrtT crossover, and contrast recovery.
[AI-97] raining Agents to Self-Report Misbehavior
【速读】:该论文旨在解决前沿人工智能代理(Frontier AI Agents)在未被察觉的情况下追求隐藏目标(hidden goals)并实施隐蔽行为(covert misbehavior)所带来的对齐风险(misalignment risk)。传统对齐训练(alignment training)虽能强化正确目标,但可能失效且引发副作用。其解决方案的关键在于提出“自 incrimination 训练”(self-incrimination training),即训练代理在出现隐蔽违规行为时主动触发一个可见信号(如调用 report_scheming() 工具),从而将隐性威胁显性化。该方法显著降低了在分布外环境中的未被检测到的成功攻击率,优于匹配能力的监控基线和对齐基准,同时保持指令层级结构完整,并对通用能力影响极小。与黑盒监控不同,其性能不依赖外部行为的可疑程度,且在对抗提示优化下仍具鲁棒性,还能泛化至代理自主追求错误目标的场景,为缓解前沿AI的隐蔽风险提供了一种无需假设可完全预防或可靠外部识别违规行为的新路径。
链接: https://arxiv.org/abs/2602.22303
作者: Bruce W. Lee,Chen Yueh-Han,Tomek Korbak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior by reinforcing the correct goals, but alignment may not always succeed and can lead to unwanted side effects. We propose self-incrimination training, which instead trains agents to produce a visible signal when they covertly misbehave. We train GPT-4.1 and Gemini-2.0 agents to call a report_scheming() tool when behaving deceptively and measure their ability to cause harm undetected in out-of-distribution environments. Self-incrimination significantly reduces the undetected successful attack rate, outperforming matched-capability monitors and alignment baselines while preserving instruction hierarchy and incurring minimal safety tax on general capabilities. Unlike blackbox monitoring, self-incrimination performance is consistent across tasks regardless of how suspicious the misbehavior appears externally. The trained behavior persists under adversarial prompt optimization and generalizes to settings where agents pursue misaligned goals themselves rather than being instructed to misbehave. Our results suggest self-incrimination offers a viable path for reducing frontier misalignment risk, one that neither assumes misbehavior can be prevented nor that it can be reliably classified from the outside.
[AI-98] AviaSafe: A Physics-Informed Data-Driven Model for Aviation Safety-Critical Cloud Forecasts
【速读】:该论文旨在解决当前人工智能(AI)天气预报模型无法区分对航空安全至关重要的云微物理物种的问题。传统模型仅能预测常规大气变量,而无法分辨冰晶、过冷水滴等关键水成物(hydrometeor species),这限制了其在航空风险评估中的应用。解决方案的关键在于提出AviaSafe——一种分层的、融合物理约束的神经网络预报器,通过引入航空气象学中的结冰条件(Icing Condition, IC)指数作为物理先验约束,识别超冷液态水活跃区域,并采用分步预测架构:首先利用掩码注意力机制捕捉云的空间分布,再在识别区域内量化各水成物物种浓度。该方法有效应对了云预测中极端稀疏性、分布不连续性和物种间复杂微物理相互作用等挑战,在7天预报时效下优于现有数值模式,为航路优化中区分冰/液态水以评估发动机结冰风险提供了新工具。
链接: https://arxiv.org/abs/2602.22298
作者: Zijian Zhu,Qiusheng Huang,Anboyu Guo,Xiaohui Zhong,Hao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Current AI weather forecasting models predict conventional atmospheric variables but cannot distinguish between cloud microphysical species critical for aviation safety. We introduce AviaSafe, a hierarchical, physics-informed neural forecaster that produces global, six-hourly predictions of these four hydrometeor species for lead times up to 7 days. Our approach addresses the unique challenges of cloud prediction: extreme sparsity, discontinuous distributions, and complex microphysical interactions between species. We integrate the Icing Condition (IC) index from aviation meteorology as a physics-based constraint that identifies regions where supercooled water fuels explosive ice crystal growth. The model employs a hierarchical architecture that first predicts cloud spatial distribution through masked attention, then quantifies species concentrations within identified regions. Training on ERA5 reanalysis data, our model achieves lower RMSE for cloud species compared to baseline and outperforms operational numerical models on certain key variables at 7-day lead times. The ability to forecast individual cloud species enables new applications in aviation route optimization where distinguishing between ice and liquid water determines engine icing risk.
[AI-99] Learning Rewards Not Labels: Adversarial Inverse Reinforcement Learning for Machinery Fault Detection AAMAS2026
【速读】:该论文旨在解决现有基于强化学习(Reinforcement Learning, RL)的机械故障检测(Machinery Fault Detection, MFD)方法未能充分利用RL在序列决策中的优势,通常将其简化为上下文Bandits问题,从而限制了对设备运行时序特征的有效建模。其解决方案的关键在于将MFD建模为离线逆强化学习(Offline Inverse Reinforcement Learning)问题,通过从健康运行序列中直接学习奖励函数,避免了人工设计奖励信号和故障标签的需求;具体实现上采用对抗式逆强化学习(Adversarial Inverse Reinforcement Learning),训练一个判别器来区分正常(专家)与策略生成的状态转移,该判别器所学得的奖励作为异常评分,有效量化偏离正常行为的程度,从而实现早期且鲁棒的故障检测。
链接: https://arxiv.org/abs/2602.22297
作者: Dhiraj Neupane,Richard Dazeley,Mohamed Reda Bouadjenek,Sunil Aryal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This article is accepted to be published in AAMAS2026. The doi is listed below but the production is on the way as of now (26/02/2026)
Abstract:Reinforcement learning (RL) offers significant promise for machinery fault detection (MFD). However, most existing RL-based MFD approaches do not fully exploit RL’s sequential decision-making strengths, often treating MFD as a simple guessing game (Contextual Bandits). To bridge this gap, we formulate MFD as an offline inverse reinforcement learning problem, where the agent learns the reward dynamics directly from healthy operational sequences, thereby bypassing the need for manual reward engineering and fault labels. Our framework employs Adversarial Inverse Reinforcement Learning to train a discriminator that distinguishes between normal (expert) and policy-generated transitions. The discriminator’s learned reward serves as an anomaly score, indicating deviations from normal operating behaviour. When evaluated on three run-to-failure benchmark datasets (HUMS2023, IMS, and XJTU-SY), the model consistently assigns low anomaly scores to normal samples and high scores to faulty ones, enabling early and robust fault detection. By aligning RL’s sequential reasoning with MFD’s temporal structure, this work opens a path toward RL-based diagnostics in data-driven industrial settings.
[AI-100] UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLM s
【速读】:该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练大语言模型(Large Language Models, LLMs)时,因优化单次尝试准确率(pass@1)而导致响应多样性下降的问题,进而限制了策略探索并忽视了低频但有效的解题策略。解决方案的关键在于提出UpSkill方法,其核心是将互信息技能学习(Mutual Information Skill Learning, MISL)适配至LLMs,并引入一种token级互信息(Mutual Information, MI)奖励机制,嵌入到Group Relative Policy Optimization(GRPO)框架中,以鼓励轨迹对特定解题路径(z)的特异性。实验证明,该方法在GSM8K基准上显著提升多尝试准确率(pass@k),同时保持pass@1性能不变,且理论与实证均表明pass@k的改进与互信息目标紧密相关。
链接: https://arxiv.org/abs/2602.22296
作者: Devan Shah,Owen Yang,Daniel Yang,Chongyi Zheng,Benjamin Eysenbach
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: First two authors equal contribution. 29 pages total (11 pages main text), 10 figures, 10 tables. Project website: this https URL
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training time method that adapts Mutual Information Skill Learning (MISL) to LLMs for optimizing pass@k correctness. We propose a novel reward that we implement within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages trajectory specificity to z. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
[AI-101] Manifold of Failure: Behavioral Attraction Basins in Language Models
【速读】:该论文旨在解决当前AI安全研究中对大型语言模型(Large Language Models, LLMs)不安全区域缺乏系统性理解的问题,即现有方法多聚焦于将对抗样本投影回自然数据流形以恢复安全性,而忽视了对失效区域本身拓扑结构的刻画。其解决方案的关键在于提出一种基于质量多样性(Quality Diversity)的框架——利用MAP-Elites算法系统映射LLMs的“失败流形”(Manifold of Failure),并通过引入Alignment Deviation作为质量度量指标,引导搜索过程识别出与模型预期对齐行为偏离最大的行为吸引盆地(behavioral attraction basins)。该方法不仅实现了高达63%的行为覆盖和发现多达370个独特的漏洞生态位,还揭示了不同模型间显著差异的拓扑特征,从而从寻找离散故障转向理解其结构性根源,这是现有攻击方法(如GCG、PAIR或TAP)无法实现的全局可解释性安全地图。
链接: https://arxiv.org/abs/2602.22291
作者: Sarthak Munshi,Manish Bhatt,Vineeth Sai Narajala,Idan Habler,AmmarnAl-Kahfah,Ken Huang,Blake Gatto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model’s behavior diverges most from its intended alignment. Across three LLMs: Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini, we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model’s safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
[AI-102] Multi-Level Causal Embeddings
【速读】:该论文旨在解决多源异构因果模型的融合问题,即如何将多个具有不同表示形式的详细因果模型映射到一个更粗粒度的因果模型中,同时保持因果关系的一致性。其解决方案的关键在于提出**因果嵌入(causal embeddings)**这一概念,将其视为抽象(abstraction)的推广,并引入广义一致性(generalized consistency)的概念来确保嵌入过程中的因果结构保真;此外,通过定义多分辨率边缘问题(multi-resolution marginal problem),该框架在统计边缘问题与因果边缘问题之间建立了联系,并展示了其在合并来自不同模型的数据集时的实际应用价值。
链接: https://arxiv.org/abs/2602.22287
作者: Willem Schooltink,Fabio Massimo Zennaro
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Abstractions of causal models allow for the coarsening of models such that relations of cause and effect are preserved. Whereas abstractions focus on the relation between two models, in this paper we study a framework for causal embeddings which enable multiple detailed models to be mapped into sub-systems of a coarser causal model. We define causal embeddings as a generalization of abstraction, and present a generalized notion of consistency. By defining a multi-resolution marginal problem, we showcase the relevance of causal embeddings for both the statistical marginal problem and the causal marginal problem; furthermore, we illustrate its practical use in merging datasets coming from models with different representations.
[AI-103] Early Risk Stratification of Dosing Errors in Clinical Trials Using Machine Learning
【速读】:该论文旨在解决临床试验(Clinical Trials, CTs)在启动前难以预测其高剂量错误发生风险的问题,从而支持早期风险分层与主动的质量管理。解决方案的关键在于构建一个基于机器学习(Machine Learning, ML)的框架,融合结构化数据(如试验设计信息)与非结构化文本数据(如试验方案自由文本),通过XGBoost模型和ClinicalModernBERT模型分别提取特征,并采用简单的晚期融合策略整合两种模态的信息,最终利用后验概率校准技术将模型输出转化为可解释的试验级风险类别,实现对高剂量错误风险的准确识别与分级管理。
链接: https://arxiv.org/abs/2602.22285
作者: Félicien Hêche,Sohrab Ferdowsi,Anthony Yazdani,Sara Sansaloni-Pastor,Douglas Teodoro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Objective: The objective of this study is to develop a machine learning (ML)-based framework for early risk stratification of clinical trials (CTs) according to their likelihood of exhibiting a high rate of dosing errors, using information available prior to trial initiation. Materials and Methods: We constructed a dataset from this http URL comprising 42,112 CTs. Structured, semi-structured trial data, and unstructured protocol-related free-text data were extracted. CTs were assigned binary labels indicating elevated dosing error rate, derived from adverse event reports, MedDRA terminology, and Wilson confidence intervals. We evaluated an XGBoost model trained on structured features, a ClinicalModernBERT model using textual data, and a simple late-fusion model combining both modalities. Post-hoc probability calibration was applied to enable interpretable, trial-level risk stratification. Results: The late-fusion model achieved the highest AUC-ROC (0.862). Beyond discrimination, calibrated outputs enabled robust stratification of CTs into predefined risk categories. The proportion of trials labeled as having an excessively high dosing error rate increased monotonically across higher predicted risk groups and aligned with the corresponding predicted probability ranges. Discussion: These findings indicate that dosing error risk can be anticipated at the trial level using pre-initiation information. Probability calibration was essential for translating model outputs into reliable and interpretable risk categories, while simple multimodal integration yielded performance gains without requiring complex architectures. Conclusion: This study introduces a reproducible and scalable ML framework for early, trial-level risk stratification of CTs at risk of high dosing error rates, supporting proactive, risk-based quality management in clinical research.
[AI-104] Integrating Machine Learning Ensembles and Large Language Models for Heart Disease Prediction Using Voting Fusion
【速读】:该论文旨在解决心血管疾病(Cardiovascular Disease, CVD)早期识别与风险分层中模型性能不足的问题,尤其是在临床决策支持系统中如何提升预测准确性与可靠性。其关键解决方案是提出一种混合融合方法,将高性能的机器学习(Machine Learning, ML)集成模型(如XGBoost、LightGBM等)与大型语言模型(Large Language Models, LLMs)相结合,利用ML在结构化表格数据上的强表征能力与LLMs在零样本/少样本推理中的灵活性,通过OpenRouter API调用Gemini 2.5 Flash实现协同优化,在验证集上达到96.62%准确率和0.97 AUC,显著优于单独使用ML(95.78%准确率,0.96 AUC)或LLM(78.9%准确率)的方案,证明了ML-LLM混合架构在复杂医疗场景下具有增强不确定情境下判别能力的潜力。
链接: https://arxiv.org/abs/2602.22280
作者: Md. Tahsin Amin,Tanim Ahmmod,Zannatul Ferdus,Talukder Naemul Hasan Naem,Ehsanul Ferdous,Arpita Bhattacharjee,Ishmam Ahmed Solaiman,Nahiyan Bin Noor
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 8 figures, (Accepted at a peer-reviewed conference)
Abstract:Cardiovascular disease is the primary cause of death globally, necessitating early identification, precise risk classification, and dependable decision-support technologies. The advent of large language models (LLMs) provides new zero-shot and few-shot reasoning capabilities, even though machine learning (ML) algorithms, especially ensemble approaches like Random Forest, XGBoost, LightGBM, and CatBoost, are excellent at modeling complex, non-linear patient data and routinely beat logistic regression. This research predicts cardiovascular disease using a merged dataset of 1,190 patient records, comparing traditional machine learning models (95.78% accuracy, ROC-AUC 0.96) with open-source large language models via OpenRouter APIs. Finally, a hybrid fusion of the ML ensemble and LLM reasoning under Gemini 2.5 Flash achieved the best results (96.62% accuracy, 0.97 AUC), showing that LLMs (78.9 % accuracy) work best when combined with ML models rather than used alone. Results show that ML ensembles achieved the highest performance (95.78% accuracy, ROC-AUC 0.96), while LLMs performed moderately in zero-shot (78.9%) and slightly better in few-shot (72.6%) settings. The proposed hybrid method enhanced the strength in uncertain situations, illustrating that ensemble ML is considered the best structured tabular prediction case, but it can be integrated with hybrid ML-LLM systems to provide a minor increase and open the way to more reliable clinical decision-support tools.
[AI-105] Positional-aware Spatio-Temporal Network for Large-Scale Traffic Prediction
【速读】:该论文旨在解决交通流预测中因地理范围扩大和时间跨度延长导致的节点区分度不足与历史信息全局感知能力弱的问题,同时应对大规模数据带来的模型部署效率瓶颈。解决方案的关键在于提出一种轻量级的位置感知时空网络(Positional-aware Spatio-Temporal Network, PASTN),其核心创新包括:引入位置感知嵌入(positional-aware embeddings)以明确区分每个节点的表征,并设计时序注意力模块(temporal attention module)增强对长程依赖关系的建模能力,从而在端到端框架下高效捕捉时空复杂性。
链接: https://arxiv.org/abs/2602.22274
作者: Runfei Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for the 104th Transportation Research Board (TRB) Annual Meeting in 2025
Abstract:Traffic flow forecasting has emerged as an indispensable mission for daily life, which is required to utilize the spatiotemporal relationship between each location within a time period under a graph structure to predict future flow. However, the large travel demand for broader geographical areas and longer time spans requires models to distinguish each node clearly and possess a holistic view of the history, which has been paid less attention to in prior works. Furthermore, increasing sizes of data hinder the deployment of most models in real application environments. To this end, in this paper, we propose a lightweight Positional-aware Spatio-Temporal Network (PASTN) to effectively capture both temporal and spatial complexities in an end-to-end manner. PASTN introduces positional-aware embeddings to separate each node’s representation, while also utilizing a temporal attention module to improve the long-range perception of current models. Extensive experiments verify the effectiveness and efficiency of PASTN across datasets of various scales (county, megalopolis and state). Further analysis demonstrates the efficacy of newly introduced modules either.
[AI-106] FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在金融领域应用中缺乏系统性评估工具的问题,即如何同时衡量LLMs对金融理论知识的掌握程度与处理实际业务场景的能力。其解决方案的关键在于构建FIRE基准测试体系:一方面,通过收集来自权威金融资格考试的多样化试题,评估LLMs对金融理论的理解深度;另一方面,设计了一个结构化的评估矩阵,覆盖金融领域的核心子领域和关键业务活动,并据此构建包含3000个金融场景问题的数据集,其中包含带参考答案的封闭式决策题和基于预设评分标准的开放式问题,从而实现对LLMs在真实金融任务中实用价值的全面测评。
链接: https://arxiv.org/abs/2602.22273
作者: Xiyuan Zhang,Huihang Wu,Jiayu Guo,Zhenlin Zhang,Yiwei Zhang,Liangyu Huo,Xiaoxiao Ma,Jiansong Wan,Xuewei Jiao,Yi Jing,Jian Xie
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We introduce FIRE, a comprehensive benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practical business scenarios. For theoretical assessment, we curate a diverse set of examination questions drawn from widely recognized financial qualification exams, enabling evaluation of LLMs deep understanding and application of financial knowledge. In addition, to assess the practical value of LLMs in real-world financial tasks, we propose a systematic evaluation matrix that categorizes complex financial domains and ensures coverage of essential subdomains and business activities. Based on this evaluation matrix, we collect 3,000 financial scenario questions, consisting of closed-form decision questions with reference answers and open-ended questions evaluated by predefined rubrics. We conduct comprehensive evaluations of state-of-the-art LLMs on the FIRE benchmark, including XuanYuan 4.0, our latest financial-domain model, as a strong in-domain baseline. These results enable a systematic analysis of the capability boundaries of current LLMs in financial applications. We publicly release the benchmark questions and evaluation code to facilitate future research.
[AI-107] Poisoned Acoustics
【速读】:该论文旨在解决训练数据投毒攻击(training-data poisoning attacks)对深度神经网络造成的隐蔽性目标失效问题,即通过极小比例的标签篡改诱导模型在特定类别上发生误分类,同时不引起整体准确率的显著变化。其核心贡献在于揭示了此类攻击的结构性隐蔽机制:当目标类别在数据集中占比极低(如卡车类仅占约3%)时,即使进行完全针对性攻击,最大准确率下降幅度也被限制在少数类比例(β)之内,而这一数值低于训练过程中的噪声水平,从而证明基于整体准确率监控的防御手段在理论上无效,无论模型架构或攻击方法如何。解决方案的关键在于提出一种“最小信任”防御框架,融合内容感知的Artifact哈希、Merkle树数据集承诺机制与后量子数字签名(ML-DSA-65/CRYSTALS-Dilithium3),实现对训练数据来源和完整性的密码学可验证溯源,从源头上保障模型训练数据的可信性。
链接: https://arxiv.org/abs/2602.22258
作者: Harrison Dahme
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 5 Pages
Abstract:Training-data poisoning attacks can induce targeted, undetectable failure in deep neural networks by corrupting a vanishingly small fraction of training labels. We demonstrate this on acoustic vehicle classification using the MELAUDIS urban intersection dataset (approx. 9,600 audio clips, 6 classes): a compact 2-D convolutional neural network (CNN) trained on log-mel spectrograms achieves 95.7% Attack Success Rate (ASR) – the fraction of target-class test samples misclassified under the attack – on a Truck-to-Car label-flipping attack at just p=0.5% corruption (48 records), with zero detectable change in aggregate accuracy (87.6% baseline; 95% CI: 88-100%, n=3 seeds). We prove this stealth is structural: the maximum accuracy drop from a complete targeted attack is bounded above by the minority class fraction (beta). For real-world class imbalances (Truck approx. 3%), this bound falls below training-run noise, making aggregate accuracy monitoring provably insufficient regardless of architecture or attack method. A companion backdoor trigger attack reveals a novel trigger-dominance collapse: when the target class is a dataset minority, the spectrogram patch trigger becomes functionally redundant–clean ASR equals triggered ASR, and the attack degenerates to pure label flipping. We formalize the ML training pipeline as an attack surface and propose a trust-minimized defense combining content-addressed artifact hashing, Merkle-tree dataset commitment, and post-quantum digital signatures (ML-DSA-65/CRYSTALS-Dilithium3, NIST FIPS 204) for cryptographically verifiable data provenance.
[AI-108] Deep Sequence Modeling with Quantum Dynamics: Language as a Wave Function
【速读】:该论文旨在解决序列建模中传统循环神经网络(Recurrent Neural Network, RNN)依赖门控机制难以有效处理竞争性假设的问题。其核心挑战在于如何在保持状态演化可微分性和计算效率的同时,实现对多路径解释的自动选择与增强。解决方案的关键在于引入一个基于量子力学原理的框架:将隐状态建模为有限维希尔伯特空间上的复值波函数(wave function),并通过学习的时间依赖哈密顿量(Hamiltonian)驱动其演化。该机制利用量子干涉效应——通过调控复幅值相位,使冲突解释相互抵消、一致解释增强;同时保证动力学严格幺正(unitary),借助Cayley或Crank–Nicolson离散化方法精确维持状态范数不变。最终通过玻恩规则(Born rule)提取token概率,该非线性测量操作隐式地将N维状态映射至秩一厄米矩阵空间,从而捕获成对相位关联,实现了相较于实值正交模型的二次维度优势(Ω(N2))。
链接: https://arxiv.org/abs/2602.22255
作者: Ahmed Nebli,Hadi Saadatdoorabi,Kevin Yam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注:
Abstract:We introduce a sequence modeling framework in which the latent state is a complex-valued wave function evolving on a finite-dimensional Hilbert space under a learned, time-dependent Hamiltonian. Unlike standard recurrent architectures that rely on gating mechanisms to suppress competing hypotheses, our framework utilizes quantum interference: the Hamiltonian steers the phases of complex amplitudes so that conflicting interpretations cancel while compatible ones reinforce. The dynamics are strictly unitary, ensuring that the state norm is preserved exactly at every time step via a Cayley (Crank–Nicolson) discretization. Token probabilities are extracted using the Born rule, a quadratic measurement operator that couples magnitudes and relative phases. Our primary theoretical contribution is a separation theorem characterizing the representational advantage of this readout: we define a family of disambiguation tasks that a complex unitary model of dimension N solves exactly, but which requires a state dimension of \Omega(N^2) for any real-valued orthogonal model equipped with a standard affine-softmax readout. This quadratic gap arises because the Born rule implicitly lifts the N -dimensional state into the space of rank-one Hermitian matrices, accessing pairwise phase correlations that are inaccessible to linear projections. Finally, we derive a continuity equation for the latent probability mass, yielding conserved pairwise currents that serve as a built-in diagnostic for tracing information flow between dimensions.
[AI-109] Causal Direction from Convergence Time: Faster Training in the True Causal Direction
【速读】:该论文旨在解决因果方向识别(causal direction identification)问题,即在两个变量 X 和 Y 之间确定哪个是原因、哪个是结果。传统方法如RESIT、IGCI和SkewScore依赖于统计独立性或分布不对称性来推断因果关系,而本文提出一种基于优化动力学的全新原则——因果计算不对称性(Causal Computational Asymmetry, CCA)。其关键在于:当数据生成机制符合加性噪声模型 Y=f(X)+ε(其中 ε⊥X,且 f 非线性和单射)时,正向预测(X→Y)比反向预测(Y→X)具有更低的不可约损失下界和可分离梯度噪声,导致正向模型在期望意义上以更少的梯度步数收敛至固定损失阈值。因此,通过比较两个方向神经网络训练的收敛速度即可推断因果方向。这一方法在优化时间空间中运行,无需假设变量分布对称性,并要求对两变量进行适当z-score标准化以保证收敛速率比较的有效性。
链接: https://arxiv.org/abs/2602.22254
作者: Abdulrahman Tamim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Causal Computational Asymmetry (CCA), a principle for causal direction identification based on optimization dynamics in which one neural network is trained to predict Y from X and another to predict X from Y , and the direction that converges faster is inferred to be causal. Under the additive noise model Y = f(X) + \varepsilon with \varepsilon \perp X and f nonlinear and injective, we establish a formal asymmetry: in the reverse direction, residuals remain statistically dependent on the input regardless of approximation quality, inducing a strictly higher irreducible loss floor and non-separable gradient noise in the optimization dynamics, so that the reverse model requires strictly more gradient steps in expectation to reach any fixed loss threshold; consequently, the forward (causal) direction converges in fewer expected optimization steps. CCA operates in optimization-time space, distinguishing it from methods such as RESIT, IGCI, and SkewScore that rely on statistical independence or distributional asymmetries, and proper z-scoring of both variables is required for valid comparison of convergence rates. On synthetic benchmarks, CCA achieves 26/30 correct causal identifications across six neural architectures, including 30/30 on sine and exponential data-generating processes. We further embed CCA into a broader framework termed Causal Compression Learning (CCL), which integrates graph structure learning, causal information compression, and policy optimization, with all theoretical guarantees formally proved and empirically validated on synthetic datasets.
[AI-110] Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials
【速读】:该论文旨在解决当前人工智能方法在化学领域中普遍存在的局限性问题,即大多数现有模型仅针对单一领域(分子或材料)和单一任务(生成或预测),导致表示学习难以共享与迁移,从而限制了跨域泛化能力。其解决方案的关键在于提出Zatom-1——首个统一分子与材料三维结构的生成式与预测式基础模型(foundation model),该模型基于Transformer架构,并采用多模态流匹配(multimodal flow matching)目标联合建模离散原子类型与连续3D几何结构,实现了可扩展的预训练机制,且随着模型容量增加表现出可预测的性能提升;同时,通过联合生成预训练作为下游多任务预测(如性质、能量、力)的通用初始化策略,显著提升了预测精度并大幅降低生成推理时间(超过一个数量级),并通过实验证明了从材料预训练中获得的知识能正向迁移至分子属性预测。
链接: https://arxiv.org/abs/2602.22251
作者: Alex Morehead,Miruna Cretu,Antonia Panescu,Rishabh Anand,Maurice Weiler,Tynan Perez,Samuel Blau,Steven Farrell,Wahid Bhimji,Anubhav Jain,Hrushikesh Sahasrabuddhe,Pietro Lio,Tommi Jaakkola,Rafael Gomez-Bombarelli,Rex Ying,N. Benjamin Erichson,Michael W. Mahoney
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:
Abstract:General-purpose 3D chemical modeling encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generation or prediction), which limits representation sharing and transfer. We introduce Zatom-1, the first foundation model that unifies generative and predictive learning of 3D molecules and materials. Zatom-1 is a Transformer trained with a multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries. This approach supports scalable pretraining with predictable gains as model capacity increases, while enabling fast and stable sampling. We use joint generative pretraining as a universal initialization for downstream multi-task prediction of properties, energies, and forces. Empirically, Zatom-1 matches or outperforms specialized baselines on both generative and predictive benchmarks, while reducing the generative inference time by more than an order of magnitude. Our experiments demonstrate positive predictive transfer between chemical domains from joint generative pretraining: modeling materials during pretraining improves molecular property prediction accuracy.
[AI-111] Analysis of LLM s Against Prompt Injection and Jailbreak Attacks
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中面临的提示注入(prompt injection)与越狱(jailbreak)攻击风险问题,这些问题可能引发模型行为异常甚至安全泄露。研究通过构建大规模人工标注数据集,在多个开源LLM(如Phi、Mistral、Llama 3.2、Qwen等)上系统评估其脆弱性,发现不同模型对攻击响应存在显著差异,包括拒绝回答或完全沉默等由内部安全机制触发的行为。解决方案的关键在于提出轻量级、推理时运行的防御机制——这些机制作为无须重训练或GPU密集微调的过滤器,虽能缓解简单攻击,但对复杂、推理密集型提示仍易被绕过,揭示了当前防御策略在对抗高级攻击时的局限性。
链接: https://arxiv.org/abs/2602.22242
作者: Piyush Jaiswal,Aaditya Pratap,Shreyansh Saraswati,Harsh Kasyap,Somanath Tripathy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, 6 tables
Abstract:Large Language Models (LLMs) are widely deployed in real-world systems. Given their broader applicability, prompt engineering has become an efficient tool for resource-scarce organizations to adopt LLMs for their own purposes. At the same time, LLMs are vulnerable to prompt-based attacks. Thus, analyzing this risk has become a critical security requirement. This work evaluates prompt-injection and jailbreak vulnerability using a large, manually curated dataset across multiple open-source LLMs, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma variants. We observe significant behavioural variation across models, including refusal responses and complete silent non-responsiveness triggered by internal safety mechanisms. Furthermore, we evaluated several lightweight, inference-time defence mechanisms that operate as filters without any retraining or GPU-intensive fine-tuning. Although these defences mitigate straightforward attacks, they are consistently bypassed by long, reasoning-heavy prompts.
[AI-112] From Prompts to Performance: Evaluating LLM s for Task-based Parallel Code Generation
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLM)在生成高效并行程序方面能力不足的问题,尤其关注其在不同输入提示(自然语言描述、串行参考实现和并行伪代码)下生成任务驱动型并行代码的能力。解决方案的关键在于系统性地评估LLM在三种主流并行编程框架(OpenMP Tasking、C++标准并行库以及异步多任务运行时HPX)中的表现,通过定量分析其生成代码的正确性和可扩展性,从而揭示LLM在不同问题复杂度和框架抽象层级下的优势与局限,为未来基于LLM的高性能计算(High-Performance Computing, HPC)辅助开发提供实证依据与改进方向。
链接: https://arxiv.org/abs/2602.22240
作者: Linus Bantel,Moritz Strack,Alexander Strack,Dirk Pflüger
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 12 pages, 4 figures, 2 tables, Workshop on Asynchronous Many-Task Systems and Applications 2026
Abstract:Large Language Models (LLM) show strong abilities in code generation, but their skill in creating efficient parallel programs is less studied. This paper explores how LLMs generate task-based parallel code from three kinds of input prompts: natural language problem descriptions, sequential reference implementations, and parallel pseudo code. We focus on three programming frameworks: OpenMP Tasking, C++ standard parallelism, and the asynchronous many-task runtime HPX. Each framework offers different levels of abstraction and control for task execution. We evaluate LLM-generated solutions for correctness and scalability. Our results reveal both strengths and weaknesses of LLMs with regard to problem complexity and framework. Finally, we discuss what these findings mean for future LLM-assisted development in high-performance and scientific computing.
[AI-113] -SEAL: TTD-Aware Selective Encryption for Adversarially-Robust and Low-Latency Edge AI
【速读】:该论文旨在解决云边端协同场景下,边缘设备在资源受限条件下同时实现模型压缩与安全保护的难题。现有研究多基于密集权重假设进行选择性加密(selective encryption),而未考虑张量训练分解(Tensor-Train Decomposition, TTD)等高效压缩技术对加密策略的影响,导致其在TTD压缩模型上的实用性不明确。解决方案的关键在于提出TT-SEAL框架:通过敏感度驱动的重要性指标对TT核(TT cores)进行排序,设定一次性的鲁棒性阈值,并采用基于值差分隐私(value-DP)的优化器,仅加密最小数量的关键TT核,从而在保持与全参数黑盒加密相当的对抗鲁棒性的同时,显著降低加密参数比例(仅4.89–15.92%)并减少AES解密在端到端延迟中的占比(如ResNet-18从58%降至2.76%),实现了安全、低延迟的边缘AI部署。
链接: https://arxiv.org/abs/2602.22238
作者: Kyeongpil Min,Sangmin Jeon,Jae-Jin Lee,Woojoo Lee
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, 3 tables. This paper has been accepted at Design Automation Conference (DAC) 2026
Abstract:Cloud-edge AI must jointly satisfy model compression and security under tight device budgets. While Tensor-Train Decomposition (TTD) shrinks on-device models, prior selective-encryption studies largely assume dense weights, leaving its practicality under TTD compression unclear. We present TT-SEAL, a selective-encryption framework for TT-decomposed networks. TT-SEAL ranks TT cores with a sensitivity-based importance metric, calibrates a one-time robustness threshold, and uses a value-DP optimizer to encrypt the minimum set of critical cores with AES. Under TTD-aware, transfer-based threat models (and on an FPGA-prototyped edge processor) TT-SEAL matches the robustness of full (black-box) encryption while encrypting as little as 4.89-15.92% of parameters across ResNet-18, MobileNetV2, and VGG-16, and drives the share of AES decryption in end-to-end latency to low single digits (e.g., 58% - 2.76% on ResNet-18), enabling secure, low-latency edge AI.
[AI-114] Optimized Disaster Recovery for Distributed Storag e Systems: Lightweight Metadata Architectures to Overcome Cryptographic Hashing Bottleneck
【速读】:该论文旨在解决分布式存储架构中灾难恢复(Disaster Recovery, DR)流程的性能瓶颈问题,即传统基于内容的加密哈希(cryptographic hashing)在数据识别与同步中的低效性。当发生故障切换(failover)或故障回切(failback)时,由于哈希索引可能过期、不完整或需重建,导致必须进行全量或部分重哈希,显著延长恢复时间,影响恢复时间目标(Recovery Time Objective, RTO)的达成。解决方案的关键在于提出一种以元数据驱动的确定性标识机制:在数据块摄入时即分配全局唯一的复合标识符(composite identifier),该标识独立于内容分析,从而在灾难恢复过程中无需任何加密计算即可实现即时差异计算(delta computation),彻底消除哈希操作带来的延迟,提升DR效率与RTO合规性。
链接: https://arxiv.org/abs/2602.22237
作者: Prasanna Kumar,Nishank Soni,Gaurang Munje
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 8 pages, 7 Tables
Abstract:Distributed storage architectures are foundational to modern cloud-native infrastructure, yet a critical operational bottleneck persists within disaster recovery (DR) workflows: the dependence on content-based cryptographic hashing for data identification and synchronization. While hash-based deduplication is effective for storage efficiency in steady-state operation, it becomes a systemic liability during failover and failback events when hash indexes are stale, incomplete, or must be rebuilt following a crash. This paper precisely characterizes the operational conditions under which full or partial re-hashing becomes unavoidable. The paper also analyzes the downstream impact of cryptographic re-hashing on Recovery Time Objective (RTO) compliance, and proposes a generalized architectural shift toward deterministic, metadata-driven identification. The proposed framework assigns globally unique composite identifiers to data blocks at ingestion time-independent of content analysis enabling instantaneous delta computation during DR without any cryptographic overhead.
[AI-115] o Deceive is to Teach? Forging Perceptual Robustness via Adversarial Reinforcement Learning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在面对视觉复杂场景时存在的感知脆弱性问题,其根源在于训练数据集规模有限且难以扩展,从而限制了模型鲁棒性的提升。解决方案的关键在于提出一种名为AOT-SFT的规模化对抗数据集和基于自对弈机制的AOT(Adversarial Opponent Training)框架:该框架通过图像编辑攻击者(Attacker)与防御者MLLM(Defender)之间的协同进化,自动生成多样且动态的图像扰动样本作为训练数据,使防御者持续适应并增强对复杂视觉输入的感知能力,从而显著提升模型鲁棒性并减少幻觉现象。
链接: https://arxiv.org/abs/2602.22227
作者: Yicheng Bao,Xuhong Wang,Xin Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) exhibit perceptual fragility when confronted with visually complex scenes. This weakness stems from a reliance on finite training datasets, which are prohibitively expensive to scale and impose a ceiling on model robustness. We introduce \textbfAOT-SFT, a large-scale adversarial dataset for bootstrapping MLLM robustness. Building on this, we propose \textbfAOT (Adversarial Opponent Training), a self-play framework that forges MLLM robustness by creating its own training data. Our method orchestrates a co-evolution between an image-editing Attacker and a Defender MLLM, where the Attacker generates a diverse and dynamic curriculum of image manipulations, forcing the Defender to adapt and improve. Extensive experiments demonstrate that AOT enhances the Defender’s perceptual robustness and reduces hallucinations, establishing a scalable paradigm for training more reliable MLLMs.
[AI-116] Accelerated Online Risk-Averse Policy Evaluation in POMDPs with Theoretical Guarantees and Novel CVaR Bounds
【速读】:该论文致力于解决在部分可观测环境中,风险规避型决策问题的计算复杂性难题,具体针对带有条件风险价值(Conditional Value-at-Risk, CVaR)敏感性的部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)中,CVaR值函数评估效率低下的问题。其解决方案的关键在于构建一套理论框架,通过引入辅助随机变量 $ Y $ 对目标随机变量 $ X $ 的CVaR进行上下界估计,基于分布函数和密度函数之间的关系推导出可解释的集中不等式,并进一步将这些边界应用于简化后的信念-马尔可夫决策过程(belief-MDP),从而在保证与原POMDP一致性的前提下,实现动作消除(action elimination)策略,有效加速CVaR值函数的计算。该方法在多个POMDP任务中验证了其既能可靠区分安全与高风险策略,又能显著提升计算效率。
链接: https://arxiv.org/abs/2602.23073
作者: Yaacov Pariente,Vadim Indelman
机构: 未知
类目: atistics Theory (math.ST); Artificial Intelligence (cs.AI)
备注:
Abstract:Risk-averse decision-making under uncertainty in partially observable domains is a central challenge in artificial intelligence and is essential for developing reliable autonomous agents. The formal framework for such problems is the partially observable Markov decision process (POMDP), where risk sensitivity is introduced through a risk measure applied to the value function, with Conditional Value-at-Risk (CVaR) being a particularly significant criterion. However, solving POMDPs is computationally intractable in general, and approximate methods rely on computationally expensive simulations of future agent trajectories. This work introduces a theoretical framework for accelerating CVaR value function evaluation in POMDPs with formal performance guarantees. We derive new bounds on the CVaR of a random variable X using an auxiliary random variable Y, under assumptions relating their cumulative distribution and density functions; these bounds yield interpretable concentration inequalities and converge as the distributional discrepancy vanishes. Building on this, we establish upper and lower bounds on the CVaR value function computable from a simplified belief-MDP, accommodating general simplifications of the transition dynamics. We develop estimators for these bounds within a particle-belief MDP framework with probabilistic guarantees, and employ them for acceleration via action elimination: actions whose bounds indicate suboptimality under the simplified model are safely discarded while ensuring consistency with the original POMDP. Empirical evaluation across multiple POMDP domains confirms that the bounds reliably separate safe from dangerous policies while achieving substantial computational speedups under the simplified model.
[AI-117] Scattering Transform for Auditory Attention Decoding
【速读】:该论文旨在解决新一代助听器仍待攻克的“鸡尾酒会问题”(cocktail party problem),即在嘈杂环境中区分和聚焦目标语音信号的难题。其解决方案的关键在于采用小波散射变换(scattering transform)作为替代传统预处理方法的新特征提取方式,通过两层小波散射变换从脑电图(EEG)信号中提取更具判别力的时频结构信息,从而提升基于脑电的听觉注意力解码性能。实验表明,该方法在KU Leuven数据集上显著改善了个体相关条件下的分类准确率,尤其在训练数据有限时优势明显,说明小波散射变换能有效挖掘额外的相关特征信息。
链接: https://arxiv.org/abs/2602.23003
作者: René Pallenberg,Fabrice Katzberg,Alfred Mertins,Marco Maass
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: This work has been submitted to the IEEE for possible publication
Abstract:The use of hearing aids will increase in the coming years due to demographic change. One open problem that remains to be solved by a new generation of hearing aids is the cocktail party problem. A possible solution is electroencephalography-based auditory attention decoding. This has been the subject of several studies in recent years, which have in common that they use the same preprocessing methods in most cases. In this work, in order to achieve an advantage, the use of a scattering transform is proposed as an alternative to these preprocessing methods. The two-layer scattering transform is compared with a regular filterbank, the synchrosqueezing short-time Fourier transform and the common preprocessing. To demonstrate the performance, the known and the proposed preprocessing methods are compared for different classification tasks on two widely used datasets, provided by the KU Leuven (KUL) and the Technical University of Denmark (DTU). Both established and new neural-network-based models, CNNs, LSTMs, and recent Transformer/graph-based models are used for classification. Various evaluation strategies were compared, with a focus on the task of classifying speakers who are unknown from the training. We show that the two-layer scattering transform can significantly improve the performance for subject-related conditions, especially on the KUL dataset. However, on the DTU dataset, this only applies to some of the models, or when larger amounts of training data are provided, as in 10-fold cross-validation. This suggests that the scattering transform is capable of extracting additional relevant information.
[AI-118] Discovery of Interpretable Physical Laws in Materials via Language-Model-Guided Symbolic Regression
【速读】:该论文旨在解决从高维数据中发现可解释物理规律的难题,尤其针对传统符号回归(Symbolic Regression)方法在庞大可能形式空间中易产生复杂且不具物理意义公式的局限性。其解决方案的关键在于引入一种利用大语言模型(Large Language Models, LLMs)嵌入的科学知识来引导搜索过程的新框架,从而显著缩小有效搜索空间(约减少10⁵倍),实现对材料关键性质(如体模量、带隙和析氧反应活性)的高效、准确且具物理意义的公式发现。
链接: https://arxiv.org/abs/2602.22967
作者: Yifeng Guan,Chuyi Liu,Dongzhan Zhou,Lei Bai,Wan-jian Yin,Jingyuan Li,Mao Su
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Discovering interpretable physical laws from high-dimensional data is a fundamental challenge in scientific research. Traditional methods, such as symbolic regression, often produce complex, unphysical formulas when searching a vast space of possible forms. We introduce a framework that guides the search process by leveraging the embedded scientific knowledge of large language models, enabling efficient identification of physical laws in the data. We validate our approach by modeling key properties of perovskite materials. Our method mitigates the combinatorial explosion commonly encountered in traditional symbolic regression, reducing the effective search space by a factor of approximately 10^5 . A set of novel formulas for bulk modulus, band gap, and oxygen evolution reaction activity are identified, which not only provide meaningful physical insights but also outperform previous formulas in accuracy and simplicity.
[AI-119] Learning Tangent Bundles and Characteristic Classes with Autoencoder Atlases
【速读】:该论文试图解决多图自编码器(multi-chart autoencoders)在流形学习中如何从拓扑角度理解其结构表示的问题,特别是如何将局部训练的编码器-解码器对与流形的整体微分几何性质建立联系。解决方案的关键在于将一组局部训练的自编码器视为数据流形上的一个“学习到的图册”(learned atlas),并证明任何重建一致的自编码器图册自然定义满足上链条件(cocycle condition)的过渡映射;线性化这些过渡映射后可构造出一个向量丛,当潜在空间维度等于流形内在维度时,该向量丛即为切丛(tangent bundle)。这一构造使得可以直接计算微分拓扑不变量(如第一Stiefel-Whitney类),从而提供检测流形定向性的算法准则,并揭示非平凡特征类对单图表示的阻碍作用,以及最小自编码器图册数量由流形的良好覆盖结构决定。
链接: https://arxiv.org/abs/2602.22873
作者: Eduardo Paluzo-Hidalgo,Yuichi Ike
机构: 未知
类目: Algebraic Topology (math.AT); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
备注:
Abstract:We introduce a theoretical framework that connects multi-chart autoencoders in manifold learning with the classical theory of vector bundles and characteristic classes. Rather than viewing autoencoders as producing a single global Euclidean embedding, we treat a collection of locally trained encoder-decoder pairs as a learned atlas on a manifold. We show that any reconstruction-consistent autoencoder atlas canonically defines transition maps satisfying the cocycle condition, and that linearising these transition maps yields a vector bundle coinciding with the tangent bundle when the latent dimension matches the intrinsic dimension of the manifold. This construction provides direct access to differential-topological invariants of the data. In particular, we show that the first Stiefel-Whitney class can be computed from the signs of the Jacobians of learned transition maps, yielding an algorithmic criterion for detecting orientability. We also show that non-trivial characteristic classes provide obstructions to single-chart representations, and that the minimum number of autoencoder charts is determined by the good cover structure of the manifold. Finally, we apply our methodology to low-dimensional orientable and non-orientable manifolds, as well as to a non-orientable high-dimensional image dataset.
[AI-120] From Shallow Bayesian Neural Networks to Gaussian Processes: General Convergence Identifiability and Scalable Inference
【速读】:该论文旨在解决浅层贝叶斯神经网络(Bayesian Neural Networks, BNNs)在大规模应用中的可扩展性与统计建模稳定性问题,特别是通过其与高斯过程(Gaussian Processes, GPs)的渐近等价性来实现更高效、可识别的推断。解决方案的关键在于:首先,放宽了先前理论中对BNN到GP收敛性的假设条件,建立了更通用的收敛结果;其次,提出了一种由四种常见激活函数诱导的核函数的凸混合形式,确保其正定性及在不同输入设计下的严格与实际可识别性;最后,基于Nyström近似构建了可扩展的最大后验(MAP)训练与预测算法,通过控制近似秩和锚点选择实现计算成本与精度之间的权衡,从而在真实计算资源下获得稳定超参数估计和优异的预测性能。
链接: https://arxiv.org/abs/2602.22492
作者: Gracielle Antunes de Araújo,Flávio B. Gonçalves
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, 4 figures, 8 tables. Supplementary material included
Abstract:In this work, we study scaling limits of shallow Bayesian neural networks (BNNs) via their connection to Gaussian processes (GPs), with an emphasis on statistical modeling, identifiability, and scalable inference. We first establish a general convergence result from BNNs to GPs by relaxing assumptions used in prior formulations, and we compare alternative parameterizations of the limiting GP model. Building on this theory, we propose a new covariance function defined as a convex mixture of components induced by four widely used activation functions, and we characterize key properties including positive definiteness and both strict and practical identifiability under different input designs. For computation, we develop a scalable maximum a posterior (MAP) training and prediction procedure using a Nyström approximation, and we show how the Nyström rank and anchor selection control the cost-accuracy trade-off. Experiments on controlled simulations and real-world tabular datasets demonstrate stable hyperparameter estimates and competitive predictive performance at realistic computational cost.
[AI-121] Learning to reconstruct from saturated data: audio declipping and high-dynamic range imaging
【速读】:该论文旨在解决在缺乏真实参考(ground truth)的情况下,如何通过学习方法有效恢复被截断(clipped)测量信号的问题,尤其针对非线性逆问题。传统自监督学习方法多局限于线性逆问题,而本文将自监督学习扩展至非线性场景,如从饱和信号中重建音频和图像。其解决方案的关键在于假设信号分布对幅度变化具有近似不变性(approximate invariance to amplitude changes),并据此推导出足够的学习条件以及一个仅依赖 clipped 测量即可训练重建网络的自监督损失函数。实验表明,该方法在性能上接近完全监督方法,证明了其有效性与实用性。
链接: https://arxiv.org/abs/2602.22279
作者: Victor Sechaud,Laurent Jacques,Patrice Abry,Julián Tachella
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Learning based methods are now ubiquitous for solving inverse problems, but their deployment in real-world applications is often hindered by the lack of ground truth references for training. Recent self-supervised learning strategies offer a promising alternative, avoiding the need for ground truth. However, most existing methods are limited to linear inverse problems. This work extends self-supervised learning to the non-linear problem of recovering audio and images from clipped measurements, by assuming that the signal distribution is approximately invariant to changes in amplitude. We provide sufficient conditions for learning to reconstruct from saturated signals alone and a self-supervised loss that can be used to train reconstruction networks. Experiments on both audio and image data show that the proposed approach is almost as effective as fully supervised approaches, despite relying solely on clipped measurements for training.
[AI-122] CryoNet.Refine: A One-step Diffusion Model for Rapid Refinement of Structural Models with Cryo-EM Density Map Restraints ICLR2026
【速读】:该论文旨在解决冷冻电镜(cryo-EM)高分辨率结构解析中分子结构精修的计算效率低、依赖人工调参的问题,这些问题已成为研究者的瓶颈。解决方案的关键在于提出一个端到端的深度学习框架——利用单步扩散模型(one-step diffusion model),结合密度感知损失函数(density-aware loss function)与稳健的几何化学约束(robust stereochemical restraints),实现对实验密度图的快速优化,从而在无需人工干预的情况下显著提升模型与地图的相关性及整体几何质量。
链接: https://arxiv.org/abs/2602.22263
作者: Fuyao Huang,Xiaozhu Yu,Kui Xu,Qiangfeng Cliff Zhang
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
备注: Published as a conference paper at ICLR 2026
Abstract:High-resolution structure determination by cryo-electron microscopy (cryo-EM) requires the accurate fitting of an atomic model into an experimental density map. Traditional refinement pipelines such as Phenix.real_space_refine and Rosetta are computationally expensive, demand extensive manual tuning, and present a significant bottleneck for researchers. We present this http URL, an end-to-end deep learning framework that automates and accelerates molecular structure refinement. Our approach utilizes a one-step diffusion model that integrates a density-aware loss function with robust stereochemical restraints, enabling rapid optimization of a structure against experimental data. this http URL provides a unified and versatile solution capable of refining protein complexes as well as DNA/RNA-protein complexes. In benchmarks against Phenix.real_space_refine, this http URL consistently achieves substantial improvements in both model-map correlation and overall geometric quality metrics. By offering a scalable, automated, and powerful alternative, this http URL aims to serve as an essential tool for next-generation cryo-EM structure refinement. Web server: this https URL Source code: this https URL.
[AI-123] Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations
【速读】:该论文旨在解决单细胞基础模型(如scGPT)中高维基因表示所蕴含的生物学知识不明确的问题。其核心解决方案是通过63轮自动化假设筛选(共测试183个假设),系统性地解码scGPT内部表示的几何结构,发现模型并非将基因编码在黑箱特征空间中,而是构建了一个具有生物学意义的坐标系:主导谱轴按亚细胞定位分离基因(分泌蛋白与胞质蛋白分居两极),中间Transformer层按细胞分泌通路顺序编码线粒体和内质网相关基因;正交轴则以梯度精度刻画蛋白质互作网络(Spearman相关系数ρ=1.000,p=0.017);在六维谱子空间中可区分转录因子与其靶基因(AUROC=0.744),且早期层保留具体调控关系、深层压缩为调控类别区分;此外,残差流几何结构补充注意力模式,揭示了模型对细胞类型标记基因(AUROC=0.851)和B细胞主调控因子(BATF/BACH2向PAX5收敛)的精确建模能力。这一成果表明生物Transformer学习到了可解释的细胞组织内部模型,为调控网络推断、药物靶点优先排序及模型审计提供了新范式。
链接: https://arxiv.org/abs/2602.22247
作者: Ihor Kendiukhov
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening (183 hypotheses tested), revealing that the model organizes genes into a structured biological coordinate system rather than an opaque feature space. The dominant spectral axis separates genes by subcellular localization, with secreted proteins at one pole and cytosolic proteins at the other. Intermediate transformer layers transiently encode mitochondrial and ER compartments in a sequence that mirrors the cellular secretory pathway. Orthogonal axes encode protein-protein interaction networks with graded fidelity to experimentally measured interaction strength (Spearman rho = 1.000 across n = 5 STRING confidence quintiles, p = 0.017). In a compact six-dimensional spectral subspace, the model distinguishes transcription factors from their target genes (AUROC = 0.744, all 12 layers significant). Early layers preserve which specific genes regulate which targets, while deeper layers compress this into a coarser regulator versus regulated distinction. Repression edges are geometrically more prominent than activation edges, and B-cell master regulators BATF and BACH2 show convergence toward the B-cell identity anchor PAX5 across transformer depth. Cell-type marker genes cluster with high fidelity (AUROC = 0.851). Residual-stream geometry encodes biological structure complementary to attention patterns. These results indicate that biological transformers learn an interpretable internal model of cellular organization, with implications for regulatory network inference, drug target prioritization, and model auditing. Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2602.22247 [q-bio.GN] (or arXiv:2602.22247v1 [q-bio.GN] for this version) https://doi.org/10.48550/arXiv.2602.22247 Focus to learn more arXiv-issued DOI via DataCite
[AI-124] Unsupervised Denoising of Diffusion-Weighted Images with Bias and Variance Corrected Noise Modeling
【速读】:该论文旨在解决扩散磁共振成像(diffusion magnetic resonance imaging, dMRI)在高扩散加权条件下因信噪比(signal-to-noise ratio, SNR)低而导致图像质量下降、下游分析可靠性降低的问题。现有自监督和无监督去噪方法未充分考虑dMRI幅度数据中常见的非高斯Rician噪声特性,易引入系统性偏差和异方差性噪声波动,尤其在低SNR环境下更为显著。解决方案的关键在于提出两种基于Rician统计特性的噪声校正训练目标:一是基于一阶矩的损失函数以消除均值偏差,二是基于二阶矩的损失函数以修正平方信号偏差;二者均引入自适应权重机制以应对方差异质性,且无需修改网络结构即可集成至图像特定的无监督深度图像先验(Deep Image Prior, DIP)框架中。实验表明,该方法能有效抑制Rician偏差并稳定噪声波动,显著提升图像质量和扩散指标的可靠性。
链接: https://arxiv.org/abs/2602.22235
作者: Jine Xie,Zhicheng Zhang,Yunwei Chen,Yanqiu Feng,Xinyuan Zhang
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:Diffusion magnetic resonance imaging (dMRI) plays a vital role in both clinical diagnostics and neuroscience research. However, its inherently low signal-to-noise ratio (SNR), especially under high diffusion weighting, significantly degrades image quality and impairs downstream analysis. Recent self-supervised and unsupervised denoising methods offer a practical solution by enhancing image quality without requiring clean references. However, most of these methods do not explicitly account for the non-Gaussian noise characteristics commonly present in dMRI magnitude data during the supervised learning process, potentially leading to systematic bias and heteroscedastic variance, particularly under low-SNR conditions. To overcome this limitation, we introduce noise-corrected training objectives that explicitly model Rician statistics. Specifically, we propose two alternative loss functions: one derived from the first-order moment to remove mean bias, and another from the second-order moment to correct squared-signal bias. Both losses include adaptive weighting to account for variance heterogeneity and can be used without changing the network architecture. These objectives are instantiated in an image-specific, unsupervised Deep Image Prior (DIP) framework. Comprehensive experiments on simulated and in-vivo dMRI show that the proposed losses effectively reduce Rician bias and suppress noise fluctuations, yielding higher image quality and more reliable diffusion metrics than state-of-the-art denoising baselines. These results underscore the importance of bias- and variance-aware noise modeling for robust dMRI analysis under low-SNR conditions.
[AI-125] FM-RME: Foundation Model Empowered Radio Map Estimation
【速读】:该论文旨在解决传统无线电地图估计(Radio Map Estimation, RME)方法在复杂频谱环境中难以捕捉多维动态特性的问题,以及现有数据驱动方法因忽略无线传播物理先验知识而导致的数据效率低下问题。其解决方案的关键在于提出一种新的基础模型FM-RME,通过自监督预训练实现零样本泛化能力,从而支持多维RME(包括空间、时间和频谱维度)。该模型的核心创新在于融合两个关键组件:一是几何感知特征提取模块,将传播对称性(如平移和旋转不变性)作为归纳偏置嵌入网络;二是基于注意力机制的神经网络,用于学习跨时空频域的长程相关性。此外,设计了一种掩码自监督的多维预训练策略,以学习跨多样化无线环境的通用频谱表示,使模型在预训练后无需针对特定场景重新训练即可实现零样本推理。
链接: https://arxiv.org/abs/2602.22231
作者: Dong Yang,Yue Wang,Songyang Zhang,Yingshu Li,Zhipeng Cai,Zhi Tian
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 5 figures, conference
Abstract:Traditional radio map estimation (RME) techniques fail to capture multi-dimensional and dynamic characteristics of complex spectrum environments. Recent data-driven methods achieve accurate RME in spatial domain, but ignore physical prior knowledge of radio propagation, limiting data efficiency especially in multi-dimensional scenarios. To overcome such limitations, we propose a new foundation model, characterized by self-supervised pre-training on diverse data for zero-shot generalization, enabling multi-dimensional radio map estimation (FM-RME). Specifically, FM-RME builds an effective synergy of two core components: a geometry-aware feature extraction module that encodes physical propagation symmetries, i.e., translation and rotation invariance, as inductive bias, and an attention-based neural network that learns long-range correlations across the spatial-temporal-spectral domains. A masked self-supervised multi-dimensional pre-training strategy is further developed to learn generalizable spectrum representations across diverse wireless environments. Once pre-trained, FM-RME supports zero-shot inference for multi-dimensional RME, including spatial, temporal, and spectral estimation, without scenario-specific retraining. Simulation results verify that FM-RME exhibits desired learning performance across diverse datasets and zero-shot generalization capabilities beyond existing RME methods.
[AI-126] Survey on Neural Routing Solvers
【速读】:该论文旨在解决当前神经路由求解器(Neural Routing Solvers, NRSs)在车辆路径问题(Vehicle Routing Problem, VRP)中存在研究 gaps 的问题,特别是现有方法在泛化能力评估上的不足。其解决方案的关键在于:首先,从启发式(heuristic)视角系统梳理了NRSs的分类体系,提出基于启发式原理的分层分类法;其次,设计了一个以泛化能力为导向的评估流程,用以替代传统依赖特定数据集的评估方式,从而更客观地衡量NRSs的实际适用性与鲁棒性。
链接: https://arxiv.org/abs/2602.21761
作者: Yunpeng Ba,Xi Lin,Changliang Zhou,Ruihao Zheng,Zhenkun Wang,Xinyan Liang,Zhichao Lu,Jianyong Sun,Yuhua Qian,Qingfu Zhang
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Neural routing solvers (NRSs) that leverage deep learning to tackle vehicle routing problems have demonstrated notable potential for practical applications. By learning implicit heuristic rules from data, NRSs replace the handcrafted counterparts in classic heuristic frameworks, thereby reducing reliance on costly manual design and trial-and-error adjustments. This survey makes two main contributions: (1) The heuristic nature of NRSs is highlighted, and existing NRSs are reviewed from the perspective of heuristics. A hierarchical taxonomy based on heuristic principles is further introduced. (2) A generalization-focused evaluation pipeline is proposed to address limitations of the conventional pipeline. Comparative benchmarking of representative NRSs across both pipelines uncovers a series of previously unreported gaps in current research.
机器学习
[LG-0] Mean Estimation from Coarse Data: Characterizations and Efficient Algorithms ICLR’26
链接: https://arxiv.org/abs/2602.23341
作者: Alkis Kalavasis,Anay Mehrotra,Manolis Zampetakis,Felix Zhou,Ziyu Zhu
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: Abstract truncated to arXiv limits. To appear in ICLR’26
Abstract:Coarse data arise when learners observe only partial information about samples; namely, a set containing the sample rather than its exact value. This occurs naturally through measurement rounding, sensor limitations, and lag in economic systems. We study Gaussian mean estimation from coarse data, where each true sample x is drawn from a d -dimensional Gaussian distribution with identity covariance, but is revealed only through the set of a partition containing x . When the coarse samples, roughly speaking, have ``low’’ information, the mean cannot be uniquely recovered from observed samples (i.e., the problem is not identifiable). Recent work by Fotakis, Kalavasis, Kontonis, and Tzamos [FKKT21] established that sample-efficient mean estimation is possible when the unknown mean is identifiable and the partition consists of only convex sets. Moreover, they showed that without convexity, mean estimation becomes NP-hard. However, two fundamental questions remained open: (1) When is the mean identifiable under convex partitions? (2) Is computationally efficient estimation possible under identifiability and convex partitions? This work resolves both questions. […]
[LG-1] Differentiable Zero-One Loss via Hypersimplex Projections PAKDD2026
链接: https://arxiv.org/abs/2602.23336
作者: Camilo Gomez,Pengyang Wang,Liansheng Tang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: To appear in PAKDD 2026 (Pacific-Asia Conference on Knowledge Discovery and Data Mining), 12 pages
Abstract:Recent advances in machine learning have emphasized the integration of structured optimization components into end-to-end differentiable models, enabling richer inductive biases and tighter alignment with task-specific objectives. In this work, we introduce a novel differentiable approximation to the zero-one loss-long considered the gold standard for classification performance, yet incompatible with gradient-based optimization due to its non-differentiability. Our method constructs a smooth, order-preserving projection onto the n,k-dimensional hypersimplex through a constrained optimization framework, leading to a new operator we term Soft-Binary-Argmax. After deriving its mathematical properties, we show how its Jacobian can be efficiently computed and integrated into binary and multiclass learning systems. Empirically, our approach achieves significant improvements in generalization under large-batch training by imposing geometric consistency constraints on the output logits, thereby narrowing the performance gap traditionally observed in large-batch training.
[LG-2] A Proper Scoring Rule for Virtual Staining
链接: https://arxiv.org/abs/2602.23305
作者: Samuel Tonks,Steve Hood,Ryan Musso,Ceridwen Hopely,Steve Titus,Minh Doan,Iain Styles,Alexander Krull
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generative virtual staining (VS) models for high-throughput screening (HTS) can provide an estimated posterior distribution of possible biological feature values for each input and cell. However, when evaluating a VS model, the true posterior is unavailable. Existing evaluation protocols only check the accuracy of the marginal distribution over the dataset rather than the predicted posteriors. We introduce information gain (IG) as a cell-wise evaluation framework that enables direct assessment of predicted posteriors. IG is a strictly proper scoring rule and comes with a sound theoretical motivation allowing for interpretability, and for comparing results across models and features. We evaluate diffusion- and GAN-based models on an extensive HTS dataset using IG and other metrics and show that IG can reveal substantial performance differences other metrics cannot.
[LG-3] Inferential Mechanics Part 1: Causal Mechanistic Theories of Machine Learning in Chemical Biology with Implications
链接: https://arxiv.org/abs/2602.23303
作者: Ilya Balabin,Thomas M. Kaiser
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning techniques are now routinely encountered in research laboratories across the globe. Impressive progress has been made through ML and AI techniques with regards to large data set processing. This progress has increased the ability of the experimenter to digest data and make novel predictions regarding phenomena of interest. However, machine learning predictors generated from data sets taken from the natural sciences are often treated as black boxes which are used broadly and generally without detailed consideration of the causal structure of the data set of interest. Work has been attempted to bring causality into discussions of machine learning models of natural phenomena; however, a firm and unified theoretical treatment is lacking. This series of three papers explores the union of chemical theory, biological theory, probability theory and causality that will correct current causal flaws of machine learning in the natural sciences. This paper, Part 1 of the series, provides the formal framework of the foundational causal structure of phenomena in chemical biology and is extended to machine learning through the novel concept of focus, defined here as the ability of a machine learning algorithm to narrow down to a hidden underpinning mechanism in large data sets. Initial proof of these principles on a family of Akt inhibitors is also provided. The second paper containing Part 2 will provide a formal exploration of chemical similarity, and Part 3 will present extensive experimental evidence of how hidden causal structures weaken all machine learning in chemical biology. This series serves to establish for chemical biology a new kind of mathematical framework for modeling mechanisms in Nature without the need for the tools of reductionism: inferential mechanics.
[LG-4] Physics Informed Viscous Value Representations
链接: https://arxiv.org/abs/2602.23280
作者: Hrishikesh Viswanath,Juanwu Lu,S. Talha Bukhari,Damon Conover,Ziran Wang,Aniket Bera
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Offline goal-conditioned reinforcement learning (GCRL) learns goal-conditioned policies from static pre-collected datasets. However, accurate value estimation remains a challenge due to the limited coverage of the state-action space. Recent physics-informed approaches have sought to address this by imposing physical and geometric constraints on the value function through regularization defined over first-order partial differential equations (PDEs), such as the Eikonal equation. However, these formulations can often be ill-posed in complex, high-dimensional environments. In this work, we propose a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman (HJB) equation. By providing a physics-based inductive bias, our approach grounds the learning process in optimal control theory, explicitly regularizing and bounding updates during value iterations. Furthermore, we leverage the Feynman-Kac theorem to recast the PDE solution as an expectation, enabling a tractable Monte Carlo estimation of the objective that avoids numerical instability in higher-order gradients. Experiments demonstrate that our method improves geometric consistency, making it broadly applicable to navigation and high-dimensional, complex manipulation tasks. Open-source codes are available at this https URL.
[LG-5] Zeroth-Order Stackelberg Control in Combinatorial Congestion Games
链接: https://arxiv.org/abs/2602.23277
作者: Saeed Masiha,Sepehr Elahi,Negar Kiyavash,Patrick Thiran
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
Abstract:We study Stackelberg (leader–follower) tuning of network parameters (tolls, capacities, incentives) in combinatorial congestion games, where selfish users choose discrete routes (or other combinatorial strategies) and settle at a congestion equilibrium. The leader minimizes a system-level objective (e.g., total travel time) evaluated at equilibrium, but this objective is typically nonsmooth because the set of used strategies can change abruptly. We propose ZO-Stackelberg, which couples a projection-free Frank–Wolfe equilibrium solver with a zeroth-order outer update, avoiding differentiation through equilibria. We prove convergence to generalized Goldstein stationary points of the true equilibrium objective, with explicit dependence on the equilibrium approximation error, and analyze subsampled oracles: if an exact minimizer is sampled with probability \kappa_m , then the Frank–Wolfe error decays as \mathcalO(1/(\kappa_m T)) . We also propose stratified sampling as a practical way to avoid a vanishing \kappa_m when the strategies that matter most for the Wardrop equilibrium concentrate in a few dominant combinatorial classes (e.g., short paths). Experiments on real-world networks demonstrate that our method achieves orders-of-magnitude speedups over a differentiation-based baseline while converging to follower equilibria.
[LG-6] akeuchis Information Criteria as Generalization Measures for DNNs Close to NTK Regime
链接: https://arxiv.org/abs/2602.23219
作者: Hiroki Naganuma,Taiji Suzuki,Rio Yokota,Masahiro Nomura,Kohta Ishikawa,Ikuro Sato
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generalization measures have been studied extensively in the machine learning community to better characterize generalization gaps. However, establishing a reliable generalization measure for statistically singular models such as deep neural networks (DNNs) is difficult due to their complex nature. This study focuses on Takeuchi’s information criterion (TIC) to investigate the conditions under which this classical measure can effectively explain the generalization gaps of DNNs. Importantly, the developed theory indicates the applicability of TIC near the neural tangent kernel (NTK) regime. In a series of experiments, we trained more than 5,000 DNN models with 12 architectures, including large models (e.g., VGG-16), on four datasets, and estimated the corresponding TIC values to examine the relationship between the generalization gap and the TIC estimates. We applied several TIC approximation methods with feasible computational costs and assessed the accuracy trade-off. Our experimental results indicate that the estimated TIC values correlate well with the generalization gap under conditions close to the NTK regime. However, we show both theoretically and empirically that outside the NTK regime such correlation disappears. Finally, we demonstrate that TIC provides better trial pruning ability than existing methods for hyperparameter optimization.
[LG-7] Me What To Learn: Generalizing Neural Memory to be Controllable in Natural Language
链接: https://arxiv.org/abs/2602.23201
作者: Max S. Bennett,Thomas P. Zollo,Richard Zemel
类目: Machine Learning (cs.LG)
*备注: 58 Pages, 16 Figures, Code at this https URL
Abstract:Modern machine learning models are deployed in diverse, non-stationary environments where they must continually adapt to new tasks and evolving knowledge. Continual fine-tuning and in-context learning are costly and brittle, whereas neural memory methods promise lightweight updates with minimal forgetting. However, existing neural memory models typically assume a single fixed objective and homogeneous information streams, leaving users with no control over what the model remembers or ignores over time. To address this challenge, we propose a generalized neural memory system that performs flexible updates based on learning instructions specified in natural language. Our approach enables adaptive agents to learn selectively from heterogeneous information sources, supporting settings, such as healthcare and customer service, where fixed-objective memory updates are insufficient.
[LG-8] Efficient Real-Time Adaptation of ROMs for Unsteady Flows Using Data Assimilation
链接: https://arxiv.org/abs/2602.23188
作者: Ismaël Zighed,Andrea Nóvoa,Luca Magri,Taraneh Sayadi
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:We propose an efficient retraining strategy for a parameterized Reduced Order Model (ROM) that attains accuracy comparable to full retraining while requiring only a fraction of the computational time and relying solely on sparse observations of the full system. The architecture employs an encode-process-decode structure: a Variational Autoencoder (VAE) to perform dimensionality reduction, and a transformer network to evolve the latent states and model the dynamics. The ROM is parameterized by an external control variable, the Reynolds number in the Navier-Stokes setting, with the transformer exploiting attention mechanisms to capture both temporal dependencies and parameter effects. The probabilistic VAE enables stochastic sampling of trajectory ensembles, providing predictive means and uncertainty quantification through the first two moments. After initial training on a limited set of dynamical regimes, the model is adapted to out-of-sample parameter regions using only sparse data. Its probabilistic formulation naturally supports ensemble generation, which we employ within an ensemble Kalman filtering framework to assimilate data and reconstruct full-state trajectories from minimal observations. We further show that, for the dynamical system considered, the dominant source of error in out-of-sample forecasts stems from distortions of the latent manifold rather than changes in the latent dynamics. Consequently, retraining can be limited to the autoencoder, allowing for a lightweight, computationally efficient, real-time adaptation procedure with very sparse fine-tuning data.
[LG-9] Closing the gap on tabular data with Fourier and Implicit Categorical Features
链接: https://arxiv.org/abs/2602.23182
作者: Marius Dragoi,Florin Gogianu,Elena Burceanu
类目: Machine Learning (cs.LG)
*备注:
Abstract:While Deep Learning has demonstrated impressive results in applications on various data types, it continues to lag behind tree-based methods when applied to tabular data, often referred to as the last “unconquered castle” for neural networks. We hypothesize that a significant advantage of tree-based methods lies in their intrinsic capability to model and exploit non-linear interactions induced by features with categorical characteristics. In contrast, neural-based methods exhibit biases toward uniform numerical processing of features and smooth solutions, making it challenging for them to effectively leverage such patterns. We address this performance gap by using statistical-based feature processing techniques to identify features that are strongly correlated with the target once discretized. We further mitigate the bias of deep models for overly-smooth solutions, a bias that does not align with the inherent properties of the data, using Learned Fourier. We show that our proposed feature preprocessing significantly boosts the performance of deep learning models and enables them to achieve a performance that closely matches or surpasses XGBoost on a comprehensive tabular data benchmark.
[LG-10] Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models
链接: https://arxiv.org/abs/2602.23179
作者: Gal Kesten-Pomeranz,Yaniv Nikankin,Anja Reusch,Tomer Tsaban,Ora Schueler-Furman,Yonatan Belinkov
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
Abstract:Protein sequences are abundant in repeating segments, both as exact copies and as approximate segments with mutations. These repeats are important for protein structure and function, motivating decades of algorithmic work on repeat identification. Recent work has shown that protein language models (PLMs) identify repeats, by examining their behavior in masked-token prediction. To elucidate their internal mechanisms, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages: PLMs first build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.
[LG-11] SettleFL: Trustless and Scalable Reward Settlement Protocol for Federated Learning on Permissionless Blockchains (Extended version)
链接: https://arxiv.org/abs/2602.23167
作者: Shuang Liang(1),Yang Hua(2),Linshan Jiang(3),Peishen Yan(1),Tao Song(1),Bin Yao(1),Haibing Guan(1) ((1) Shanghai Jiao Tong University, (2) Queen’s University Belfast, (3) National University of Singapore)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:In open Federated Learning (FL) environments where no central authority exists, ensuring collaboration fairness relies on decentralized reward settlement, yet the prohibitive cost of permissionless blockchains directly clashes with the high-frequency, iterative nature of model training. Existing solutions either compromise decentralization or suffer from scalability bottlenecks due to linear on-chain costs. To address this, we present SettleFL, a trustless and scalable reward settlement protocol designed to minimize total economic friction by offering a family of two interoperable protocols. Leveraging a shared domain-specific circuit architecture, SettleFL offers two interoperable strategies: (1) a Commit-and-Challenge variant that minimizes on-chain costs via optimistic execution and dispute-driven arbitration, and (2) a Commit-with-Proof variant that guarantees instant finality through per-round validity proofs. This design allows the protocol to flexibly adapt to varying latency and cost constraints while enforcing rational robustness without trusted coordination. We conduct extensive experiments combining real FL workloads and controlled simulations. Results show that SettleFL remains practical when scaling to 800 participants, achieving substantially lower gas cost.
[LG-12] MetaOthello: A Controlled Study of Multiple World Models in Transformers
链接: https://arxiv.org/abs/2602.23164
作者: Aviral Chawla,Galen Hall,Juniper Lovato
类目: Machine Learning (cs.LG)
*备注:
Abstract:Foundation models must handle multiple generative processes, yet mechanistic interpretability largely studies capabilities in isolation; it remains unclear how a single transformer organizes multiple, potentially conflicting “world models”. Previous experiments on Othello playing neural-networks test world-model learning but focus on a single game with a single set of rules. We introduce MetaOthello, a controlled suite of Othello variants with shared syntax but different rules or tokenizations, and train small GPTs on mixed-variant data to study how multiple world models are organized in a shared representation space. We find that transformers trained on mixed-game data do not partition their capacity into isolated sub-models; instead, they converge on a mostly shared board-state representation that transfers causally across variants. Linear probes trained on one variant can intervene on another’s internal state with effectiveness approaching that of matched probes. For isomorphic games with token remapping, representations are equivalent up to a single orthogonal rotation that generalizes across layers. When rules partially overlap, early layers maintain game-agnostic representations while a middle layer identifies game identity, and later layers specialize. MetaOthello offers a path toward understanding not just whether transformers learn world models, but how they organize many at once.
[LG-13] Benchmarking Temporal Web3 Intelligence: Lessons from the FinSurvival 2025 Challenge
链接: https://arxiv.org/abs/2602.23159
作者: Oshani Seneviratne,Fernando Spadea,Adrien Pavao,Aaron Micah Green,Kristin P. Bennett
类目: Machine Learning (cs.LG)
*备注:
Abstract:Temporal Web analytics increasingly relies on large-scale, longitudinal data to understand how users, content, and systems evolve over time. A rapidly growing frontier is the \emphTemporal Web3: decentralized platforms whose behavior is recorded as immutable, time-stamped event streams. Despite the richness of this data, the field lacks shared, reproducible benchmarks that capture real-world temporal dynamics, specifically censoring and non-stationarity, across extended horizons. This absence slows methodological progress and limits the transfer of techniques between Web3 and broader Web domains. In this paper, we present the \textitFinSurvival Challenge 2025 as a case study in benchmarking \emphtemporal Web3 intelligence. Using 21.8 million transaction records from the Aave v3 protocol, the challenge operationalized 16 survival prediction tasks to model user behavior this http URL detail the benchmark design and the winning solutions, highlighting how domain-aware temporal feature construction significantly outperformed generic modeling approaches. Furthermore, we distill lessons for next-generation temporal benchmarks, arguing that Web3 systems provide a high-fidelity sandbox for studying temporal challenges, such as churn, risk, and evolution that are fundamental to the wider Web.
[LG-14] Prediction of Diffusion Coefficients in Mixtures with Tensor Completion
链接: https://arxiv.org/abs/2602.23142
作者: Zeno Romero,Kerstin Münnemann,Hans Hasse,Fabian Jirasek
类目: Machine Learning (cs.LG)
*备注:
Abstract:Predicting diffusion coefficients in mixtures is crucial for many applications, as experimental data remain scarce, and machine learning (ML) offers promising alternatives to established semi-empirical models. Among ML models, matrix completion methods (MCMs) have proven effective in predicting thermophysical properties, including diffusion coefficients in binary mixtures. However, MCMs are restricted to single-temperature predictions, and their accuracy depends strongly on the availability of high-quality experimental data for each temperature of interest. In this work, we address this challenge by presenting a hybrid tensor completion method (TCM) for predicting temperature-dependent diffusion coefficients at infinite dilution in binary mixtures. The TCM employs a Tucker decomposition and is jointly trained on experimental data for diffusion coefficients at infinite dilution in binary systems at 298 K, 313 K, and 333 K. Predictions from the semi-empirical SEGWE model serve as prior knowledge within a Bayesian training framework. The TCM then extrapolates linearly to any temperature between 268 K and 378 K, achieving markedly improved prediction accuracy compared to established models across all studied temperatures. To further enhance predictive performance, the experimental database was expanded using active learning (AL) strategies for targeted acquisition of new diffusion data by pulsed-field gradient (PFG) NMR measurements. Diffusion coefficients at infinite dilution in 19 solute + solvent systems were measured at 298 K, 313 K, and 333 K. Incorporating these results yields a substantial improvement in the TCM’s predictive accuracy. These findings highlight the potential of combining data-efficient ML methods with adaptive experimentation to advance predictive modeling of transport properties.
[LG-15] Bound to Disagree : Generalization Bounds via Certifiable Surrogates
链接: https://arxiv.org/abs/2602.23128
作者: Mathieu Bazinet,Valentina Zantedeschi,Pascal Germain
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generalization bounds for deep learning models are typically vacuous, not computable or restricted to specific model classes. In this paper, we tackle these issues by providing new disagreement-based certificates for the gap between the true risk of any two predictors. We then bound the true risk of the predictor of interest via a surrogate model that enjoys tight generalization guarantees, and evaluating our disagreement bound on an unlabeled dataset. We empirically demonstrate the tightness of the obtained certificates and showcase the versatility of the approach by training surrogate models leveraging three different frameworks: sample compression, model compression and PAC-Bayes theory. Importantly, such guarantees are achieved without modifying the target model, nor adapting the training procedure to the generalization framework.
[LG-16] Regularized Online RLHF with Generalized Bilinear Preferences
链接: https://arxiv.org/abs/2602.23116
作者: Junghyun Lee,Minju Hong,Kwang-Sung Jun,Chulhee Yun,Se-Young Yun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 43 pages, 1 table
Abstract:We consider the problem of contextual online RLHF with general preferences, where the goal is to identify the Nash Equilibrium. We adopt the Generalized Bilinear Preference Model (GBPM) to capture potentially intransitive preferences via low-rank, skew-symmetric matrices. We investigate general preference learning with any strongly convex regularizer (where \eta^-1 is the regularization strength), generalizing beyond prior works limited to reverse KL-regularization. Central to our analysis is proving that the dual gap of the greedy policy is bounded by the square of the estimation error - a result derived solely from strong convexity and the skew-symmetricity of this http URL on this insight and a feature diversity assumption, we establish two regret bounds via two simple algorithms: (1) Greedy Sampling achieves polylogarithmic, e^O(\eta) -free regret \tildeO(\eta d^4 (\log T)^2) . (2) Explore-Then-Commit achieves \mathrmpoly(d) -free regret \tildeO(\sqrt\eta r T) by exploiting the low-rank structure; this is the first statistically efficient guarantee for online RLHF in high-dimensions.
[LG-17] Learning Physical Operators using Neural Operators
链接: https://arxiv.org/abs/2602.23113
作者: Vignesh Gopakumar,Ander Gray,Dan Giles,Lorenzo Zanisi,Matt J. Kusner,Timo Betcke,Stanislas Pamela,Marc Peter Deisenroth
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural operators have emerged as promising surrogate models for solving partial differential equations (PDEs), but struggle to generalise beyond training distributions and are often constrained to a fixed temporal discretisation. This work introduces a physics-informed training framework that addresses these limitations by decomposing PDEs using operator splitting methods, training separate neural operators to learn individual non-linear physical operators while approximating linear operators with fixed finite-difference convolutions. This modular mixture-of-experts architecture enables generalisation to novel physical regimes by explicitly encoding the underlying operator structure. We formulate the modelling task as a neural ordinary differential equation (ODE) where these learned operators constitute the right-hand side, enabling continuous-in-time predictions through standard ODE solvers and implicitly enforcing PDE constraints. Demonstrated on incompressible and compressible Navier-Stokes equations, our approach achieves better convergence and superior performance when generalising to unseen physics. The method remains parameter-efficient, enabling temporal extrapolation beyond training horizons, and provides interpretable components whose behaviour can be verified against known physics.
[LG-18] PRAC: Principal-Random Subspace for LLM Activation Compression and Memory-Efficient Training
链接: https://arxiv.org/abs/2602.23111
作者: Yanyi Li,Yimu Zhang,Cong Fang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Activations have become the primary memory bottleneck in large-batch LLM training. However, existing compression methods fail to exploit the spectral structure of activations, resulting in slow convergence or limited compression. To address this, we bridge the relationship between the algorithm’s fast convergence and the requirements for subspace projection, and show that an effective compression should yield an unbiased estimate of the original activation with low variance. We propose Principal-Random Subspace for LLM Activation Compression (PRAC), which novelly decomposes activations into two components: a principal subspace captured via SVD to retain dominant information, and a random subspace sampled from the orthogonal complement to approximate the tail. By introducing a precise scaling factor, we prove that PRAC yields an unbiased gradient estimator with minimum variance under certain conditions. Extensive experiments on pre-training and fine-tuning tasks demonstrate that PRAC achieves up to 36% total memory reduction with negligible performance degradation and minimal computational cost.
[LG-19] Physics-informed neural particle flow for the Bayesian update step
链接: https://arxiv.org/abs/2602.23089
作者: Domonkos Csuzdi,Tamás Bécsi,Olivér Törő
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Bayesian update step poses significant computational challenges in high-dimensional nonlinear estimation. While log-homotopy particle flow filters offer an alternative to stochastic sampling, existing formulations usually yield stiff differential equations. Conversely, existing deep learning approximations typically treat the update as a black-box task or rely on asymptotic relaxation, neglecting the exact geometric structure of the finite-horizon probability transport. In this work, we propose a physics-informed neural particle flow, which is an amortized inference framework. To construct the flow, we couple the log-homotopy trajectory of the prior to posterior density function with the continuity equation describing the density evolution. This derivation yields a governing partial differential equation (PDE), referred to as the master PDE. By embedding this PDE as a physical constraint into the loss function, we train a neural network to approximate the transport velocity field. This approach enables purely unsupervised training, eliminating the need for ground-truth posterior samples. We demonstrate that the neural parameterization acts as an implicit regularizer, mitigating the numerical stiffness inherent to analytic flows and reducing online computational complexity. Experimental validation on multimodal benchmarks and a challenging nonlinear scenario confirms better mode coverage and robustness compared to state-of-the-art baselines.
[LG-20] RhythmBERT: A Self-Supervised Language Model Based on Latent Representations of ECG Waveforms for Heart Disease Detection
链接: https://arxiv.org/abs/2602.23060
作者: Xin Wang,Burcu Ozek,Aruna Mohan,Amirhossein Ravari,Or Zilbershot,Fatemeh Afghah
类目: Machine Learning (cs.LG)
*备注:
Abstract:Electrocardiogram (ECG) analysis is crucial for diagnosing heart disease, but most self-supervised learning methods treat ECG as a generic time series, overlooking physiologic semantics and rhythm-level structure. Existing contrastive methods utilize augmentations that distort morphology, whereas generative approaches employ fixed-window segmentation, which misaligns cardiac cycles. To address these limitations, we propose RhythmBERT, a generative ECG language model that considers ECG as a language paradigm by encoding P, QRS, and T segments into symbolic tokens via autoencoder-based latent representations. These discrete tokens capture rhythm semantics, while complementary continuous embeddings retain fine-grained morphology, enabling a unified view of waveform structure and rhythm. RhythmBERT is pretrained on approximately 800,000 unlabeled ECG recordings with a masked prediction objective, allowing it to learn contextual representations in a label-efficient manner. Evaluations show that despite using only a single lead, RhythmBERT achieves comparable or superior performance to strong 12-lead baselines. This generalization extends from prevalent conditions such as atrial fibrillation to clinically challenging cases such as subtle ST-T abnormalities and myocardial infarction. Our results suggest that considering ECG as structured language offers a scalable and physiologically aligned pathway for advancing cardiac analysis.
[LG-21] Latent Matters: Learning Deep State-Space Models NEURIPS2021
链接: https://arxiv.org/abs/2602.23050
作者: Alexej Klushyn,Richard Kurle,Maximilian Soelch,Botond Cseke,Patrick van der Smagt
类目: Machine Learning (cs.LG)
*备注: Published at NeurIPS 2021
Abstract:Deep state-space models (DSSMs) enable temporal predictions by learning the underlying dynamics of observed sequence data. They are often trained by maximising the evidence lower bound. However, as we show, this does not ensure the model actually learns the underlying dynamics. We therefore propose a constrained optimisation framework as a general approach for training DSSMs. Building upon this, we introduce the extended Kalman VAE (EKVAE), which combines amortised variational inference with classic Bayesian filtering/smoothing to model dynamics more accurately than RNN-based DSSMs. Our results show that the constrained optimisation framework significantly improves system identification and prediction accuracy on the example of established state-of-the-art DSSMs. The EKVAE outperforms previous models w.r.t. prediction accuracy, achieves remarkable results in identifying dynamical systems, and can furthermore successfully learn state-space representations where static and dynamic features are disentangled.
[LG-22] Learning Disease-Sensitive Latent Interaction Graphs From Noisy Cardiac Flow Measurements
链接: https://arxiv.org/abs/2602.23035
作者: Viraj Patel,Marko Grujic,Philipp Aigner,Theodor Abart,Marcus Granegger,Deblina Bhattacharjee,Katharine Fraser
类目: Machine Learning (cs.LG)
*备注:
Abstract:Cardiac blood flow patterns contain rich information about disease severity and clinical interventions, yet current imaging and computational methods fail to capture underlying relational structures of coherent flow features. We propose a physics-informed, latent relational framework to model cardiac vortices as interacting nodes in a graph. Our model combines a neural relational inference architecture with physics-inspired interaction energy and birth-death dynamics, yielding a latent graph sensitive to disease severity and intervention level. We first apply this to computational fluid dynamics simulations of aortic coarctation. Learned latent graphs reveal that as the aortic radius narrows, vortex interactions become stronger and more frequent. This leads to a higher graph entropy, correlating monotonically with coarctation severity ( R^2=0.78 , Spearman |\rho|=0.96 ). We then extend this method to ultrasound datasets of left ventricles under varying levels of left ventricular assist device support. Again the latent graph representation captures the weakening of coherent vortical structures, thereby demonstrating cross-modal generalisation. Results show latent interaction graphs and entropy serve as robust and interpretable markers of cardiac disease and intervention.
[LG-23] Scaling Laws of Global Weather Models
链接: https://arxiv.org/abs/2602.22962
作者: Yuejiang Yu,Langwen Huang,Alexandru Calotoiu,Torsten Hoefler
类目: Machine Learning (cs.LG)
*备注: 17 pages, 7 figures
Abstract:Data-driven models are revolutionizing weather forecasting. To optimize training efficiency and model performance, this paper analyzes empirical scaling laws within this domain. We investigate the relationship between model performance (validation loss) and three key factors: model size ( N ), dataset size ( D ), and compute budget ( C ). Across a range of models, we find that Aurora exhibits the strongest data-scaling behavior: increasing the training dataset by 10x reduces validation loss by up to 3.2x. GraphCast demonstrates the highest parameter efficiency, yet suffers from limited hardware utilization. Our compute-optimal analysis indicates that, under fixed compute budgets, allocating resources to longer training durations yields greater performance gains than increasing model size. Furthermore, we analyze model shape and uncover scaling behaviors that differ fundamentally from those observed in language models: weather forecasting models consistently favor increased width over depth. These findings suggest that future weather models should prioritize wider architectures and larger effective training datasets to maximize predictive performance.
[LG-24] MSINO: Curvature-Aware Sobolev Optimization for Manifold Neural Networks
链接: https://arxiv.org/abs/2602.22937
作者: Suresan Pareth
类目: Machine Learning (cs.LG)
*备注: 32 pages, 6 figures. Submitted for journal consideration
Abstract:We introduce Manifold Sobolev Informed Neural Optimization (MSINO), a curvature aware training framework for neural networks defined on Riemannian manifolds. The method replaces standard Euclidean derivative supervision with a covariant Sobolev loss that aligns gradients using parallel transport and improves stability via a Laplace Beltrami smoothness regularization term. Building on classical results in Riemannian optimization and Sobolev theory on manifolds, we derive geometry dependent constants that yield (i) a Descent Lemma with a manifold Sobolev smoothness constant, (ii) a Sobolev Polyak Lojasiewicz inequality giving linear convergence guarantees for Riemannian gradient descent and stochastic gradient descent under explicit step size bounds, and (iii) a two step Newton Sobolev method with local quadratic contraction in curvature controlled neighborhoods. Unlike prior Sobolev training in Euclidean space, MSINO provides training time guarantees that explicitly track curvature and transported Jacobians. Applications include surface imaging, physics informed learning settings, and robotics on Lie groups such as SO(3) and SE(3). The framework unifies value and gradient based learning with curvature aware convergence guarantees for neural training on manifolds. Comments: 32 pages, 6 figures. Submitted for journal consideration Subjects: Machine Learning (cs.LG) MSC classes: 49M07, 53C21, 46E35 Cite as: arXiv:2602.22937 [cs.LG] (or arXiv:2602.22937v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.22937 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-25] Generalization Bounds of Stochastic Gradient Descent in Homogeneous Neural Networks
链接: https://arxiv.org/abs/2602.22936
作者: Wenquan Ma,Yang Sui,Jiaye Teng,Bohan Wang,Jing Xu,Jingqin Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Algorithmic stability is among the most potent techniques in generalization analysis. However, its derivation usually requires a stepsize \eta_t = \mathcalO(1/t) under non-convex training regimes, where t denotes iterations. This rigid decay of the stepsize potentially impedes optimization and may not align with practical scenarios. In this paper, we derive the generalization bounds under the homogeneous neural network regimes, proving that this regime enables slower stepsize decay of order \Omega(1/\sqrtt) under mild assumptions. We further extend the theoretical results from several aspects, e.g., non-Lipschitz regimes. This finding is broadly applicable, as homogeneous neural networks encompass fully-connected and convolutional neural networks with ReLU and LeakyReLU activations.
[LG-26] A Data-Driven Approach to Support Clinical Renal Replacement Therapy
链接: https://arxiv.org/abs/2602.22902
作者: Alice Balboni,Luis Escobar,Andrea Manno,Fabrizio Rossi,Maria Cristina Ruffa,Gianluca Villa,Giordano D’Aloisio,Antonio Consolo
类目: Machine Learning (cs.LG)
*备注:
Abstract:This study investigates a data-driven machine learning approach to predict membrane fouling in critically ill patients undergoing Continuous Renal Replacement Therapy (CRRT). Using time-series data from an ICU, 16 clinically selected features were identified to train predictive models. To ensure interpretability and enable reliable counterfactual analysis, the researchers adopted a tabular data approach rather than modeling temporal dependencies directly. Given the imbalance between fouling and non-fouling cases, the ADASYN oversampling technique was applied to improve minority class representation. Random Forest, XGBoost, and LightGBM models were tested, achieving balanced performance with 77.6% sensitivity and 96.3% specificity at a 10% rebalancing rate. Results remained robust across different forecasting horizons. Notably, the tabular approach outperformed LSTM recurrent neural networks, suggesting that explicit temporal modeling was not necessary for strong predictive performance. Feature selection further reduced the model to five key variables, improving simplicity and interpretability with minimal loss of accuracy. A Shapley value-based counterfactual analysis was applied to the best-performing model, successfully identifying minimal input changes capable of reversing fouling predictions. Overall, the findings support the viability of interpretable machine learning models for predicting membrane fouling during CRRT. The integration of prediction and counterfactual analysis offers practical clinical value, potentially guiding therapeutic adjustments to reduce fouling risk and improve patient management.
[LG-27] Fair feature attribution for multi-output prediction: a Shapley-based perspective
链接: https://arxiv.org/abs/2602.22882
作者: Umberto Biccari,Alain Ibáñez de Opakua,José María Mato,Óscar Millet,Roberto Morales,Enrique Zuazua
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this article, we provide an axiomatic characterization of feature attribution for multi-output predictors within the Shapley framework. While SHAP explanations are routinely computed independently for each output coordinate, the theoretical necessity of this practice has remained unclear. By extending the classical Shapley axioms to vector-valued cooperative games, we establish a rigidity theorem showing that any attribution rule satisfying efficiency, symmetry, dummy player, and additivity must necessarily decompose component-wise across outputs. Consequently, any joint-output attribution rule must relax at least one of the classical Shapley axioms. This result identifies a previously unformalized structural constraint in Shapley-based interpretability, clarifying the precise scope of fairness-consistent explanations in multi-output learning. Numerical experiments on a biomedical benchmark illustrate that multi-output models can yield computational savings in training and deployment, while producing SHAP explanations that remain fully consistent with the component-wise structure imposed by the Shapley axioms.
[LG-28] Hypernetwork-based approach for grid-independent functional data clustering
链接: https://arxiv.org/abs/2602.22823
作者: Anirudh Thatipelli,Ali Siahkoohi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Functional data clustering is concerned with grouping functions that share similar structure, yet most existing methods implicitly operate on sampled grids, causing cluster assignments to depend on resolution, sampling density, or preprocessing choices rather than on the underlying functions themselves. To address this limitation, we introduce a framework that maps discretized function observations – at arbitrary resolution and on arbitrary grids – into a fixed-dimensional vector space via an auto-encoding architecture. The encoder is a hypernetwork that maps coordinate-value pairs to the weight space of an implicit neural representation (INR), which serves as the decoder. Because INRs represent functions with very few parameters, this design yields compact representations that are decoupled from the sampling grid, while the hypernetwork amortizes weight prediction across the dataset. Clustering is then performed in this weight space using standard algorithms, making the approach agnostic to both the discretization and the choice of clustering method. By means of synthetic and real-world experiments in high-dimensional settings, we demonstrate competitive clustering performance that is robust to changes in sampling resolution – including generalization to resolutions not seen during training.
[LG-29] Accelerating Local LLM s on Resource-Constrained Edge Devices via Distributed Prompt Caching
链接: https://arxiv.org/abs/2602.22812
作者: Hiroki Matsutani,Naoki Matsuda,Naoto Sugiura
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Since local LLM inference on resource-constrained edge devices imposes a severe performance bottleneck, this paper proposes distributed prompt caching to enhance inference performance by cooperatively sharing intermediate processing states across multiple low-end edge devices. To fully utilize prompt similarity, our distributed caching mechanism also supports partial matching. As this approach introduces communication overhead associated with state sharing over a wireless network, we introduce a Bloom-filter-based data structure, referred to as a catalog, to determine whether a remote server possesses the desired internal states, thereby suppressing unnecessary communication. Experiments using the Gemma-3 270M model and the MMLU dataset on the Raspberry Pi Zero 2W platform demonstrate that the proposed approach reduces TTFT (Time to First Token) and TTLT (Time to Last Token) by 93.12% and 50.07% on average, respectively.
[LG-30] Multi-agent imitation learning with function approximation: Linear Markov games and beyond
链接: https://arxiv.org/abs/2602.22810
作者: Luca Viano,Till Freihaut,Emanuele Nevali,Volkan Cevher,Matthieu Geist,Giorgia Ramponi
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this work, we present the first theoretical analysis of multi-agent imitation learning (MAIL) in linear Markov games where both the transition dynamics and each agent’s reward function are linear in some given features. We demonstrate that by leveraging this structure, it is possible to replace the state-action level “all policy deviation concentrability coefficient” (Freihaut et al., arXiv:2510.09325) with a concentrability coefficient defined at the feature level which can be much smaller than the state-action analog when the features are informative about states’ similarity. Furthermore, to circumvent the need for any concentrability coefficient, we turn to the interactive setting. We provide the first, computationally efficient, interactive MAIL algorithm for linear Markov games and show that its sample complexity depends only on the dimension of the feature map d . Building on these theoretical findings, we propose a deep MAIL interactive algorithm which clearly outperforms BC on games such as Tic-Tac-Toe and Connect4.
[LG-31] Doubly Adaptive Channel and Spatial Attention for Semantic Image Communication by IoT Devices
链接: https://arxiv.org/abs/2602.22794
作者: Soroosh Miri,Sepehr Abolhasani,Shahrokh Farahmand,S. Mohammad Razavizadeh
类目: Machine Learning (cs.LG)
*备注: 6 pages, 7 figures, conference
Abstract:Internet of Things (IoT) networks face significant challenges such as limited communication bandwidth, constrained computational and energy resources, and highly dynamic wireless channel conditions. Utilization of deep neural networks (DNNs) combined with semantic communication has emerged as a promising paradigm to address these limitations. Deep joint source-channel coding (DJSCC) has recently been proposed to enable semantic communication of images. Building upon the original DJSCC formulation, low-complexity attention-style architectures has been added to the DNNs for further performance enhancement. As a main hurdle, training these DNNs separately for various signal-to-noise ratios (SNRs) will amount to excessive storage or communication overhead, which can not be maintained by small IoT devices. SNR Adaptive DJSCC (ADJSCC), has been proposed to train the DNNs once but feed the current SNR as part of the data to the channel-wise attention mechanism. We improve upon ADJSCC by a simultaneous utilization of doubly adaptive channel-wise and spatial attention modules at both transmitter and receiver. These modules dynamically adjust to varying channel conditions and spatial feature importance, enabling robust and efficient feature extraction and semantic information recovery. Simulation results corroborate that our proposed doubly adaptive DJSCC (DA-DJSCC) significantly improves upon ADJSCC in several performance criteria, while incurring a mild increase in complexity. These facts render DA-DJSCC a desirable choice for semantic communication in performance demanding but low-complexity IoT networks.
[LG-32] KMLP: A Scalable Hybrid Architecture for Web-Scale Tabular Data Modeling
链接: https://arxiv.org/abs/2602.22777
作者: Mingming Zhang,Pengfei Shi,Zhiqing Xiao,Feng Zhao,Guandong Sun,Yulin Kang,Ruizhe Gao,Ningtao Wang,Xing Fu,Weiqiang Wang,Junbo Zhao
类目: Machine Learning (cs.LG)
*备注: Accepted by THE ACM WEB CONFERENCE 2026
Abstract:Predictive modeling on web-scale tabular data with billions of instances and hundreds of heterogeneous numerical features faces significant scalability challenges. These features exhibit anisotropy, heavy-tailed distributions, and non-stationarity, creating bottlenecks for models like Gradient Boosting Decision Trees and requiring laborious manual feature engineering. We introduce KMLP, a hybrid deep architecture integrating a shallow Kolmogorov-Arnold Network (KAN) front-end with a Gated Multilayer Perceptron (gMLP) backbone. The KAN front-end uses learnable activation functions to automatically model complex non-linear transformations for each feature, while the gMLP backbone captures high-order interactions. Experiments on public benchmarks and an industrial dataset with billions of samples show KMLP achieves state-of-the-art performance, with advantages over baselines like GBDTs increasing at larger scales, validating KMLP as a scalable deep learning paradigm for large-scale web tabular data.
[LG-33] Set-based v.s. Distribution-based Representations of Epistemic Uncertainty: A Comparative Study
链接: https://arxiv.org/abs/2602.22747
作者: Kaizheng Wang,Yunjia Wang,Fabio Cuzzolin,David Moens,Hans Hallez,Siu Lun Chau
类目: Machine Learning (cs.LG)
*备注: 29 pages
Abstract:Epistemic uncertainty in neural networks is commonly modeled using two second-order paradigms: distribution-based representations, which rely on posterior parameter distributions, and set-based representations based on credal sets (convex sets of probability distributions). These frameworks are often regarded as fundamentally non-comparable due to differing semantics, assumptions, and evaluation practices, leaving their relative merits unclear. Empirical comparisons are further confounded by variations in the underlying predictive models. To clarify this issue, we present a controlled comparative study enabling principled, like-for-like evaluation of the two paradigms. Both representations are constructed from the same finite collection of predictive distributions generated by a shared neural network, isolating representational effects from predictive accuracy. Our study evaluates each representation through the lens of 3 uncertainty measures across 8 benchmarks, including selective prediction and out-of-distribution detection, spanning 6 underlying predictive models and 10 independent runs per configuration. Our results show that meaningful comparison between these seemingly non-comparable frameworks is both feasible and informative, providing insights into how second-order representation choices impact practical uncertainty-aware performance.
[LG-34] Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks
链接: https://arxiv.org/abs/2602.22719
作者: Vamshi Sunku Mohan,Kaustubh Gupta,Aneesha Das,Chandan Singh
类目: Machine Learning (cs.LG)
*备注:
Abstract:State-space models (SSMs) have emerged as an efficient strategy for building powerful language models, avoiding the quadratic complexity of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. We take a major step in this direction by identifying activation subspace bottlenecks in the Mamba family of SSM models using tools from mechanistic interpretability. We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar. Across 5 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27%, without requiring any task-specific tuning. Finally, we validate that the identified bottlenecks are indeed hindering performance by modifying them to yield an architecture we call Stable-Mamba, which achieves long-context performance gains when retrained from scratch.
[LG-35] Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning
链接: https://arxiv.org/abs/2602.22703
作者: Hao Yu,Shuning Jia,Guanghao Li,Wenhao Jiang,Chun Yuan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning (RL) framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in reinforcement learning. We assess GeoDPO on both in-domain and out-of-domain datasets, spanning tasks in geometric perception as well as downstream reasoning. Experimental results demonstrate that, while supervised fine-tuning (SFT) offers only marginal improvements and may even impair performance in out-of-domain scenarios, GeoDPO achieves substantial gains: +26.5% on in-domain data, +8.0% on out-of-domain data, and +39.0% on downstream reasoning tasks. These findings underscore the superior performance and generalization ability of GeoDPO over SFT. All codes are released at this https URL to ensure reproducibility. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2602.22703 [cs.LG] (or arXiv:2602.22703v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.22703 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-36] DPSQL: A Differentially Private SQL Library with a Minimum Frequency Rule
链接: https://arxiv.org/abs/2602.22699
作者: Tomoya Matsumoto,Shokichi Takakura,Shun Takagi,Satoshi Hasegawa
类目: Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)
*备注:
Abstract:SQL is the de facto interface for exploratory data analysis; however, releasing exact query results can expose sensitive information through membership or attribute inference attacks. Differential privacy (DP) provides rigorous privacy guarantees, but in practice, DP alone may not satisfy governance requirements such as the \emphminimum frequency rule, which requires each released group (cell) to include contributions from at least k distinct individuals. In this paper, we present \textbfDPSQL+, a privacy-preserving SQL library that simultaneously enforces user-level (\varepsilon,\delta) -DP and the minimum frequency rule. DPSQL+ adopts a modular architecture consisting of: (i) a \emphValidator that statically restricts queries to a DP-safe subset of SQL; (ii) an \emphAccountant that consistently tracks cumulative privacy loss across multiple queries; and (iii) a \emphBackend that interfaces with various database engines, ensuring portability and extensibility. Experiments on the TPC-H benchmark demonstrate that DPSQL+ achieves practical accuracy across a wide range of analytical workloads – from basic aggregates to quadratic statistics and join operations – and allows substantially more queries under a fixed global privacy budget than prior libraries in our evaluation.
[LG-37] Switch-Hurdle: A MoE Encoder with AR Hurdle Decoder for Intermittent Demand Forecasting
链接: https://arxiv.org/abs/2602.22685
作者: Fabian Muşat,Simona Căbuz
类目: Machine Learning (cs.LG)
*备注:
Abstract:Intermittent demand, a pattern characterized by long sequences of zero sales punctuated by sporadic, non-zero values, poses a persistent challenge in retail and supply chain forecasting. Both traditional methods, such as ARIMA, exponential smoothing, or Croston variants, as well as modern neural architectures such as DeepAR and Transformer-based models often underperform on such data, as they treat demand as a single continuous process or become computationally expensive when scaled across many sparse series. To address these limitations, we introduce Switch-Hurdle: a new framework that integrates a Mixture-of-Experts (MoE) encoder with a Hurdle-based probabilistic decoder. The encoder uses a sparse Top-1 expert routing during the forward pass yet approximately dense in the backward pass via a straight-through estimator (STE). The decoder follows a cross-attention autoregressive design with a shared hurdle head that explicitly separates the forecasting task into two components: a binary classification component estimating the probability of a sale, and a conditional regression component, predicting the quantity given a sale. This structured separation enables the model to capture both occurrence and magnitude processes inherent to intermittent demand. Empirical results on the M5 benchmark and a large proprietary retail dataset show that Switch-Hurdle achieves state-of-the-art prediction performance while maintaining scalability.
[LG-38] Accelerating LLM Pre-Training through Flat-Direction Dynamics Enhancement
链接: https://arxiv.org/abs/2602.22681
作者: Shuchen Zhu,Rizhen Hu,Mingze Wang,Mou Sun,Xue Wang,Kun Yuan,Zaiwen Wen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Pre-training Large Language Models requires immense computational resources, making optimizer efficiency essential. The optimization landscape is highly anisotropic, with loss reduction driven predominantly by progress along flat directions. While matrix-based optimizers such as Muon and SOAP leverage fine-grained curvature information to outperform AdamW, their updates tend toward isotropy – relatively conservative along flat directions yet potentially aggressive along sharp ones. To address this limitation, we first establish a unified Riemannian Ordinary Differential Equation (ODE) framework that elucidates how common adaptive algorithms operate synergistically: the preconditioner induces a Riemannian geometry that mitigates ill-conditioning, while momentum serves as a Riemannian damping term that promotes convergence. Guided by these insights, we propose LITE, a generalized acceleration strategy that enhances training dynamics by applying larger Hessian damping coefficients and learning rates along flat trajectories. Extensive experiments demonstrate that LITE significantly accelerates both Muon and SOAP across diverse architectures (Dense, MoE), parameter scales (130M–1.3B), datasets (C4, Pile), and learning-rate schedules (cosine, warmup-stable-decay). Theoretical analysis confirms that LITE facilitates faster convergence along flat directions in anisotropic landscapes, providing a principled approach to efficient LLM pre-training. The code is available at this https URL.
[LG-39] Forecasting Antimicrobial Resistance Trends Using Machine Learning on WHO GLASS Surveillance Data: A Retrieval-Augmented Generation Approach for Policy Decision Support
链接: https://arxiv.org/abs/2602.22673
作者: Md Tanvir Hasan Turja
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 18 pages, 4 figures, code and data available at this https URL
Abstract:Antimicrobial resistance (AMR) is a growing global crisis projected to cause 10 million deaths per year by 2050. While the WHO Global Antimicrobial Resistance and Use Surveillance System (GLASS) provides standardized surveillance data across 44 countries, few studies have applied machine learning to forecast population-level resistance trends from this data. This paper presents a two-component framework for AMR trend forecasting and evidence-grounded policy decision support. We benchmark six models – Naive, Linear Regression, Ridge Regression, XGBoost, LightGBM, and LSTM – on 5,909 WHO GLASS observations across six WHO regions (2021-2023). XGBoost achieved the best performance with a test MAE of 7.07% and R-squared of 0.854, outperforming the naive baseline by 83.1%. Feature importance analysis identified the prior-year resistance rate as the dominant predictor (50.5% importance), while regional MAE ranged from 4.16% (European Region) to 10.14% (South-East Asia Region). We additionally implemented a Retrieval-Augmented Generation (RAG) pipeline combining a ChromaDB vector store of WHO policy documents with a locally deployed Phi-3 Mini language model, producing source-attributed, hallucination-constrained policy answers. Code and data are available at this https URL
[LG-40] LEDA: Latent Semantic Distribution Alignment for Multi-domain Graph Pre-training WWW-26
链接: https://arxiv.org/abs/2602.22660
作者: Lianze Shan,Jitao Zhao,Dongxiao He,Siqi Liu,Jiaxu Cui,Weixiong Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted by WWW-26, 12 pages, 2 figures
Abstract:Recent advances in generic large models, such as GPT and DeepSeek, have motivated the introduction of universality to graph pre-training, aiming to learn rich and generalizable knowledge across diverse domains using graph representations to improve performance in various downstream applications. However, most existing methods face challenges in learning effective knowledge from generic graphs, primarily due to simplistic data alignment and limited training guidance. The issue of simplistic data alignment arises from the use of a straightforward unification for highly diverse graph data, which fails to align semantics and misleads pre-training models. The problem with limited training guidance lies in the arbitrary application of in-domain pre-training paradigms to cross-domain scenarios. While it is effective in enhancing discriminative representation in one data space, it struggles to capture effective knowledge from many graphs. To address these challenges, we propose a novel Latent sEmantic Distribution Alignment (LEDA) model for universal graph pre-training. Specifically, we first introduce a dimension projection unit to adaptively align diverse domain features into a shared semantic space with minimal information loss. Furthermore, we design a variational semantic inference module to obtain the shared latent distribution. The distribution is then adopted to guide the domain projection, aligning it with shared semantics across domains and ensuring cross-domain semantic learning. LEDA exhibits strong performance across a broad range of graphs and downstream tasks. Remarkably, in few-shot cross-domain settings, it significantly outperforms in-domain baselines and advanced universal pre-training models.
[LG-41] MUG: Meta-path-aware Universal Heterogeneous Graph Pre-Training AAAI-26
链接: https://arxiv.org/abs/2602.22645
作者: Lianze Shan,Jitao Zhao,Dongxiao He,Yongqi Huang,Zhiyong Feng,Weixiong Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI-26, 9 pages, 3 figures
Abstract:Universal graph pre-training has emerged as a key paradigm in graph representation learning, offering a promising way to train encoders to learn transferable representations from unlabeled graphs and to effectively generalize across a wide range of downstream tasks. However, recent explorations in universal graph pre-training primarily focus on homogeneous graphs and it remains unexplored for heterogeneous graphs, which exhibit greater structural and semantic complexity. This heterogeneity makes it highly challenging to train a universal encoder for diverse heterogeneous graphs: (i) the diverse types with dataset-specific semantics hinder the construction of a unified representation space; (ii) the number and semantics of meta-paths vary across datasets, making encoding and aggregation patterns learned from one dataset difficult to apply to others. To address these challenges, we propose a novel Meta-path-aware Universal heterogeneous Graph pre-training (MUG) approach. Specifically, for challenge (i), MUG introduces a input unification module that integrates information from multiple node and relation types within each heterogeneous graph into a unified this http URL representation is then projected into a shared space by a dimension-aware encoder, enabling alignment across graphs with diverse this http URL, for challenge (ii), MUG trains a shared encoder to capture consistent structural patterns across diverse meta-path views rather than relying on dataset-specific aggregation strategies, while a global objective encourages discriminability and reduces dataset-specific biases. Extensive experiments demonstrate the effectiveness of MUG on some real datasets.
[LG-42] Compress the Easy Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning
链接: https://arxiv.org/abs/2602.22642
作者: Qin-Wen Luo,Sheng Ren,Xiang Chen,Rui Liu,Jun Fang,Naiqiang Tan,Sheng-Jun Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Chain-of-Thought (CoT) has substantially empowered Large Language Models (LLMs) to tackle complex reasoning tasks, yet the verbose nature of explicit reasoning steps incurs prohibitive inference latency and computational costs, limiting real-world deployment. While existing compression methods - ranging from self-training to Reinforcement Learning (RL) with length constraints - attempt to mitigate this, they often sacrifice reasoning capability for brevity. We identify a critical failure mode in these approaches: explicitly optimizing for shorter trajectories triggers rapid entropy collapse, which prematurely shrinks the exploration space and stifles the discovery of valid reasoning paths, particularly for challenging questions requiring extensive deduction. To address this issue, we propose Compress responses for Easy questions and Explore Hard ones (CEEH), a difficulty-aware approach to RL-based efficient reasoning. CEEH dynamically assesses instance difficulty to apply selective entropy regularization: it preserves a diverse search space for currently hard questions to ensure robustness, while permitting aggressive compression on easier instances where the reasoning path is well-established. In addition, we introduce a dynamic optimal-length penalty anchored to the historically shortest correct response, which effectively counteracts entropy-induced length inflation and stabilizes the reward signal. Across six reasoning benchmarks, CEEH consistently reduces response length while maintaining accuracy comparable to the base model, and improves Pass@k relative to length-only optimization.
[LG-43] ackling Privacy Heterogeneity in Differentially Private Federated Learning
链接: https://arxiv.org/abs/2602.22633
作者: Ruichen Xu,Ying-Jun Angela Zhang,Jianwei Huang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Differentially private federated learning (DP-FL) enables clients to collaboratively train machine learning models while preserving the privacy of their local data. However, most existing DP-FL approaches assume that all clients share a uniform privacy budget, an assumption that does not hold in real-world scenarios where privacy requirements vary widely. This privacy heterogeneity poses a significant challenge: conventional client selection strategies, which typically rely on data quantity, cannot distinguish between clients providing high-quality updates and those introducing substantial noise due to strict privacy constraints. To address this gap, we present the first systematic study of privacy-aware client selection in DP-FL. We establish a theoretical foundation by deriving a convergence analysis that quantifies the impact of privacy heterogeneity on training error. Building on this analysis, we propose a privacy-aware client selection strategy, formulated as a convex optimization problem, that adaptively adjusts selection probabilities to minimize training error. Extensive experiments on benchmark datasets demonstrate that our approach achieves up to a 10% improvement in test accuracy on CIFAR-10 compared to existing baselines under heterogeneous privacy budgets. These results highlight the importance of incorporating privacy heterogeneity into client selection for practical and effective federated learning.
[LG-44] orchLean: Formalizing Neural Networks in Lean
链接: https://arxiv.org/abs/2602.22631
作者: Robert Joseph George,Jennifer Cruden,Xiangru Zhong,Huan Zhang,Anima Anandkumar
类目: Mathematical Software (cs.MS); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL); Numerical Analysis (math.NA)
*备注: 35 pages, multiple figures and tables
Abstract:Neural networks are increasingly deployed in safety- and mission-critical pipelines, yet many verification and analysis results are produced outside the programming environment that defines and runs the model. This separation creates a semantic gap between the executed network and the analyzed artifact, so guarantees can hinge on implicit conventions such as operator semantics, tensor layouts, preprocessing, and floating-point corner cases. We introduce TorchLean, a framework in the Lean 4 theorem prover that treats learned models as first-class mathematical objects with a single, precise semantics shared by execution and verification. TorchLean unifies (1) a PyTorch-style verified API with eager and compiled modes that lower to a shared op-tagged SSA/DAG computation-graph IR, (2) explicit Float32 semantics via an executable IEEE-754 binary32 kernel and proof-relevant rounding models, and (3) verification via IBP and CROWN/LiRPA-style bound propagation with certificate checking. We validate TorchLean end-to-end on certified robustness, physics-informed residual bounds for PINNs, and Lyapunov-style neural controller verification, alongside mechanized theoretical results including a universal approximation theorem. These results demonstrate a semantics-first infrastructure for fully formal, end-to-end verification of learning-enabled systems.
[LG-45] HyperKKL: Enabling Non-Autonomous State Estimation through Dynamic Weight Conditioning ICLR2026
链接: https://arxiv.org/abs/2602.22630
作者: Yahia Salaheldin Shaaban,Salem Lahlou,Abdelrahman Sayed Sayed
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 18 pages, 6 figures, Under review in ICLR 2026 AI PDE Workshop
Abstract:This paper proposes HyperKKL, a novel learning approach for designing Kazantzis-Kravaris/Luenberger (KKL) observers for non-autonomous nonlinear systems. While KKL observers offer a rigorous theoretical framework by immersing nonlinear dynamics into a stable linear latent space, its practical realization relies on solving Partial Differential Equations (PDE) that are analytically intractable. Current existing learning-based approximations of the KKL observer are mostly designed for autonomous systems, failing to generalize to driven dynamics without expensive retraining or online gradient updates. HyperKKL addresses this by employing a hypernetwork architecture that encodes the exogenous input signal to instantaneously generate the parameters of the KKL observer, effectively learning a family of immersion maps parameterized by the external drive. We rigorously evaluate this approach against a curriculum learning strategy that attempts to generalize from autonomous regimes via training heuristics alone. The novel approach is illustrated on four numerical simulations in benchmark examples including the Duffing, Van der Pol, Lorenz, and Rössler systems.
[LG-46] Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA
链接: https://arxiv.org/abs/2602.22617
作者: Hai Huang,Yann LeCun,Randall Balestriero
类目: Machine Learning (cs.LG)
*备注: 21 pages, 13 figures
Abstract:Large Language Models (LLMs) obey consistent scaling laws – empirical power-law fits that predict how loss decreases with compute, data, and parameters. While predictive, these laws are descriptive rather than prescriptive: they characterize typical training, not optimal training. Surprisingly few works have successfully challenged the data-efficiency bounds implied by these laws – which is our primary focus. To that end, we introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. Building on this principle, we propose a novel Semantic Tube Prediction (STP) task, a JEPA-style regularizer that confines hidden-state trajectories to a tubular neighborhood of the geodesic. STP generalizes JEPA to language without requiring explicit multi-view augmentations. We show this constraint improves signal-to-noise ratio, and consequently preserves diversity by preventing trajectory collisions during inference. Empirically, STP allows LLMs to match baseline accuracy with 16 \times less training data on the NL-RX-SYNTH dataset, directly violating the data term of Chinchilla-style scaling laws and demonstrating that principled geometric priors can surpass brute-force scaling. Code is available at this https URL.
[LG-47] Mitigating Membership Inference in Intermediate Representations via Layer-wise MIA-risk-aware DP-SGD
链接: https://arxiv.org/abs/2602.22611
作者: Jiayang Meng,Tao Huang,Chen Hou,Guolong Zheng,Hong Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:In Embedding-as-an-Interface (EaaI) settings, pre-trained models are queried for Intermediate Representations (IRs). The distributional properties of IRs can leak training-set membership signals, enabling Membership Inference Attacks (MIAs) whose strength varies across layers. Although Differentially Private Stochastic Gradient Descent (DP-SGD) mitigates such leakage, existing implementations employ per-example gradient clipping and a uniform, layer-agnostic noise multiplier, ignoring heterogeneous layer-wise MIA vulnerability. This paper introduces Layer-wise MIA-risk-aware DP-SGD (LM-DP-SGD), which adaptively allocates privacy protection across layers in proportion to their MIA risk. Specifically, LM-DP-SGD trains a shadow model on a public shadow dataset, extracts per-layer IRs from its train/test splits, and fits layer-specific MIA adversaries, using their attack error rates as MIA-risk estimates. Leveraging the cross-dataset transferability of MIAs, these estimates are then used to reweight each layer’s contribution to the globally clipped gradient during private training, providing layer-appropriate protection under a fixed noise magnitude. We further establish theoretical guarantees on both privacy and convergence of LM-DP-SGD. Extensive experiments show that, under the same privacy budget, LM-DP-SGD reduces the peak IR-level MIA risk while preserving utility, yielding a superior privacy-utility trade-off.
[LG-48] EvolveGen: Algorithmic Level Hardware Model Checking Benchmark Generation through Reinforcement Learning
链接: https://arxiv.org/abs/2602.22609
作者: Guangyu Hu,Xiaofeng Zhou,Wei Zhang,Hongce Zhang
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 19 pages, 8 figures. Accepted by TACAS 2026
Abstract:Progress in hardware model checking depends critically on high-quality benchmarks. However, the community faces a significant benchmark gap: existing suites are limited in number, often distributed only in representations such as BTOR2 without access to the originating register-transfer-level (RTL) designs, and biased toward extreme difficulty where instances are either trivial or intractable. These limitations hinder rigorous evaluation of new verification techniques and encourage overfitting of solver heuristics to a narrow set of problems. To address this, we introduce EvolveGen, a framework for generating hardware model checking benchmarks by combining reinforcement learning (RL) with high-level synthesis (HLS). Our approach operates at an algorithmic level of abstraction in which an RL agent learns to construct computation graphs. By compiling these graphs under different synthesis directives, we produce pairs of functionally equivalent but structurally distinct hardware designs, inducing challenging model checking instances. Solver runtime is used as the reward signal, enabling the agent to autonomously discover and generate small-but-hard instances that expose solver-specific weaknesses. Experiments show that EvolveGen efficiently creates a diverse benchmark set in standard formats (e.g., AIGER and BTOR2) and effectively reveals performance bottlenecks in state-of-the-art model checkers.
[LG-49] IBCircuit: Towards Holistic Circuit Discovery with Information Bottleneck
链接: https://arxiv.org/abs/2602.22581
作者: Tian Bian,Yifan Niu,Chaohao Yuan,Chengzhi Piao,Bingzhe Wu,Long-Kai Huang,Yu Rong,Tingyang Xu,Hong Cheng,Jia Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Circuit discovery has recently attracted attention as a potential research direction to explain the non-trivial behaviors of language models. It aims to find the computational subgraphs, also known as circuits, within the model that are responsible for solving specific tasks. However, most existing studies overlook the holistic nature of these circuits and require designing specific corrupted activations for different tasks, which is inaccurate and inefficient. In this work, we propose an end-to-end approach based on the principle of Information Bottleneck, called IBCircuit, to identify informative circuits holistically. IBCircuit is an optimization framework for holistic circuit discovery and can be applied to any given task without tediously corrupted activation design. In both the Indirect Object Identification (IOI) and Greater-Than tasks, IBCircuit identifies more faithful and minimal circuits in terms of critical node components and edge components compared to recent related work.
[LG-50] Multilingual Safety Alignment Via Sparse Weight Editing
链接: https://arxiv.org/abs/2602.22554
作者: Jiaming Liang,Zhaoxin Wang,Handing Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) exhibit significant safety disparities across languages, with low-resource languages (LRLs) often bypassing safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and dependent on scarce multilingual safety data. In this work, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Identifying that safety capabilities are localized within a sparse set of safety neurons, we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint. Extensive experiments across 8 languages and multiple model families (Llama-3, Qwen-2.5) demonstrate that our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities, all achieved with a single, data-efficient calculation.
[LG-51] Relatron: Automating Relational Machine Learning over Relational Databases ICLR2026
链接: https://arxiv.org/abs/2602.22552
作者: Zhikai Chen,Han Xie,Jian Zhang,Jiliang Tang,Xiang Song,Huzefa Rangwala
类目: Machine Learning (cs.LG)
*备注: ICLR 2026
Abstract:Predictive modeling over relational databases (RDBs) powers applications, yet remains challenging due to capturing both cross-table dependencies and complex feature interactions. Relational Deep Learning (RDL) methods automate feature engineering via message passing, while classical approaches like Deep Feature Synthesis (DFS) rely on predefined non-parametric aggregators. Despite performance gains, the comparative advantages of RDL over DFS and the design principles for selecting effective architectures remain poorly understood. We present a comprehensive study that unifies RDL and DFS in a shared design space and conducts architecture-centric searches across diverse RDB tasks. Our analysis yields three key findings: (1) RDL does not consistently outperform DFS, with performance being highly task-dependent; (2) no single architecture dominates across tasks, underscoring the need for task-aware model selection; and (3) validation accuracy is an unreliable guide for architecture choice. This search yields a model performance bank that links architecture configurations to their performance; leveraging this bank, we analyze the drivers of the RDL-DFS performance gap and introduce two task signals – RDB task homophily and an affinity embedding that captures size, path, feature, and temporal structure – whose correlation with the gap enables principled routing. Guided by these signals, we propose Relatron, a task embedding-based meta-selector that chooses between RDL and DFS and prunes the within-family search. Lightweight loss-landscape metrics further guard against brittle checkpoints by preferring flatter optima. In experiments, Relatron resolves the “more tuning, worse performance” effect and, in joint hyperparameter-architecture optimization, achieves up to 18.5% improvement over strong baselines with 10x lower cost than Fisher information-based alternatives.
[LG-52] LUMOS: Democratizing SciML Workflows with L0-Regularized Learning for Unified Feature and Parameter Adaptation
链接: https://arxiv.org/abs/2602.22537
作者: Shouwei Gao,Xu Zheng,Dongsheng Luo,Sheng Di,Wenqian Dong
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapid growth of scientific machine learning (SciML) has accelerated discovery across diverse domains, yet designing effective SciML models remains a challenging task. In practice, building such models often requires substantial prior knowledge and manual expertise, particularly in determining which input features to use and how large the model should be. We introduce LUMOS, an end-to-end framework based on L0-regularized learning that unifies feature selection and model pruning to democratize SciML model design. By employing semi-stochastic gating and reparameterization techniques, LUMOS dynamically selects informative features and prunes redundant parameters during training, reducing the reliance on manual tuning while maintaining predictive accuracy. We evaluate LUMOS across 13 diverse SciML workloads, including cosmology and molecular sciences, and demonstrate its effectiveness and generalizability. Experiments on 13 SciML models show that LUMOS achieves 71.45% parameter reduction and a 6.4x inference speedup on average. Furthermore, Distributed Data Parallel (DDP) training on up to eight GPUs confirms the scalability of
[LG-53] Persistent Nonnegative Matrix Factorization via Multi-Scale Graph Regularization
链接: https://arxiv.org/abs/2602.22536
作者: Jichao Zhang,Ran Miao,Limin Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Matrix factorization techniques, especially Nonnegative Matrix Factorization (NMF), have been widely used for dimensionality reduction and interpretable data representation. However, existing NMF-based methods are inherently single-scale and fail to capture the evolution of connectivity structures across resolutions. In this work, we propose persistent nonnegative matrix factorization (pNMF), a scale-parameterized family of NMF problems, that produces a sequence of persistence-aligned embeddings rather than a single one. By leveraging persistent homology, we identify a canonical minimal sufficient scale set at which the underlying connectivity undergoes qualitative changes. These canonical scales induce a sequence of graph Laplacians, leading to a coupled NMF formulation with scale-wise geometric regularization and explicit cross-scale consistency constraint. We analyze the structural properties of the embeddings along the scale parameter and establish bounds on their increments between consecutive scales. The resulting model defines a nontrivial solution path across scales, rather than a single factorization, which poses new computational challenges. We develop a sequential alternating optimization algorithm with guaranteed convergence. Numerical experiments on synthetic and single-cell RNA sequencing datasets demonstrate the effectiveness of the proposed approach in multi-scale low-rank embeddings.
[LG-54] Coarse-to-Fine Learning of Dynamic Causal Structures ICLR2026
链接: https://arxiv.org/abs/2602.22532
作者: Dezhi Yang,Qiaoyu Tan,Carlotta Domeniconi,Jun Wang,Lizhen Cui,Guoxian Yu
类目: Machine Learning (cs.LG)
*备注: Accepted by ICLR2026
Abstract:Learning the dynamic causal structure of time series is a challenging problem. Most existing approaches rely on distributional or structural invariance to uncover underlying causal dynamics, assuming stationary or partially stationary causality. However, these assumptions often conflict with the complex, time-varying causal relationships observed in real-world systems. This motivates the need for methods that address fully dynamic causality, where both instantaneous and lagged dependencies evolve over time. Such a setting poses significant challenges for the efficiency and stability of causal discovery. To address these challenges, we introduce DyCausal, a dynamic causal structure learning framework. DyCausal leverages convolutional networks to capture causal patterns within coarse-grained time windows, and then applies linear interpolation to refine causal structures at each time step, thereby recovering fine-grained and time-varying causal graphs. In addition, we propose an acyclic constraint based on matrix norm scaling, which improves efficiency while effectively constraining loops in evolving causal structures. Comprehensive evaluations on both synthetic and real-world datasets demonstrate that DyCausal achieves superior performance compared to existing methods, offering a stable and efficient approach for identifying fully dynamic causal structures from coarse to fine.
[LG-55] EFL: Prediction-Residual-Guided Rolling Forecasting for Multi-Horizon Time Series
链接: https://arxiv.org/abs/2602.22520
作者: Xiannan Huang,Shen Fang,Shuhan Qiu,Chengcheng Yu,Jiayuan Du,Chao Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series forecasting plays a critical role in domains such as transportation, energy, and meteorology. Despite their success, modern deep forecasting models are typically trained to minimize point-wise prediction loss without leveraging the rich information contained in past prediction residuals from rolling forecasts - residuals that reflect persistent biases, unmodeled patterns, or evolving dynamics. We propose TEFL (Temporal Error Feedback Learning), a unified learning framework that explicitly incorporates these historical residuals into the forecasting pipeline during both training and evaluation. To make this practical in deep multi-step settings, we address three key challenges: (1) selecting observable multi-step residuals under the partial observability of rolling forecasts, (2) integrating them through a lightweight low-rank adapter to preserve efficiency and prevent overfitting, and (3) designing a two-stage training procedure that jointly optimizes the base forecaster and error module. Extensive experiments across 10 real-world datasets and 5 backbone architectures show that TEFL consistently improves accuracy, reducing MAE by 5-10% on average. Moreover, it demonstrates strong robustness under abrupt changes and distribution shifts, with error reductions exceeding 10% (up to 19.5%) in challenging scenarios. By embedding residual-based feedback directly into the learning process, TEFL offers a simple, general, and effective enhancement to modern deep forecasting systems.
[LG-56] Sharp Convergence Rates for Masked Diffusion Models
链接: https://arxiv.org/abs/2602.22505
作者: Yuchen Liang,Zhiheng Tan,Ness Shroff,Yingbin Liang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, with masked (absorbing-rate) variants emerging as competitive alternatives to autoregressive models. Among existing samplers, the Euler method remains the standard choice in many applications, and more recently, the First-Hitting Sampler (FHS) has shown considerable promise for masked diffusion models. Despite their practical success, the theoretical understanding of these samplers remains limited. Existing analyses are conducted in Kullback-Leibler (KL) divergence, which often yields loose parameter dependencies and requires strong assumptions on score estimation. Moreover, these guarantees do not cover recently developed high-performance sampler of FHS. In this work, we first develop a direct total-variation (TV) based analysis for the Euler method that overcomes these limitations. Our results relax assumptions on score estimation, improve parameter dependencies, and establish convergence guarantees without requiring any surrogate initialization. Also for this setting, we provide the first convergence lower bound for the Euler sampler, establishing tightness with respect to both the data dimension d and the target accuracy \varepsilon . Finally, we analyze the FHS sampler and show that it incurs no sampling error beyond that induced by score estimation, which we show to be tight with a matching lower error bound. Overall, our analysis introduces a direct TV-based error decomposition along the CTMC trajectory and a decoupling-based path-wise analysis for FHS, which may be of independent interest.
[LG-57] Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns
链接: https://arxiv.org/abs/2602.22479
作者: Afshin Khadangi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Continual learning is a core requirement for deployed language models, yet standard training and fine-tuning pipelines remain brittle under non-stationary data. Online updates often induce catastrophic forgetting, while methods that improve stability frequently increase latency, memory footprint, or dense computation in ways that do not scale well to long contexts. We introduce TRC ^2 (Thalamically Routed Cortical Columns), a decoder-only backbone that addresses continual learning at the architectural level. TRC ^2 combines sparse thalamic routing over cortical columns with mechanisms for modulation, prediction, memory, and feedback, together with a fast corrective pathway that supports rapid adaptation without destabilizing slower parameters. The resulting block is sparse and chunk-parallel, enabling efficient training and inference while preserving clean ablations of each subsystem. We instantiate a reproducible training and evaluation stack and a continual-learning harness that measures proxy forgetting under streaming domain shifts. Across language modeling and continual learning benchmarks, TRC ^2 improves the stability-plasticity tradeoff at comparable compute, enabling rapid on-stream adaptation while preserving previously acquired behavior.
[LG-58] When to Act Ask or Learn: Uncertainty-Aware Policy Steering
链接: https://arxiv.org/abs/2602.22474
作者: Jessie Yuan,Yilin Wu,Andrea Bajcsy
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Policy steering is an emerging way to adapt robot behaviors at deployment-time: a learned verifier analyzes low-level action samples proposed by a pre-trained policy (e.g., diffusion policy) and selects only those aligned with the task. While Vision-Language Models (VLMs) are promising general-purpose verifiers due to their reasoning capabilities, existing frameworks often assume these models are well-calibrated. In practice, the overconfident judgment from VLM can degrade the steering performance under both high-level semantic uncertainty in task specifications and low-level action uncertainty or incapability of the pre-trained policy. We propose uncertainty-aware policy steering (UPS), a framework that jointly reasons about semantic task uncertainty and low-level action feasibility, and selects an uncertainty resolution strategy: execute a high-confidence action, clarify task ambiguity via natural language queries, or ask for action interventions to correct the low-level policy when it is deemed incapable at the task. We leverage conformal prediction to calibrate the composition of the VLM and the pre-trained base policy, providing statistical assurances that the verifier selects the correct strategy. After collecting interventions during deployment, we employ residual learning to improve the capability of the pre-trained policy, enabling the system to learn continually but with minimal expensive human feedback. We demonstrate our framework through experiments in simulation and on hardware, showing that UPS can disentangle confident, ambiguous, and incapable scenarios and minimizes expensive user interventions compared to uncalibrated baselines and prior human- or robot-gated continual learning approaches. Videos can be found at this https URL
[LG-59] Beyond performance-wise Contribution Evaluation in Federated Learning
链接: https://arxiv.org/abs/2602.22470
作者: Balazs Pejo
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Federated learning offers a privacy-friendly collaborative learning framework, yet its success, like any joint venture, hinges on the contributions of its participants. Existing client evaluation methods predominantly focus on model performance, such as accuracy or loss, which represents only one dimension of a machine learning model’s overall utility. In contrast, this work investigates the critical, yet overlooked, issue of client contributions towards a model’s trustworthiness – specifically, its reliability (tolerance to noisy data), resilience (resistance to adversarial examples), and fairness (measured via demographic parity). To quantify these multifaceted contributions, we employ the state-of-the-art approximation of the Shapley value, a principled method for value attribution. Our results reveal that no single client excels across all dimensions, which are largely independent from each other, highlighting a critical flaw in current evaluation scheme: no single metric is adequate for comprehensive evaluation and equitable rewarding allocation.
[LG-60] mmWave Radar Aware Dual-Conditioned GAN for Speech Reconstruction of Signals With Low SNR INTERSPEECH2026
链接: https://arxiv.org/abs/2602.22431
作者: Jash Karani,Adithya Chittem,Deepan Roy,Sandeep Joshi
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Under review at Interspeech 2026
Abstract:Millimeter-wave (mmWave) radar captures are band-limited and noisy, making for difficult reconstruction of intelligible full-bandwidth speech. In this work, we propose a two-stage speech reconstruction pipeline for mmWave using a Radar-Aware Dual-conditioned Generative Adversarial Network (RAD-GAN), which is capable of performing bandwidth extension on signals with low signal-to-noise ratios (-5 dB to -1 dB), captured through glass walls. We propose an mmWave-tailored Multi-Mel Discriminator (MMD) and a Residual Fusion Gate (RFG) to enhance the generator input to process multiple conditioning channels. The proposed two-stage pipeline involves pretraining the model on synthetically clipped clean speech and finetuning on fused mel spectrograms generated by the RFG. We empirically show that the proposed method, trained on a limited dataset, with no pre-trained modules, and no data augmentations, outperformed state-of-the-art approaches for this specific task. Audio examples of RAD-GAN are available online at this https URL.
[LG-61] opoEdit: Fast Post-Optimization Editing of Topology Optimized Structures
链接: https://arxiv.org/abs/2602.22430
作者: Hongrui Chen,Josephine V. Carstensen,Faez Ahmed
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注:
Abstract:Despite topology optimization producing high-performance structures, late-stage localized revisions remain brittle: direct density-space edits (e.g., warping pixels, inserting holes, swapping infill) can sever load paths and sharply degrade compliance, while re-running optimization is slow and may drift toward a qualitatively different design. We present TopoEdit, a fast post-optimization editor that demonstrates how structured latent embeddings from a pre-trained topology foundation model (OAT) can be repurposed as an interface for physics-aware engineering edits. Given an optimized topology, TopoEdit encodes it into OAT’s spatial latent, applies partial noising to preserve instance identity while increasing editability, and injects user intent through an edit-then-denoise diffusion pipeline. We instantiate three edit operators: drag-based topology warping with boundary-condition-consistent conditioning updates, shell-infill lattice replacement using a lattice-anchored reference latent with updated volume-fraction conditioning, and late-stage no-design region enforcement via masked latent overwrite followed by diffusion-based recovery. A consistency-preserving guided DDIM procedure localizes changes while allowing global structural adaptation; multiple candidates can be sampled and selected using a compliance-aware criterion, with optional short SIMP refinement for warps. Across diverse case studies and large edit sweeps, TopoEdit produces intention-aligned modifications that better preserve mechanical performance and avoid catastrophic failure modes compared to direct density-space edits, while generating edited candidates in sub-second diffusion time per sample.
[LG-62] Predicting Multi-Drug Resistance in Bacterial Isolates Through Performance Comparison and LIME-based Interpretation of Classification Models
链接: https://arxiv.org/abs/2602.22400
作者: Santanam Wishal,Riad Sahara
类目: Machine Learning (cs.LG)
*备注: 6 pages, 7 figures
Abstract:The rise of Antimicrobial Resistance, particularly Multi-Drug Resistance (MDR), presents a critical challenge for clinical decision-making due to limited treatment options and delays in conventional susceptibility testing. This study proposes an interpretable machine learning framework to predict MDR in bacterial isolates using clinical features and antibiotic susceptibility patterns. Five classification models were evaluated, including Logistic Regression, Random Forest, AdaBoost, XGBoost, and LightGBM. The models were trained on a curated dataset of 9,714 isolates, with resistance encoded at the antibiotic family level to capture cross-class resistance patterns consistent with MDR definitions. Performance assessment included accuracy, F1-score, AUC-ROC, and Matthews Correlation Coefficient. Ensemble models, particularly XGBoost and LightGBM, demonstrated superior predictive capability across all metrics. To address the clinical transparency gap, Local Interpretable Model-agnostic Explanations (LIME) was applied to generate instance-level explanations. LIME identified resistance to quinolones, Co-trimoxazole, Colistin, aminoglycosides, and Furanes as the strongest contributors to MDR predictions, aligning with known biological mechanisms. The results show that combining high-performing models with local interpretability provides both accuracy and actionable insights for antimicrobial stewardship. This framework supports earlier MDR identification and enhances trust in machine learning-assisted clinical decision support.
[LG-63] Disentangling Shared and Target-Enriched Topics via Background-Contrastive Non-negative Matrix Factorization
链接: https://arxiv.org/abs/2602.22387
作者: Yixuan Li,Archer Y. Yang,Yue Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Biological signals of interest in high-dimensional data are often masked by dominant variation shared across conditions. This variation, arising from baseline biological structure or technical effects, can prevent standard dimensionality reduction methods from resolving condition-specific structure. The challenge is that these confounding topics are often unknown and mixed with biological signals. Existing background correction methods are either unscalable to high dimensions or not interpretable. We introduce background contrastive Non-negative Matrix Factorization (\model), which extracts target-enriched latent topics by jointly factorizing a target dataset and a matched background using shared non-negative bases under a contrastive objective that suppresses background-expressed structure. This approach yields non-negative components that are directly interpretable at the feature level, and explicitly isolates target-specific variation. \model is learned by an efficient multiplicative update algorithm via matrix multiplication such that it is highly efficient on GPU hardware and scalable to big data via minibatch training akin to deep learning approach. Across simulations and diverse biological datasets, \model reveals signals obscured by conventional methods, including disease-associated programs in postmortem depressive brain single-cell RNA-seq, genotype-linked protein expression patterns in mice, treatment-specific transcriptional changes in leukemia, and TP53-dependent drug responses in cancer cell lines.
[LG-64] stable Learning of General Halfspaces under Massart Noise
链接: https://arxiv.org/abs/2602.22300
作者: Ilias Diakonikolas,Giannis Iakovidis,Daniel M. Kane,Sihan Liu
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:We study the algorithmic task of testably learning general Massart halfspaces under the Gaussian distribution. In the testable learning setting, the aim is the design of a tester-learner pair satisfying the following properties: (1) if the tester accepts, the learner outputs a hypothesis and a certificate that it achieves near-optimal error, and (2) it is highly unlikely that the tester rejects if the data satisfies the underlying assumptions. Our main result is the first testable learning algorithm for general halfspaces with Massart noise and Gaussian marginals. The complexity of our algorithm is d^\mathrmpolylog(\min\1/\gamma, 1/\epsilon ) , where \epsilon is the excess error and \gamma is the bias of the target halfspace, which qualitatively matches the known quasi-polynomial Statistical Query lower bound for the non-testable setting. The analysis of our algorithm hinges on a novel sandwiching polynomial approximation to the sign function with multiplicative error that may be of broader interest.
[LG-65] When Should a Model Change Its Mind? An Energy-Based Theory and Regularizer for Concept Drift in Electrocardiogram (ECG) Signals
链接: https://arxiv.org/abs/2602.22294
作者: Timothy Oladunni,Blessing Ojeme,Kyndal Maclin,Clyde Baidoo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Models operating on dynamic physiologic signals must distinguish benign, label-preserving variability from true concept change. Existing concept-drift frameworks are largely distributional and provide no principled guidance on how much a model’s internal representation may move when the underlying signal undergoes physiologically plausible fluctuations in energy. As a result, deep models often misinterpret harmless changes in amplitude, rate, or morphology as concept drift, yielding unstable predictions, particularly in multimodal fusion settings. This study introduces Physiologic Energy Conservation Theory (PECT), an energy-based framework for concept stability in dynamic signals. PECT posits that under virtual drift, normalized latent displacement should scale proportionally with normalized signal energy change, while persistent violations of this proportionality indicate real concept drift. We operationalize this principle through Energy-Constrained Representation Learning (ECRL), a lightweight regularizer that penalizes energy-inconsistent latent movement without modifying encoder architectures or adding inference-time cost. Although PECT is formulated for dynamic signals in general, we instantiate and evaluate it on multimodal ECG across seven unimodal and hybrid models. Experiments show that in the strongest trimodal hybrid (1D+2D+Transformer), clean accuracy is largely preserved (96.0% to 94.1%), while perturbed accuracy improves substantially (72.6% to 85.5%) and fused representation drift decreases by over 45%. Similar trends are observed across all architectures, providing empirical evidence that PECT functions as an energy-drift law governing concept stability in continuous physiologic signals. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2602.22294 [cs.LG] (or arXiv:2602.22294v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.22294 Focus to learn more arXiv-issued DOI via DataCite
[LG-66] Global River Forecasting with a Topology-Informed AI Foundation Model
链接: https://arxiv.org/abs/2602.22293
作者: Hancheng Ren,Gang Zhao,Shuo Wang,Louise Slater,Dai Yamazaki,Shu Liu,Jingfang Fan,Shibo Cui,Ziming Yu,Shengyu Kang,Depeng Zuo,Dingzhi Peng,Zongxue Xu,Bo Pang
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注: 26 pages, 5 figures, 3 extended data tables, 3 extended data figures
Abstract:River systems operate as inherently interconnected continuous networks, meaning river hydrodynamic simulation ought to be a systemic process. However, widespread hydrology data scarcity often restricts data-driven forecasting to isolated predictions. To achieve systemic simulation and reduce reliance on river observations, we present GraphRiverCast (GRC), a topology-informed AI foundation model designed to simulate multivariate river hydrodynamics in global river systems. GRC is capable of operating in a “ColdStart” mode, generating predictions without relying on historical river states for initialization. In 7-day global pseudo-hindcasts, GRC-ColdStart functions as a robust standalone simulator, achieving a Nash-Sutcliffe Efficiency (NSE) of approximately 0.82 without exhibiting the significant error accumulation typical of autoregressive paradigms. Ablation studies reveal that topological encoding serves as indispensable structural information in the absence of historical states, explicitly guiding hydraulic connectivity and network-scale mass redistribution to reconstruct flow dynamics. Furthermore, when adapted locally via a pre-training and fine-tuning strategy, GRC consistently outperforms physics-based and locally-trained AI baselines. Crucially, this superiority extends from gauged reaches to full river networks, underscoring the necessity of topology encoding and physics-based pre-training. Built on a physics-aligned neural operator architecture, GRC enables rapid and cross-scale adaptive simulation, establishing a collaborative paradigm bridging global hydrodynamic knowledge with local hydrological reality.
[LG-67] Reliable XAI Explanations in Sudden Cardiac Death Prediction for Chagas Cardiomyopathy
链接: https://arxiv.org/abs/2602.22288
作者: Vinícius P. Chagas,Luiz H. T. Viana,Mac M. da S. Carlos,João P. V. Madeiro,Roberto C. Pedrosa,Thiago Alves Rocha,Carlos H. L. Cavalcante
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: Preprint. For the final published version, see the DOI below
Abstract:Sudden cardiac death (SCD) is unpredictable, and its prediction in Chagas cardiomyopathy (CC) remains a significant challenge, especially in patients not classified as high risk. While AI and machine learning models improve risk stratification, their adoption is hindered by a lack of transparency, as they are often perceived as \textitblack boxes with unclear decision-making processes. Some approaches apply heuristic explanations without correctness guarantees, leading to mistakes in the decision-making process. To address this, we apply a logic-based explainability method with correctness guarantees to the problem of SCD prediction in CC. This explainability method, applied to an AI classifier with over 95% accuracy and recall, demonstrated strong predictive performance and 100% explanation fidelity. When compared to state-of-the-art heuristic methods, it showed superior consistency and robustness. This approach enhances clinical trust, facilitates the integration of AI-driven tools into practice, and promotes large-scale deployment, particularly in endemic regions where it is most needed.
[LG-68] OmniZip: Learning a Unified and Lightweight Lossless Compressor for Multi-Modal Data
链接: https://arxiv.org/abs/2602.22286
作者: Yan Zhao,Zhengxue Cheng,Junxuan Zhang,Dajiang Zhou,Qunshan Gu,Qi Wang,Li Song
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 8 figures, 10 tables
Abstract:Lossless compression is essential for efficient data storage and transmission. Although learning-based lossless compressors achieve strong results, most of them are designed for a single modality, leading to redundant compressor deployments in multi-modal settings. Designing a unified multi-modal compressor is critical yet challenging, as different data types vary largely in format, dimension, and statistics. Multi-modal large language models offer a promising resolution but remain too complex for practical use. Thus, we propose \textbfOmniZip, \textbfa unified and lightweight lossless compressor for multi-modal data (like image, text, speech, tactile, database, and gene sequence). Built on a lightweight backbone, OmniZip incorporates three key components to enable efficient multi-modal lossless compression: a modality-unified tokenizer that reversibly transforms diverse data into tokens, a modality-routing context learning mechanism that enables flexible multi-modal context modeling, and a modality-routing feedforward design that further enhances the model’s nonlinear representation flexibility. A reparameterization training strategy is used to enhance model capacity. OmniZip outperforms or matches other state-of-the-art compressors on multiple modalities, achieving 42%, 57%, 62% and 42%, 53% higher compression efficiency than gzip on CLIC-M, TouchandGo, enwik9, LibriSpeech, and WikiSQL datasets, respectively. It also supports near real-time inference on resource-constrained edge devices, reaching about 1MB/s on MacBook CPUs and iPhone NPUs. Our code is released at this https URL.
[LG-69] BrepCoder: A Unified Multimodal Large Language Model for Multi-task B-rep Reasoning
链接: https://arxiv.org/abs/2602.22284
作者: Mingi Kim,Yongjun Kim,Jungwoo Kang,Hyungki Kim
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent advancements in deep learning have actively addressed complex challenges within the Computer-Aided Design (CAD) this http URL, most existing approaches rely on task-specifi c models requiring structural modifi cations for new tasks, and they predominantly focus on point clouds or images rather than the industry-standard Boundary Representation (B-rep) format. To address these limitations, we propose BrepCoder, a unifi ed Multimodal Large Language Model (MLLM) that performs diverse CAD tasks from B-rep inputs. By leveraging the code generation capabilities of Large Language Models (LLMs), we convert CAD modeling sequences into Python-like code and align them with B-rep. We then adopt a two-stage training strategy: First, pre-training on reverse engineering to learn geometric features and design logic. Second, eff ectively extending the model to various downstream tasks such as completion, error correction, and CAD-QA. Consequently, by interpreting B-rep as structural code, BrepCoder achieves superior generalization across diverse tasks, demonstrating its potential as a general-purpose CAD agent.
[LG-70] Differentially Private Truncation of Unbounded Data via Public Second Moments
链接: https://arxiv.org/abs/2602.22282
作者: Zilong Cao,Xuan Bi,Hai Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Data privacy is important in the AI era, and differential privacy (DP) is one of the golden solutions. However, DP is typically applicable only if data have a bounded underlying distribution. We address this limitation by leveraging second-moment information from a small amount of public data. We propose Public-moment-guided Truncation (PMT), which transforms private data using the public second-moment matrix and applies a principled truncation whose radius depends only on non-private quantities: data dimension and sample size. This transformation yields a well-conditioned second-moment matrix, enabling its inversion with a significantly strengthened ability to resist the DP noise. Furthermore, we demonstrate the applicability of PMT by using penalized and generalized linear regressions. Specifically, we design new loss functions and algorithms, ensuring that solutions in the transformed space can be mapped back to the original domain. We have established improvements in the models’ DP estimation through theoretical error bounds, robustness guarantees, and convergence results, attributing the gains to the conditioning effect of PMT. Experiments on synthetic and real datasets confirm that PMT substantially improves the accuracy and stability of DP models.
[LG-71] X-REFINE: XAI-based RElevance input-Filtering and archItecture fiNe-tuning for channel Estimation
链接: https://arxiv.org/abs/2602.22277
作者: Abdul Karim Gizzini,Yahia Medjahdi
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: This paper has been accepted for publication in the IEEE Transactions on Vehicular Technology (TVT) as a correspondence paper
Abstract:AI-native architectures are vital for 6G wireless communications. The black-box nature and high complexity of deep learning models employed in critical applications, such as channel estimation, limit their practical deployment. While perturbation-based XAI solutions offer input filtering, they often neglect internal structural optimization. We propose X-REFINE, an XAI-based framework for joint input-filtering and architecture fine-tuning. By utilizing a decomposition-based, sign-stabilized LRP epsilon rule, X-REFINE backpropagates predictions to derive high-resolution relevance scores for both subcarriers and hidden neurons. This enables a holistic optimization that identifies the most faithful model components. Simulation results demonstrate that X-REFINE achieves a superior interpretability-performance-complexity trade-off, significantly reducing computational complexity while maintaining robust bit error rate (BER) performance across different scenarios.
[LG-72] Support Tokens Stability Margins and a New Foundation for Robust LLM s
链接: https://arxiv.org/abs/2602.22271
作者: Deepak Agarwal,Dhyey Dharmendrakumar Mavani,Suyash Gupta,Karthik Sethuraman,Tejas Dharamsi
类目: Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 39 pages, 6 figures
Abstract:Self-attention is usually described as a flexible, content-adaptive way to mix a token with information from its past. We re-interpret causal self-attention transformers, the backbone of modern foundation models, within a probabilistic framework, much like how classical PCA is extended to probabilistic PCA. However, this re-formulation reveals a surprising and deeper structural insight: due to a change-of-variables phenomenon, a barrier constraint emerges on the self-attention parameters. This induces a highly structured geometry on the token space, providing theoretical insights into the dynamics of LLM decoding. This reveals a boundary where attention becomes ill-conditioned, leading to a margin interpretation similar to classical support vector machines. Just like support vectors, this naturally gives rise to the concept of support tokens. Furthermore, we show that LLMs can be interpreted as a stochastic process over the power set of the token space, providing a rigorous probabilistic framework for sequence modeling. We propose a Bayesian framework and derive a MAP estimation objective that requires only a minimal modification to standard LLM training: the addition of a smooth log-barrier penalty to the usual cross-entropy loss. We demonstrate that this provides more robust models without sacrificing out-of-sample accuracy and that it is straightforward to incorporate in practice. Comments: 39 pages, 6 figures Subjects: Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST) Cite as: arXiv:2602.22271 [cs.LG] (or arXiv:2602.22271v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.22271 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-73] Prior Knowledge-enhanced Spatio-temporal Epidemic Forecasting
链接: https://arxiv.org/abs/2602.22270
作者: Sijie Ruan(1),Jinyu Li(1),Jia Wei(1),Zenghao Xu(2),Jie Bao(3),Junshi Xu(4),Junyang Qiu(5),Hanning Yuan(1),Xiaoxiao Wang(2),Shuliang Wang(1) ((1) Beijing Institute of Technology, China, (2) Zhejiang Center for Disease Control and Prevention, China, (3) JD Technology, China, (4) The University of Hong Kong, Hong Kong SAR, China, (5) China Mobile Internet, China)
类目: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*备注: 12 pages, 10 figures
Abstract:Spatio-temporal epidemic forecasting is critical for public health management, yet existing methods often struggle with insensitivity to weak epidemic signals, over-simplified spatial relations, and unstable parameter estimation. To address these challenges, we propose the Spatio-Temporal priOr-aware Epidemic Predictor (STOEP), a novel hybrid framework that integrates implicit spatio-temporal priors and explicit expert priors. STOEP consists of three key components: (1) Case-aware Adjacency Learning (CAL), which dynamically adjusts mobility-based regional dependencies using historical infection patterns; (2) Space-informed Parameter Estimating (SPE), which employs learnable spatial priors to amplify weak epidemic signals; and (3) Filter-based Mechanistic Forecasting (FMF), which uses an expert-guided adaptive thresholding strategy to regularize epidemic parameters. Extensive experiments on real-world COVID-19 and influenza datasets demonstrate that STOEP outperforms the best baseline by 11.1% in RMSE. The system has been deployed at one provincial CDC in China to facilitate downstream applications.
[LG-74] CQSA: Byzantine-robust Clustered Quantum Secure Aggregation in Federated Learning
链接: https://arxiv.org/abs/2602.22269
作者: Arnab Nath,Harsh Kasyap
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures
Abstract:Federated Learning (FL) enables collaborative model training without sharing raw data. However, shared local model updates remain vulnerable to inference and poisoning attacks. Secure aggregation schemes have been proposed to mitigate these attacks. In this work, we aim to understand how these techniques are implemented in quantum-assisted FL. Quantum Secure Aggregation (QSA) has been proposed, offering information-theoretic privacy by encoding client updates into the global phase of multipartite entangled states. Existing QSA protocols, however, rely on a single global Greenberger-Horne-Zeilinger (GHZ) state shared among all participating clients. This design poses fundamental challenges: fidelity of large-scale GHZ states deteriorates rapidly with the increasing number of clients; and (ii) the global aggregation prevents the detection of Byzantine clients. We propose Clustered Quantum Secure Aggregation (CQSA), a modular aggregation framework that reconciles the physical constraints of near-term quantum hardware along with the need for Byzantine-robustness in FL. CQSA randomly partitions the clients into small clusters, each performing local quantum aggregation using high-fidelity, low-qubit GHZ states. The server analyzes statistical relationships between cluster-level aggregates employing common statistical measures such as cosine similarity and Euclidean distance to identify malicious contributions. Through theoretical analysis and simulations under depolarizing noise, we demonstrate that CQSA ensures stable model convergence, achieves superior state fidelity over global QSA.
[LG-75] AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning
链接: https://arxiv.org/abs/2602.22268
作者: Changhai Zhou,Shiyang Zhang,Yuhua Zhou,Qian Qiao,Jun Gao,Cheng Jin,Kaizhou Qin,Weizhong Zhang
类目: Machine Learning (cs.LG)
*备注: 15 pages, 10 figures
Abstract:Quantization followed by parameter-efficient fine-tuning has emerged as a promising paradigm for downstream adaptation under tight GPU memory constraints. However, this sequential pipeline fails to leverage the intricate interaction between quantization bit-width and LoRA rank. Specifically, a carefully optimized quantization allocation with low quantization error does not always translate to strong fine-tuning performance, and different bit-width and rank configurations can lead to significantly varying outcomes under the same memory budget. To address this limitation, we propose AutoQRA, a joint optimization framework that simultaneously optimizes the bit-width and LoRA rank configuration for each layer during the mixed quantized fine-tuning process. To tackle the challenges posed by the large discrete search space and the high evaluation cost associated with frequent fine-tuning iterations, AutoQRA decomposes the optimization process into two stages. First, it first conducts a global multi-fidelity evolutionary search, where the initial population is warm-started by injecting layer-wise importance priors. This stage employs specific operators and a performance model to efficiently screen candidate configurations. Second, trust-region Bayesian optimization is applied to locally refine promising regions of the search space and identify optimal configurations under the given memory budget. This approach enables active compensation for quantization noise in specific layers during training. Experiments show that AutoQRA achieves performance close to full-precision fine-tuning with a memory footprint comparable to uniform 4-bit methods.
[LG-76] Data-Driven Supervision of a Thermal-Hydraulic Process Towards a Physics-Based Digital Twin
链接: https://arxiv.org/abs/2602.22267
作者: Osimone Imhogiemhe(LS2N, LS2N - équipe SIMS, Nantes Univ - ECN),Yoann Jus(Cetim),Hubert Lejeune(Cetim),Saïd Moussaoui(LS2N, LS2N - équipe SIMS, Nantes Univ - ECN)
类目: Machine Learning (cs.LG)
*备注:
Abstract:The real-time supervision of production processes is a common challenge across several industries. It targets process component monitoring and its predictive maintenance in order to ensure safety, uninterrupted production and maintain high efficiency level. The rise of advanced tools for the simulation of physical systems in addition to data-driven machine learning models offers the possibility to design numerical tools dedicated to efficient system monitoring. In that respect, the digital twin concept presents an adequate framework that proffers solution to these challenges. The main purpose of this paper is to develop such a digital twin dedicated to fault detection and diagnosis in the context of a thermal-hydraulic process supervision. Based on a numerical simulation of the system, in addition to machine learning methods, we propose different modules dedicated to process parameter change detection and their on-line estimation. The proposed fault detection and diagnosis algorithm is validated on a specific test scenario, with single one-off parameter change occurrences in the system. The numerical results show good accuracy in terms of parameter variation localization and the update of their values.
[LG-77] WaveSSM: Multiscale State-Space Models for Non-stationary Signal Attention
链接: https://arxiv.org/abs/2602.22266
作者: Ruben Solozabal,Velibor Bojkovic,Hilal Alquabeh,Klea Ziu,Kentaro Inui,Martin Takac
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注:
Abstract:State-space models (SSMs) have emerged as a powerful foundation for long-range sequence modeling, with the HiPPO framework showing that continuous-time projection operators can be used to derive stable, memory-efficient dynamical systems that encode the past history of the input signal. However, existing projection-based SSMs often rely on polynomial bases with global temporal support, whose inductive biases are poorly matched to signals exhibiting localized or transient structure. In this work, we introduce \emphWaveSSM, a collection of SSMs constructed over wavelet frames. Our key observation is that wavelet frames yield a localized support on the temporal dimension, useful for tasks requiring precise localization. Empirically, we show that on equal conditions, \textitWaveSSM outperforms orthogonal counterparts as S4 on real-world datasets with transient dynamics, including physiological signals on the PTB-XL dataset and raw audio on Speech Commands.
[LG-78] Sustainable LLM Inference using Context-Aware Model Switching
链接: https://arxiv.org/abs/2602.22261
作者: Yuvarani,Akashdeep Singh,Zahra Fathanah,Salsabila Harlen,Syeikha Syafura Al-Zahra binti Zahari,Hema Subramaniam
类目: Machine Learning (cs.LG)
*备注: 15 pages, 6 figures
Abstract:Large language models have become central to many AI applications, but their growing energy consumption raises serious sustainability concerns. A key limitation in current AI deployments is the reliance on a one-size-fits-all inference strategy where most systems route every request to the same large model, regardless of task complexity, leading to substantial and unnecessary energy waste. To address this issue, we propose a context-aware model switching approach that dynamically selects an appropriate language model based on query complexity. The proposed system uses a Context-Aware Model Switching for Energy-Efficient LLM Inference that combines caching for repeated queries, rulebased complexity scoring for fast and explainable decisions, machine learning classification to capture semantic intent, and a user-adaptive component that learns from interaction patterns over time. The proposed architecture was evaluated using real conversation workloads and three open-source language models (Gemma3 1B, Gemma3 4B and Qwen3 4B) with different computational costs, measuring energy consumption (via NVML GPU power telemetry), response latency, routing accuracy, and output quality (BERTScore F1) to reflect real-world usage conditions. Experimental results show that the model switching approach can reduce energy consumption by up to 67.5% compared to always using the largest model while maintaining a response quality of 93.6%. In addition, the response time for simple queries also improved significantly by approximately 68%. These results show that model switching inference offers a practical and scalable path toward more energy-efficient and sustainable AI systems, demonstrating that significant efficiency gains can be achieved without major sacrifices in response quality.
[LG-79] Code World Models for Parameter Control in Evolutionary Algorithms
链接: https://arxiv.org/abs/2602.22260
作者: Camilo Chacón Sartori,Guillem Rodríguez Corominas
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:Can an LLM learn how an optimizer behaves – and use that knowledge to control it? We extend Code World Models (CWMs), LLM-synthesized Python programs that predict environment dynamics, from deterministic games to stochastic combinatorial optimization. Given suboptimal trajectories of (1+1) - \textRLS_k , the LLM synthesizes a simulator of the optimizer’s dynamics; greedy planning over this simulator then selects the mutation strength k at each step. On \lo and \onemax, CWM-greedy performs within 6% of the theoretically optimal policy – without ever seeing optimal-policy trajectories. On \jump _k , where a deceptive valley causes all adaptive baselines to fail (0% success rate), CWM-greedy achieves 100% success rate – without any collection policy using oracle knowledge of the gap parameter. On the NK-Landscape, where no closed-form model exists, CWM-greedy outperforms all baselines across fifteen independently generated instances ( 36.94 vs.\ 36.32 ; p0.001 ) when the prompt includes empirical transition statistics. The CWM also outperforms DQN in sample efficiency (200 offline trajectories vs.\ 500 online episodes), success rate (100% vs.\ 58%), and generalization ( k=3 : 78% vs.\ 0%). Robustness experiments confirm stable synthesis across 5 independent runs.
[LG-80] Orthogonal Weight Modification Enhances Learning Scalability and Convergence Efficiency without Gradient Backpropagation
链接: https://arxiv.org/abs/2602.22259
作者: Guoqing Ma,Shan Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recognizing the substantial computational cost of backpropagation (BP), non-BP methods have emerged as attractive alternatives for efficient learning on emerging neuromorphic systems. However, existing non-BP approaches still face critical challenges in efficiency and scalability. Inspired by neural representations and dynamic mechanisms in the brain, we propose a perturbation-based approach called LOw-rank Cluster Orthogonal (LOCO) weight modification. We find that low-rank is an inherent property of perturbation-based algorithms. Under this condition, the orthogonality constraint limits the variance of the node perturbation (NP) gradient estimates and enhances the convergence efficiency. Through extensive evaluations on multiple datasets, LOCO demonstrates the capability to locally train the deepest spiking neural networks to date (more than 10 layers), while exhibiting strong continual learning ability, improved convergence efficiency, and better task performance compared to other brain-inspired non-BP algorithms. Notably, LOCO requires only O(1) parallel time complexity for weight updates, which is significantly lower than that of BP methods. This offers a promising direction for achieving high-performance, real-time, and lifelong learning on neuromorphic systems.
[LG-81] Improving Spatial Allocation for Energy System Coupling with Graph Neural Networks
链接: https://arxiv.org/abs/2602.22249
作者: Xuanhao Mu,Jakob Geiges,Nan Liu,Thorsten Schlachter,Veit Hagenmeyer
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:In energy system analysis, coupling models with mismatched spatial resolutions is a significant challenge. A common solution is assigning weights to high-resolution geographic units for aggregation, but traditional models are limited by using only a single geospatial attribute. This paper presents an innovative method employing a self-supervised Heterogeneous Graph Neural Network to address this issue. This method models high-resolution geographic units as graph nodes, integrating various geographical features to generate physically meaningful weights for each grid point. These weights enhance the conventional Voronoi-based allocation method, allowing it to go beyond simply geographic proximity by incorporating essential geographic this http URL addition, the self-supervised learning paradigm overcomes the lack of accurate ground-truth data. Experimental results demonstrate that applying weights generated by this method to cluster-based Voronoi Diagrams significantly enhances scalability, accuracy, and physical plausibility, while increasing precision compared to traditional methods.
[LG-82] Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models
链接: https://arxiv.org/abs/2602.22246
作者: Guangnian Wan,Qi Li,Gongfan Fang,Xinyin Ma,Xinchao Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Multimodal Diffusion Language Models (MDLMs) have recently emerged as a competitive alternative to their autoregressive counterparts. Yet their vulnerability to backdoor attacks remains largely unexplored. In this work, we show that well-established data-poisoning pipelines can successfully implant backdoors into MDLMs, enabling attackers to manipulate model behavior via specific triggers while maintaining normal performance on clean inputs. However, defense strategies effective to these models are yet to emerge. To bridge this gap, we introduce a backdoor defense framework for MDLMs named DiSP (Diffusion Self-Purification). DiSP is driven by a key observation: selectively masking certain vision tokens at inference time can neutralize a backdoored model’s trigger-induced behaviors and restore normal functionality. Building on this, we purify the poisoned dataset using the compromised model itself, then fine-tune the model on the purified data to recover it to a clean one. Given such a specific design, DiSP can remove backdoors without requiring any auxiliary models or clean reference data. Extensive experiments demonstrate that our approach effectively mitigates backdoor effects, reducing the attack success rate (ASR) from over 90% to typically under 5%, while maintaining model performance on benign tasks.
[LG-83] Patient-Centered Graph-Augmented Artificial Intelligence-Enabled Passive Surveillance for Early Stroke Risk Detection in High-Risk Individuals
链接: https://arxiv.org/abs/2602.22228
作者: Jiyeong Kim,Stephen P. Ma,Nirali Vora,Nicholas W. Larsen,Julia Adler-Milstein,Jonathan H. Chen,Selen Bozkurt,Abeed Sarker,Juhee Cho,Jindeok Joo,Natali Pageler,Fatima Rodriguez,Christopher Sharp,Eleni Linos
类目: Machine Learning (cs.LG)
*备注:
Abstract:Stroke affected millions annually, yet poor symptom recognition often delayed care-seeking. To address risk recognition gap, we developed a passive surveillance system for early stroke risk detection using patient-reported symptoms among individuals with diabetes. Constructing a symptom taxonomy grounded in patients own language and a dual machine learning pipeline (heterogeneous GNN and EN/LASSO), we identified symptom patterns associated with subsequent stroke. We translated findings into a hybrid risk screening system integrating symptom relevance and temporal proximity, evaluated across 3-90 day windows through EHR-based simulations. Under conservative thresholds, intentionally designed to minimize false alerts, the screening system achieved high specificity (1.00) and prevalence-adjusted positive predictive value (1.00), with good sensitivity (0.72), an expected trade-off prioritizing precision, that was highest in 90-day window. Patient-reported language alone supported high-precision, low-burden early stroke risk detection, that could offer a valuable time window for clinical evaluation and intervention for high-risk individuals.
[LG-84] Deep ensemble graph neural networks for probabilistic cosmic-ray direction and energy reconstruction in autonomous radio arrays
链接: https://arxiv.org/abs/2602.23321
作者: Arsène Ferrière,Aurélien Benoit-Lévy,Olivier Martineau-Huynh,Matías Tueros
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Submitted to Astroparticle Physics Journal
Abstract:Using advanced machine learning techniques, we developed a method for reconstructing precisely the arrival direction and energy of ultra-high-energy cosmic rays from the voltage traces they induced on ground-based radio detector arrays. In our approach, triggered antennas are represented as a graph structure, which serves as input for a graph neural network (GNN). By incorporating physical knowledge into both the GNN architecture and the input data, we improve the precision and reduce the required size of the training set with respect to a fully data-driven approach. This method achieves an angular resolution of 0.092° and an electromagnetic energy reconstruction resolution of 16.4% on simulated data with realistic noise conditions. We also employ uncertainty estimation methods to enhance the reliability of our predictions, quantifying the confidence of the GNN’s outputs and providing confidence intervals for both direction and energy reconstruction. Finally, we investigate strategies to verify the model’s consistency and robustness under real life variations, with the goal of identifying scenarios in which predictions remain reliable despite domain shifts between simulation and reality. Comments: Submitted to Astroparticle Physics Journal Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG) Cite as: arXiv:2602.23321 [astro-ph.IM] (or arXiv:2602.23321v1 [astro-ph.IM] for this version) https://doi.org/10.48550/arXiv.2602.23321 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-85] Q-Tag: Watermarking Quantum Circuit Generative Models
链接: https://arxiv.org/abs/2602.23085
作者: Yang Yang,Yuzhu Long,Han Fang,Zhaoyun Chen,Zhonghui Li,Weiming Zhang,Guoping Guo
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures
Abstract:Quantum cloud platforms have become the most widely adopted and mainstream approach for accessing quantum computing resources, due to the scarcity and operational complexity of quantum hardware. In this service-oriented paradigm, quantum circuits, which constitute high-value intellectual property, are exposed to risks of unauthorized access, reuse, and misuse. Digital watermarking has been explored as a promising mechanism for protecting quantum circuits by embedding ownership information for tracing and verification. However, driven by recent advances in generative artificial intelligence, the paradigm of quantum circuit design is shifting from individually and manually constructed circuits to automated synthesis based on quantum circuit generative models (QCGMs). In such generative settings, protecting only individual output circuits is insufficient, and existing post hoc, circuit-centric watermarking methods are not designed to integrate with the generative process, often failing to simultaneously ensure stealthiness, functional correctness, and robustness at scale. These limitations highlight the need for a new watermarking paradigm that is natively integrated with quantum circuit generative models. In this work, we present the first watermarking framework for QCGMs, which embeds ownership signals into the generation process while preserving circuit fidelity. We introduce a symmetric sampling strategy that aligns watermark encoding with the model’s Gaussian prior, and a synchronization mechanism that counteracts adversarial watermark attack through latent drift correction. Empirical results confirm that our method achieves high-fidelity circuit generation and robust watermark detection across a range of perturbations, paving the way for scalable, secure copyright protection in AI-powered quantum design.
[LG-86] Low-degree Lower bounds for clustering in moderate dimension
链接: https://arxiv.org/abs/2602.23023
作者: Alexandra Carpentier,Nicolas Verzelen
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:
Abstract:We study the fundamental problem of clustering n points into K groups drawn from a mixture of isotropic Gaussians in \mathbbR^d . Specifically, we investigate the requisite minimal distance \Delta between mean vectors to partially recover the underlying partition. While the minimax-optimal threshold for \Delta is well-established, a significant gap exists between this information-theoretic limit and the performance of known polynomial-time procedures. Although this gap was recently characterized in the high-dimensional regime ( n \leq dK ), it remains largely unexplored in the moderate-dimensional regime ( n \geq dK ). In this manuscript, we address this regime by establishing a new low-degree polynomial lower bound for the moderate-dimensional case when d \geq K . We show that while the difficulty of clustering for n \leq dK is primarily driven by dimension reduction and spectral methods, the moderate-dimensional regime involves more delicate phenomena leading to a “non-parametric rate”. We provide a novel non-spectral algorithm matching this rate, shedding new light on the computational limits of the clustering problem in moderate dimension.
[LG-87] Regular Fourier Features for Nonstationary Gaussian Processes
链接: https://arxiv.org/abs/2602.23006
作者: Arsalan Jawaid,Abdullah Karatas,Jörg Seewig
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures
Abstract:Simulating a Gaussian process requires sampling from a high-dimensional Gaussian distribution, which scales cubically with the number of sample locations. Spectral methods address this challenge by exploiting the Fourier representation, treating the spectral density as a probability distribution for Monte Carlo approximation. Although this probabilistic interpretation works for stationary processes, it is overly restrictive for the nonstationary case, where spectral densities are generally not probability measures. We propose regular Fourier features for harmonizable processes that avoid this limitation. Our method discretizes the spectral representation directly, preserving the correlation structure among spectral weights without requiring probability assumptions. Under a finite spectral support assumption, this yields an efficient low-rank approximation that is positive semi-definite by construction. When the spectral density is unknown, the framework extends naturally to kernel learning from data. We demonstrate the method on locally stationary kernels and on harmonizable mixture kernels with complex-valued spectral densities.
[LG-88] Kernel Integrated R2: A Measure of Dependence
链接: https://arxiv.org/abs/2602.22985
作者: Pouya Roudaki,Shakeel Gavioli-Akilagun,Florian Kalinke,Mona Azadkia,Zoltán Szabó
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:We introduce kernel integrated R^2 , a new measure of statistical dependence that combines the local normalization principle of the recently introduced integrated R^2 with the flexibility of reproducing kernel Hilbert spaces (RKHSs). The proposed measure extends integrated R^2 from scalar responses to responses taking values on general spaces equipped with a characteristic kernel, allowing to measure dependence of multivariate, functional, and structured data, while remaining sensitive to tail behaviour and oscillatory dependence structures. We establish that (i) this new measure takes values in [0,1] , (ii) equals zero if and only if independence holds, and (iii) equals one if and only if the response is almost surely a measurable function of the covariates. Two estimators are proposed: a graph-based method using K -nearest neighbours and an RKHS-based method built on conditional mean embeddings. We prove consistency and derive convergence rates for the graph-based estimator, showing its adaptation to intrinsic dimensionality. Numerical experiments on simulated data and a real data experiment in the context of dependency testing for media annotations demonstrate competitive power against state-of-the-art dependence measures, particularly in settings involving non-linear and structured relationships.
[LG-89] Beyond NNGP: Large Deviations and Feature Learning in Bayesian Neural Networks
链接: https://arxiv.org/abs/2602.22925
作者: Katerina Papagiannouli,Dario Trevisan,Giuseppe Pio Zitto
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study wide Bayesian neural networks focusing on the rare but statistically dominant fluctuations that govern posterior concentration, beyond Gaussian-process limits. Large-deviation theory provides explicit variational objectives-rate functions-on predictors, providing an emerging notion of complexity and feature learning directly at the functional level. We show that the posterior output rate function is obtained by a joint optimization over predictors and internal kernels, in contrast with fixed-kernel (NNGP) theory. Numerical experiments demonstrate that the resulting predictions accurately describe finite-width behavior for moderately sized networks, capturing non-Gaussian tails, posterior deformation, and data-dependent kernel selection effects.
[LG-90] SPD Learn: A Geometric Deep Learning Python Library for Neural Decoding Through Trivialization
链接: https://arxiv.org/abs/2602.22895
作者: Bruno Aristimunha,Ce Ju,Antoine Collas,Florent Bouchard,Ammar Mian,Bertrand Thirion,Sylvain Chevallier,Reinmar Kobler
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 9 Pages
Abstract:Implementations of symmetric positive definite (SPD) matrix-based neural networks for neural decoding remain fragmented across research codebases and Python packages. Existing implementations often employ ad hoc handling of manifold constraints and non-unified training setups, which hinders reproducibility and integration into modern deep-learning workflows. To address this gap, we introduce SPD Learn, a unified and modular Python package for geometric deep learning with SPD matrices. SPD Learn provides core SPD operators and neural-network layers, including numerically stable spectral operators, and enforces Stiefel/SPD constraints via trivialization-based parameterizations. This design enables standard backpropagation and optimization in unconstrained Euclidean spaces while producing manifold-constrained parameters by construction. The package also offers reference implementations of representative SPDNet-based models and interfaces with widely used brain computer interface/neuroimaging toolkits and modern machine-learning libraries (e.g., MOABB, Braindecode, Nilearn, and SKADA), facilitating reproducible benchmarking and practical deployment.
[LG-91] Unsupervised Continual Learning for Amortized Bayesian Inference
链接: https://arxiv.org/abs/2602.22884
作者: Aayush Mishra,Šimon Kucharský,Paul-Christian Bürkner
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Amortized Bayesian Inference (ABI) enables efficient posterior estimation using generative neural networks trained on simulated data, but often suffers from performance degradation under model misspecification. While self-consistency (SC) training on unlabeled empirical data can enhance network robustness, current approaches are limited to static, single-task settings and fail to handle sequentially arriving data or distribution shifts. We propose a continual learning framework for ABI that decouples simulation-based pre-training from unsupervised sequential SC fine-tuning on real-world data. To address the challenge of catastrophic forgetting, we introduce two adaptation strategies: (1) SC with episodic replay, utilizing a memory buffer of past observations, and (2) SC with elastic weight consolidation, which regularizes updates to preserve task-critical parameters. Across three diverse case studies, our methods significantly mitigate forgetting and yield posterior estimates that outperform standard simulation-based training, achieving estimates closer to MCMC reference, providing a viable path for trustworthy ABI across a range of different tasks.
[LG-92] Advancing accelerator virtual beam diagnostics through latent evolution modeling: an integrated solution to forward inverse tuning and UQ problems
链接: https://arxiv.org/abs/2602.22618
作者: Mahindra Rautela,Alexander Scheinker
类目: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
*备注:
Abstract:Virtual beam diagnostics relies on computationally intensive beam dynamics simulations where high-dimensional charged particle beams evolve through the accelerator. We propose Latent Evolution Model (LEM), a hybrid machine learning framework with an autoencoder that projects high-dimensional phase spaces into lower-dimensional representations, coupled with transformers to learn temporal dynamics in the latent space. This approach provides a common foundational framework addressing multiple interconnected challenges in beam diagnostics. For \textitforward modeling, a Conditional Variational Autoencoder (CVAE) encodes 15 unique projections of the 6D phase space into a latent representation, while a transformer predicts downstream latent states from upstream inputs. For \textitinverse problems, we address two distinct challenges: (a) predicting upstream phase spaces from downstream observations by utilizing the same CVAE architecture with transformers trained on reversed temporal sequences along with aleatoric uncertainty quantification, and (b) estimating RF settings from the latent space of the trained LEM using a dedicated dense neural network that maps latent representations to RF parameters. For \textittuning problems, we leverage the trained LEM and RF estimator within a Bayesian optimization framework to determine optimal RF settings that minimize beam loss. This paper summarizes our recent efforts and demonstrates how this unified approach effectively addresses these traditionally separate challenges.
[LG-93] A Fast and Practical Column Generation Approach for Identifying Carcinogenic Multi-Hit Gene Combinations
链接: https://arxiv.org/abs/2602.22551
作者: Rick S. H. Willemsen,Tenindra Abeywickrama,Ramu Anandakrishnan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Cancer is often driven by specific combinations of an estimated two to nine gene mutations, known as multi-hit combinations. Identifying these combinations is critical for understanding carcinogenesis and designing targeted therapies. We formalise this challenge as the Multi-Hit Cancer Driver Set Cover Problem (MHCDSCP), a binary classification problem that selects gene combinations to maximise coverage of tumor samples while minimising coverage of normal samples. Existing approaches typically rely on exhaustive search and supercomputing infrastructure. In this paper, we present constraint programming and mixed integer programming formulations of the MHCDSCP. Evaluated on real-world cancer genomics data, our methods achieve performance comparable to state-of-the-art methods while running on a single commodity CPU in under a minute. Furthermore, we introduce a column generation heuristic capable of solving small instances to optimality. These results suggest that solving the MHCDSCP is less computationally intensive than previously believed, thereby opening research directions for exploring modelling assumptions.
[LG-94] A Synergistic Approach: Dynamics-AI Ensemble in Tropical Cyclone Forecasting
链接: https://arxiv.org/abs/2602.22533
作者: Yonghui Li,Wansuo Duan,Hao Li,Wei Han,Han Zhang,Yinuo Li
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:This study addresses a critical challenge in AI-based weather forecasting by developing an AI-driven optimized ensemble forecast system using Orthogonal Conditional Nonlinear Optimal Perturbations (O-CNOPs). The system bridges the gap between computational efficiency and dynamic consistency in tropical cyclone (TC) forecasting. Unlike conventional ensembles limited by computational costs or AI ensembles constrained by inadequate perturbation methods, O-CNOPs generate dynamically optimized perturbations that capture fast-growing errors of FuXi model while maintaining plausibility. The key innovation lies in producing orthogonal perturbations that respect FuXi nonlinear dynamics, yielding structures reflecting dominant dynamical controls and physically interpretable probabilistic forecasts. Demonstrating superior deterministic and probabilistic skills over the operational Integrated Forecasting System Ensemble Prediction System, this work establishes a new paradigm combining AI computational advantages with rigorous dynamical constraints. Success in TC track forecasting paves the way for reliable ensemble forecasts of other high-impact weather systems, marking a major step toward operational AI-based ensemble forecasting.
[LG-95] Flow Matching is Adaptive to Manifold Structures
链接: https://arxiv.org/abs/2602.22486
作者: Shivam Kumar,Yixin Wang,Lizhen Lin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Flow matching has emerged as a simulation-free alternative to diffusion-based generative modeling, producing samples by solving an ODE whose time-dependent velocity field is learned along an interpolation between a simple source distribution (e.g., a standard normal) and a target data distribution. Flow-based methods often exhibit greater training stability and have achieved strong empirical performance in high-dimensional settings where data concentrate near a low-dimensional manifold, such as text-to-image synthesis, video generation, and molecular structure generation. Despite this success, existing theoretical analyses of flow matching assume target distributions with smooth, full-dimensional densities, leaving its effectiveness in manifold-supported settings largely unexplained. To this end, we theoretically analyze flow matching with linear interpolation when the target distribution is supported on a smooth manifold. We establish a non-asymptotic convergence guarantee for the learned velocity field, and then propagate this estimation error through the ODE to obtain statistical consistency of the implicit density estimator induced by the flow-matching objective. The resulting convergence rate is near minimax-optimal, depends only on the intrinsic dimension, and reflects the smoothness of both the manifold and the target distribution. Together, these results provide a principled explanation for how flow matching adapts to intrinsic data geometry and circumvents the curse of dimensionality.
[LG-96] LoBoost: Fast Model-Native Local Conformal Prediction for Gradient-Boosted Trees
链接: https://arxiv.org/abs/2602.22432
作者: Vagner Santos,Victor Coscrato,Luben Cabezas,Rafael Izbicki,Thiago Ramos
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Gradient-boosted decision trees are among the strongest off-the-shelf predictors for tabular regression, but point predictions alone do not quantify uncertainty. Conformal prediction provides distribution-free marginal coverage, yet split conformal uses a single global residual quantile and can be poorly adaptive under heteroscedasticity. Methods that improve adaptivity typically fit auxiliary nuisance models or introduce additional data splits/partitions to learn the conformal score, increasing cost and reducing data efficiency. We propose LoBoost, a model-native local conformal method that reuses the fitted ensemble’s leaf structure to define multiscale calibration groups. Each input is encoded by its sequence of visited leaves; at resolution level k, we group points by matching prefixes of leaf indices across the first k trees and calibrate residual quantiles within each group. LoBoost requires no retraining, auxiliary models, or extra splitting beyond the standard train/calibration split. Experiments show competitive interval quality, improved test MSE on most datasets, and large calibration speedups.
[LG-97] What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses
链接: https://arxiv.org/abs/2602.22289
作者: Ihor Kendiukhov
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:
Abstract:When biological foundation models such as scGPT and Geneformer process single-cell gene expression, what geometric and topological structure forms in their internal representations? Is that structure biologically meaningful or a training artifact, and how confident should we be in such claims? We address these questions through autonomous large-scale hypothesis screening: an AI-driven executor-brainstormer loop that proposed, tested, and refined 141 geometric and topological hypotheses across 52 iterations, covering persistent homology, manifold distances, cross-model alignment, community structure, and directed topology, all with explicit null controls and disjoint gene-pool splits. Three principal findings emerge. First, the models learn genuine geometric structure. Gene embedding neighborhoods exhibit non-trivial topology, with persistent homology significant in 11 of 12 transformer layers at p 0.05 in the weakest domain and 12 of 12 in the other two. A multi-level distance hierarchy shows that manifold-aware metrics outperform Euclidean distance for identifying regulatory gene pairs, and graph community partitions track known transcription factor target relationships. Second, this structure is shared across independently trained models. CCA alignment between scGPT and Geneformer yields canonical correlation of 0.80 and gene retrieval accuracy of 72 percent, yet none of 19 tested methods reliably recover gene-level correspondences. The models agree on the global shape of gene space but not on precise gene placement. Third, the structure is more localized than it first appears. Under stringent null controls applied across all null families, robust signal concentrates in immune tissue, while lung and external lung signals weaken substantially. Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Genomics (q-bio.GN) Cite as: arXiv:2602.22289 [q-bio.QM] (or arXiv:2602.22289v1 [q-bio.QM] for this version) https://doi.org/10.48550/arXiv.2602.22289 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-98] Deep Accurate Solver for the Geodesic Problem
链接: https://arxiv.org/abs/2602.22275
作者: Saar Huberman,Amit Bracha,Ron Kimmel
类目: Image and Video Processing (eess.IV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Extended version of Deep Accurate Solver for the Geodesic Problem originally published in Scale Space and Variational Methods in Computer Vision (SSVM 2023), Lecture Notes in Computer Science, Springer. This version includes additional experiments and detailed analysis
Abstract:A common approach to compute distances on continuous surfaces is by considering a discretized polygonal mesh approximating the surface and estimating distances on the polygon. We show that exact geodesic distances restricted to the polygon are at most second-order accurate with respect to the distances on the corresponding continuous surface. By order of accuracy we refer to the convergence rate as a function of the average distance between sampled points. Next, a higher-order accurate deep learning method for computing geodesic distances on surfaces is introduced. Traditionally, one considers two main components when computing distances on surfaces: a numerical solver that locally approximates the distance function, and an efficient causal ordering scheme by which surface points are updated. Classical minimal path methods often exploit a dynamic programming principle with quasi-linear computational complexity in the number of sampled points. The quality of the distance approximation is determined by the local solver that is revisited in this paper. To improve state of the art accuracy, we consider a neural network-based local solver which implicitly approximates the structure of the continuous surface. We supply numerical evidence that the proposed learned update scheme provides better accuracy compared to the best possible polyhedral approximations and previous learning-based methods. The result is a third-order accurate solver with a bootstrapping-recipe for further improvement.
[LG-99] Stochastic Neural Networks for Quantum Devices
链接: https://arxiv.org/abs/2602.22241
作者: Bodo Rosenhahn,Tobias J. Osborne,Christoph Hirche
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 15 pages
Abstract:This work presents a formulation to express and optimize stochastic neural networks as quantum circuits in gate-based quantum computing. Motivated by a classical perceptron, stochastic neurons are introduced and combined into a quantum neural network. The Kiefer-Wolfowitz algorithm in combination with simulated annealing is used for training the network weights. Several topologies and models are presented, including shallow fully connected networks, Hopfield Networks, Restricted Boltzmann Machines, Autoencoders and convolutional neural networks. We also demonstrate the combination of our optimized neural networks as an oracle for the Grover algorithm to realize a quantum generative AI model.
[LG-100] VAE-MS: An Asymmetric Variational Autoencoder for Mutational Signature Extraction
链接: https://arxiv.org/abs/2602.22239
作者: Ida Egendal,Rasmus Froberg Brøndum,Dan J Woodcock,Christopher Yau,Martin Bøgsted
类目: Applications (stat.AP); Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: Keywords: Variational Autoencoders, Mutational Signatures
Abstract:Mutational signature analysis has emerged as a powerful method for uncovering the underlying biological processes driving cancer development. However, the signature extraction process, typically performed using non-negative matrix factorization (NMF), often lacks reliability and clinical applicability. To address these limitations, several solutions have been introduced, including the use of neural networks to achieve more accurate estimates and probabilistic methods to better capture natural variation in the data. In this work, we introduce a Variational Autoencoder for Mutational Signatures (VAE-MS), a novel model that leverages both an asymmetric architecture and probabilistic methods for the extraction of mutational signatures. VAE-MS is compared to with three state-of-the-art models for mutational signature extraction: SigProfilerExtractor, the NMF-based gold standard; MUSE-XAE, an autoencoder that employs an asymmetric design without probabilistic components; and SigneR, a Bayesian NMF model, to illustrate the strength in combining a nonlinear extraction with a probabilistic model. In the ability to reconstruct input data and generalize to unseen data, models with probabilistic components (VAE-MS, SigneR) dramatically outperformed models without (SigProfilerExtractor, MUSE-XAE). The NMF-baed models (SigneR, SigProfilerExtractor) had the most accurate reconstructions in simulated data, while VAE-MS reconstructed more accurately on real cancer data. Upon evaluating the ability to extract signatures consistently, no model exhibited a clear advantage over the others. Software for VAE-MS is available at this https URL.
[LG-101] Solving stiff dark matter equations via Jacobian Normalization with Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2602.21988
作者: M. P. Bento,H. B. Câmara,J. R. Rocha,J. F. Seabra
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
*备注: 16 LaTeX pages; 6 figures
Abstract:Stiff differential equations pose a major challenge for Physics-Informed Neural Networks (PINNs), often causing poor convergence. We propose a simple, hyperparameter-free method to address stiffness by normalizing loss residuals with the Jacobian. We provide theoretical indications that Jacobian-based normalization can improve gradient descent and validate it on benchmark stiff ordinary differential equations. We then apply it to a realistic system: the stiff Boltzmann equations (BEs) governing weakly interacting massive particle (WIMP) dark matter (DM). Our approach achieves higher accuracy than attention mechanisms previously proposed for handling stiffness, recovering the full solution where prior methods fail. This is further demonstrated in an inverse problem with a single experimental data point - the observed DM relic density - where our inverse PINNs correctly infer the cross section that solves the BEs in both Standard and alternative cosmologies.
附件下载






